Open Table Formats

A data lake that stores raw Parquet files in S3 is just organized file storage. Without a table format layer, there are no ACID transaction guarantees, no consistent view of the data for concurrent readers and writers, no schema enforcement, and no efficient mechanism for time travel or rollback. Open table formats are the specification layer that adds these capabilities on top of the raw data files without moving the files into a proprietary system.

The three major open table formats are Apache Iceberg, Delta Lake, and Apache Hudi. All three solve the same core problem: giving object-storage data files the reliability properties of a traditional database. Each took a different architectural approach and ended up with different strengths.

The Problem They Solve

Before open table formats, data lakes suffered from predictable problems. Without transaction control, a writer crashing midway through an update left the table in a partially written, inconsistent state. Without a versioning mechanism, there was no way to query the data as it existed at a point in the past. Schema changes required either full table rewrites or careful backward-compatibility management by hand. Partition management required analysts to know the physical partition structure and include it in their queries to avoid full scans.

Each of these problems has a real operational cost. Inconsistent table state after failed writes causes downstream pipeline failures. Lack of time travel makes debugging data quality issues harder. Manual schema management creates fragile pipelines that break on upstream changes. Open table formats address all of these through metadata management above the Parquet layer.
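To make the metadata idea concrete, here is a minimal sketch in plain Python, not the API of any real table format. It assumes a toy model in which a table is a list of immutable snapshots plus one "current" pointer; the names ToyTable, Snapshot, commit, scan, and rollback are hypothetical. It shows why a writer that crashes before commit leaves readers unaffected, and why time travel and rollback become cheap once every version is kept as metadata.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: tuple  # data file paths visible in this version


@dataclass
class ToyTable:
    # Hypothetical in-memory stand-in for table metadata; snapshot 0 is the empty table.
    snapshots: list = field(default_factory=lambda: [Snapshot(0, ())])
    current: int = 0  # the single pointer that readers follow

    def commit(self, new_files, fail_before_commit=False):
        # Step 1: data files are "written" but invisible until a snapshot references them.
        staged = tuple(new_files)
        if fail_before_commit:
            raise RuntimeError("writer crashed before commit")
        # Step 2: publish a new snapshot and move the pointer.
        base = self.snapshots[self.current]
        snap = Snapshot(len(self.snapshots), base.data_files + staged)
        self.snapshots.append(snap)
        self.current = snap.snapshot_id
        return snap.snapshot_id

    def scan(self, snapshot_id=None):
        # Time travel: read any retained snapshot instead of the current one.
        sid = self.current if snapshot_id is None else snapshot_id
        return list(self.snapshots[sid].data_files)

    def rollback(self, snapshot_id):
        # Rollback only moves the pointer; no data files are rewritten.
        self.current = snapshot_id


table = ToyTable()
v1 = table.commit(["a.parquet"])
try:
    table.commit(["b.parquet"], fail_before_commit=True)
except RuntimeError:
    pass  # the failed write never became visible
v2 = table.commit(["c.parquet"])
print(table.scan())    # current version
print(table.scan(v1))  # the table as of the first commit
```

Real formats perform the pointer move with an atomic operation in a catalog or object store and add conflict detection for concurrent writers; the sketch leaves both out.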

The Three Main Implementations

Apache Iceberg is governed by the Apache Software Foundation. Its architecture uses a hierarchical metadata tree: snapshots point to manifest lists, which point to manifest files, which list data files. It was designed for engine neutrality from the start, and as of 2024-2025 it is supported by more engines than any other format, including Spark, Flink, Dremio, Trino, Snowflake, BigQuery, and others. Iceberg's hidden partitioning and partition evolution features are significant usability advantages, because queries no longer need to reference the physical partition layout.

Delta Lake is governed by the Linux Foundation, with Databricks as the primary contributor. It uses a transaction log (the _delta_log directory) of JSON commit files and Parquet checkpoint files to track table state. It integrates most deeply with Apache Spark and the Databricks platform. The UniForm feature generates Iceberg metadata alongside the Delta log so that Iceberg-compatible engines can read Delta tables.

Apache Hudi is governed by the Apache Software Foundation. It has the strongest support for record-level upserts and deletes, making it a common choice for Change Data Capture pipelines that need to apply individual record changes at high frequency. Hudi's timeline-based architecture and flexible indexing strategies are distinct from Iceberg's snapshot approach.
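The difference between Iceberg's metadata tree and Delta Lake's transaction log can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not either format's real on-disk layout: the file names are hypothetical, real Iceberg manifests are Avro files with per-file statistics, and real Delta log entries carry many more fields than the "add" and "remove" paths used here.

```python
import json

# Iceberg-style: snapshot -> manifest list -> manifest files -> data files.
# All names below are made up for illustration.
manifests = {
    "m1.avro": ["data/a.parquet", "data/b.parquet"],
    "m2.avro": ["data/c.parquet"],
}
manifest_lists = {
    "snap-1.avro": ["m1.avro"],
    "snap-2.avro": ["m1.avro", "m2.avro"],  # reuses m1 unchanged
}
snapshots = {1: "snap-1.avro", 2: "snap-2.avro"}


def iceberg_files(snapshot_id):
    """Walk the tree from one snapshot down to its data files."""
    files = []
    for manifest in manifest_lists[snapshots[snapshot_id]]:
        files.extend(manifests[manifest])
    return files


# Delta-style: an ordered log of commits, each a set of newline-delimited
# JSON actions. Table state is the result of replaying the log in order.
delta_log = [
    '{"add": {"path": "data/a.parquet"}}\n{"add": {"path": "data/b.parquet"}}',
    '{"remove": {"path": "data/a.parquet"}}\n{"add": {"path": "data/c.parquet"}}',
]


def delta_files(version):
    """Replay commits 0..version and return the live data files."""
    live = set()
    for commit in delta_log[: version + 1]:
        for line in commit.splitlines():
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return sorted(live)


print(iceberg_files(2))
print(delta_files(1))
```

In both designs an older version stays readable: the tree keeps earlier snapshots, and the log can be replayed up to an earlier commit. Checkpoint files exist in Delta Lake so that readers do not have to replay the whole log from the first commit.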

The Convergence Trend

By 2025, Apache Iceberg had achieved broad enough adoption that Delta Lake and Hudi both added compatibility layers for reading Iceberg-format tables or exposing their tables through Iceberg-compatible metadata. The industry is converging on Iceberg's catalog interface (the Iceberg REST Catalog specification) as the standard for engine-to-catalog communication, even for tables stored in other formats.
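As a rough illustration of what "engine-to-catalog communication" means in practice, the sketch below builds the URL an engine would request to load a table's metadata. It is based on my reading of the Iceberg REST Catalog OpenAPI specification, in which the load-table route has the shape /v1/{prefix}/namespaces/{namespace}/tables/{table} and multi-level namespaces are joined with the unit separator character (0x1F). The function name load_table_url and the host catalog.example.com are hypothetical, and the route should be checked against the spec version a given catalog implements.

```python
from urllib.parse import quote


def load_table_url(base_uri, namespace, table, prefix=None):
    """Build a load-table URL for an Iceberg REST catalog (illustrative only).

    namespace is a list of levels, e.g. ["analytics", "sales"].
    prefix is the optional catalog-specific path segment.
    """
    # Namespace levels are joined with 0x1F, then percent-encoded.
    encoded_ns = quote("\x1f".join(namespace), safe="")
    parts = [base_uri.rstrip("/"), "v1"]
    if prefix:
        parts.append(quote(prefix, safe=""))
    parts += ["namespaces", encoded_ns, "tables", quote(table, safe="")]
    return "/".join(parts)


print(load_table_url("https://catalog.example.com/api", ["analytics", "sales"], "orders"))
```

Because every engine speaks the same HTTP interface, the catalog can decide which metadata file is current for a table without the engine knowing how the catalog stores that information.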
