Iceberg V2 Spec

The Apache Iceberg format is defined by a strict specification that guarantees how metadata and data files interact, ensuring compatibility across the myriad of engines that read and write to the lakehouse. While Spec V1 focused on analytical batch processing, schema evolution, and massive scalability using the Copy-on-Write methodology, the release of the Iceberg V2 Spec marked a major evolutionary leap for the format.

The Problem with Spec V1

Data files in Iceberg are immutable, and Spec V1 offered no way to mark individual rows inside them as deleted. To delete or update a single row, a query engine had to read the entire data file, remove or alter the target row in memory, and write a brand-new data file to storage. This Copy-on-Write (CoW) approach keeps reads fast, because queries never have to reconcile deletes at read time, but it creates heavy write amplification: changing one row means rewriting every row in the file. It was poorly suited to Change Data Capture (CDC) pipelines or high-frequency streaming workloads, where small, rapid updates would quickly overwhelm the compute clusters tasked with rewriting the data files.
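
To make the write amplification concrete, here is a minimal Python sketch of a Copy-on-Write delete. It is a toy model, not Iceberg's implementation: a "data file" is just a list of rows, and the function name `cow_delete` is hypothetical.

```python
def cow_delete(data_file, should_delete):
    """Copy-on-Write delete: build a brand-new file without the matching rows.

    The original file is never modified; every surviving row is rewritten.
    """
    return [row for row in data_file if not should_delete(row)]


# A toy "data file" holding 1,000 rows.
data_file = [{"id": i} for i in range(1000)]

# Deleting a single row still forces a rewrite of all surviving rows.
new_file = cow_delete(data_file, lambda row: row["id"] == 123)

rows_deleted = len(data_file) - len(new_file)
rows_rewritten = len(new_file)
print(f"deleted {rows_deleted} row, rewrote {rows_rewritten} rows")
```

In this toy case one logical delete costs 999 row writes, which is the pattern that becomes expensive when updates arrive continuously.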

The Innovation of Spec V2

The defining feature of the Iceberg V2 Spec is the introduction of Row-Level Deletes. This fundamentally changes how updates and deletes are handled by enabling a Merge-on-Read (MoR) strategy. Under MoR, the original data file remains untouched. Instead, the engine writes a small "delete file" indicating which rows should be ignored. The merging of the base data and the deleted rows happens dynamically at query time.

Spec V2 introduced two types of delete files to support this:

Position Deletes: Record the exact data file path and the row position within that file for each deleted row. This is efficient for read engines to apply, but requires the writer to know exactly where the data physically resides.
Equality Deletes: Record the column values that identify the deleted rows (e.g., id = 123); any row whose values match is treated as deleted. This is more expensive for read engines to evaluate, but cheap for streaming ingestion engines (like Apache Flink) to write, as they do not need to scan the lake to find the data's location.
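
The two delete types can be sketched as a Merge-on-Read filter in Python. This is an illustrative model only: the function `read_with_deletes`, the file paths, and the in-memory representation of delete files are all assumptions, and sequence-number checks are deliberately left out here.

```python
def read_with_deletes(file_path, rows, position_deletes, equality_deletes):
    """Return the live rows of one data file after applying delete files.

    position_deletes: iterable of (file_path, row_position) pairs.
    equality_deletes: iterable of {column: value} dicts; a row is deleted
                      when it matches every column of any one dict.
    """
    # Only position deletes that target this specific file are relevant.
    deleted_positions = {pos for path, pos in position_deletes if path == file_path}

    def matches_equality_delete(row):
        return any(
            all(row.get(col) == val for col, val in cond.items())
            for cond in equality_deletes
        )

    return [
        row
        for pos, row in enumerate(rows)
        if pos not in deleted_positions and not matches_equality_delete(row)
    ]


# Hypothetical file paths and rows, for illustration only.
rows = [{"id": 121}, {"id": 122}, {"id": 123}, {"id": 124}]
position_deletes = [
    ("s3://bucket/data/file-a.parquet", 0),  # targets row 0 of file-a
    ("s3://bucket/data/file-b.parquet", 1),  # targets a different file
]
equality_deletes = [{"id": 123}]

live_rows = read_with_deletes(
    "s3://bucket/data/file-a.parquet", rows, position_deletes, equality_deletes
)
print(live_rows)
```

Note the asymmetry: the position delete is a set lookup per row, while the equality delete has to compare column values for every row, which is why equality deletes cost more at read time.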

Sequence Numbers

To safely manage these delete files without corrupting data, Spec V2 introduced Sequence Numbers. Every snapshot is assigned a monotonically increasing sequence number when it is committed, and each data file and delete file inherits the sequence number of the snapshot that added it. During a read, an equality delete file applies only to data files with a strictly lower sequence number, while a position delete file applies to data files with a lower or equal sequence number. This guarantees that if a user deletes "Customer A" at sequence 10, and then "Customer A" creates a new account at sequence 15, the delete file from sequence 10 will not accidentally erase the new record.
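
The applicability check can be sketched in a few lines of Python. The function name `delete_applies` and the string labels are hypothetical; the comparison rules follow the Iceberg table spec, under which equality deletes apply to data files with a strictly lower sequence number and position deletes apply to data files with a lower or equal one. Partition matching, which the spec also requires, is omitted.

```python
def delete_applies(delete_kind, delete_seq, data_seq):
    """Decide whether a delete file applies to a data file, by sequence number."""
    if delete_kind == "equality":
        # Strictly older data only, so rows written in the same commit
        # (or later) survive the delete.
        return data_seq < delete_seq
    if delete_kind == "position":
        # Position deletes may also target files written in the same commit.
        return data_seq <= delete_seq
    raise ValueError(f"unknown delete kind: {delete_kind!r}")


# "Customer A" deleted at sequence 10, then re-created at sequence 15.
print(delete_applies("equality", delete_seq=10, data_seq=5))   # older record
print(delete_applies("equality", delete_seq=10, data_seq=15))  # new account
```

The strict comparison for equality deletes is what lets a writer commit "delete id = 123" and "insert id = 123" in the same snapshot as an upsert without the delete removing the freshly inserted row.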

Impact on the Lakehouse

By addressing the streaming ingestion problem, the Iceberg V2 Spec expanded the lakehouse from a batch-oriented analytical repository into a platform that can also serve near-real-time workloads, making it a practical alternative to traditional, expensive data warehouses for many operational and analytical use cases.
