Apache Iceberg

Apache Iceberg is an open table format specification for large-scale analytical datasets, typically stored in cloud object storage. It was originally developed at Netflix to solve reliability and performance problems with Hive-based tables at petabyte scale. It was open-sourced in 2018 and donated to the Apache Software Foundation, where it graduated to a top-level project in 2020. By 2024-2025, Iceberg had become the dominant open table format standard, supported natively by Apache Spark, Apache Flink, Dremio, Trino, Snowflake, BigQuery, and dozens of other engines and platforms.

Iceberg is not a file format. The data itself is stored in Parquet (most commonly), ORC, or Avro files. Iceberg is the metadata and transaction layer that sits above those data files and organizes them into a logical table with ACID semantics, versioned history, and rich partition management.

The Three-Layer Architecture

The Iceberg specification defines three layers:

Catalog layer: Tracks the pointer to each table's current metadata file. The catalog is what a query engine contacts first to find the current state of a table. Implementations include Apache Polaris (REST catalog), AWS Glue, Project Nessie, and Hive Metastore. The catalog update is the atomic operation that makes table writes transactional: a writer commits by swapping the table's pointer from the metadata file it read to the new one it wrote. Under optimistic concurrency control, the swap fails if another writer has committed in the meantime, and the losing writer retries against the new state, so concurrent writers cannot leave the table in an inconsistent state.
Metadata layer: Contains a hierarchy of files. A metadata JSON file records the table's snapshots; each snapshot points to one manifest list; each manifest list points to one or more manifest files; and each manifest file lists data files together with their statistics, such as record counts and per-column value bounds. This hierarchy lets an engine plan a query without listing directories in object storage, which is the key performance advantage over Hive's directory-scanning approach.
Data layer: The actual immutable Parquet (or ORC/Avro) files. These files are never modified in place; writes create new files, and Iceberg's commit protocol atomically updates the metadata pointer so that the table includes the new files and excludes any files an update replaced.
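
The interplay of the three layers can be sketched in a few lines of Python. This is a toy in-memory model, not the Iceberg API: every name in it (ToyCatalog, TableMetadata, append, plan_files, CommitConflict) is hypothetical, the statistics are reduced to one integer range per file, and snapshot IDs are simple counters. It is only meant to show a compare-and-swap commit on the catalog pointer and query planning that walks metadata instead of listing directories.

```python
import threading
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class DataFile:
    path: str
    lower: int  # smallest value of the filtered column in this file
    upper: int  # largest value of the filtered column in this file


@dataclass(frozen=True)
class Manifest:
    data_files: Tuple[DataFile, ...]


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    manifest_list: Tuple[Manifest, ...]


@dataclass(frozen=True)
class TableMetadata:
    snapshots: Tuple[Snapshot, ...] = ()

    @property
    def current(self) -> Optional[Snapshot]:
        return self.snapshots[-1] if self.snapshots else None


class CommitConflict(Exception):
    """Raised when another writer moved the table pointer first."""


class ToyCatalog:
    """Hypothetical catalog: maps a table name to its current metadata."""

    def __init__(self) -> None:
        self._pointers: Dict[str, TableMetadata] = {}
        self._lock = threading.Lock()

    def load(self, table: str) -> TableMetadata:
        with self._lock:
            return self._pointers.setdefault(table, TableMetadata())

    def swap(self, table: str, expected: TableMetadata, new: TableMetadata) -> None:
        # Compare-and-swap: succeed only if the pointer still refers to the
        # exact metadata object the writer started from.
        with self._lock:
            if self._pointers.get(table) is not expected:
                raise CommitConflict(table)
            self._pointers[table] = new


def append(catalog: ToyCatalog, table: str, base: TableMetadata,
           new_files: Iterable[DataFile]) -> None:
    """Commit new data files as a new snapshot built on top of `base`."""
    parent = base.current
    inherited = parent.manifest_list if parent else ()
    snapshot = Snapshot(
        snapshot_id=len(base.snapshots) + 1,
        manifest_list=inherited + (Manifest(tuple(new_files)),),
    )
    catalog.swap(table, base, TableMetadata(base.snapshots + (snapshot,)))


def plan_files(metadata: TableMetadata, lo: int, hi: int) -> List[str]:
    """Return paths whose [lower, upper] range overlaps [lo, hi]."""
    snapshot = metadata.current
    if snapshot is None:
        return []
    return [
        f.path
        for manifest in snapshot.manifest_list
        for f in manifest.data_files
        if f.upper >= lo and f.lower <= hi
    ]
```

In this sketch a writer calls catalog.load, builds its new snapshot, and calls append; if append raises CommitConflict, the writer reloads the metadata and tries again. Because every object is immutable, a reader holding older metadata keeps seeing the table exactly as it was when the reader loaded it.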

Why Iceberg Won

Iceberg's engine-neutrality is its primary competitive advantage. Unlike Delta Lake, which originated in the Databricks/Spark ecosystem, Iceberg was designed from the start as a vendor-neutral specification. Any engine can implement the spec without licensing fees or proprietary dependencies. This made it the natural choice for organizations that want to use different engines for different workloads (Spark for batch transformation, Flink for streaming ingestion, Dremio for interactive BI queries) without being locked into one vendor's tools.

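Hidden partitioning was a significant usability improvement over Hive partitioning. In Iceberg, the table definition declares a partition transform on a source column (for example, the day of a timestamp), and Iceberg derives the partition values itself on write and uses them to prune files on read. Analysts filter on the source column and never need to reference a separate partition column. In Hive, by contrast, a query that omitted the partition column from its filter still returned correct results but silently scanned the full table.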
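
A minimal sketch of the idea, assuming a day transform on a timestamp column: the functions write and scan, the column name event_ts, and the dictionary standing in for partitioned storage are all hypothetical, and timestamps are assumed to be naive datetimes already in UTC. The point is that the caller filters only on event_ts, while the partition value is derived and used for pruning behind the scenes.

```python
from collections import defaultdict
from datetime import date, datetime
from typing import Dict, List, Tuple

EPOCH = date(1970, 1, 1)
Row = Dict[str, object]


def day_transform(ts: datetime) -> int:
    """Partition value: whole days since 1970-01-01 (ts assumed to be UTC)."""
    return (ts.date() - EPOCH).days


def write(rows: List[Row]) -> Dict[int, List[Row]]:
    """Group rows by the derived partition value; writers never supply it."""
    partitions: Dict[int, List[Row]] = defaultdict(list)
    for row in rows:
        partitions[day_transform(row["event_ts"])].append(row)
    return dict(partitions)


def scan(partitions: Dict[int, List[Row]], start: datetime,
         end: datetime) -> Tuple[List[Row], List[int]]:
    """Filter on event_ts only; the predicate is translated to partitions."""
    first, last = day_transform(start), day_transform(end)
    # Prune: keep only partitions that can contain matching rows.
    scanned = sorted(d for d in partitions if first <= d <= last)
    # Residual filter: apply the original predicate inside kept partitions.
    rows = [
        r for d in scanned for r in partitions[d]
        if start <= r["event_ts"] <= end
    ]
    return rows, scanned


table = write([
    {"id": 1, "event_ts": datetime(2024, 1, 1, 10, 0)},
    {"id": 2, "event_ts": datetime(2024, 1, 1, 23, 0)},
    {"id": 3, "event_ts": datetime(2024, 1, 2, 1, 0)},
    {"id": 4, "event_ts": datetime(2024, 1, 5, 0, 0)},
])
matches, scanned = scan(table, datetime(2024, 1, 1, 22, 0),
                        datetime(2024, 1, 2, 2, 0))
```

Here the query touches two of the three day partitions and returns rows 2 and 3, even though the caller never mentioned a partition column.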