The Apache Iceberg specification defines a precise, three-layer architecture that organizes data files stored in object storage into a fully ACID-compliant, versionable table structure. Understanding this architecture explains why Iceberg achieves efficient query planning, concurrent read/write safety, and reliable time travel without requiring a stateful server process to manage the table state.

Layer 1: The Catalog

The catalog maintains a single pointer per table: the location of the table's current metadata file. When a query engine needs to read an Iceberg table, it asks the catalog for the table by name and receives the metadata file location in response. Updating this pointer is the atomic operation at the heart of Iceberg's transactionality.

When a writer commits a new snapshot, it creates a new metadata file and attempts to atomically update the catalog's pointer from the old metadata file to the new one. If two writers attempt to commit simultaneously, only one succeeds. The other detects the conflict (because the metadata pointer has changed since it read it) and must retry from the current state. This is Iceberg's Optimistic Concurrency Control mechanism.
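The commit-and-retry loop described above can be sketched in a few lines. This is a minimal illustration, not the API of any real catalog: `InMemoryCatalog`, `swap`, `commit_with_retry`, and `build_metadata` are hypothetical names, and a lock stands in for whatever atomic primitive a real catalog uses.

```python
import threading
from dataclasses import dataclass, field


class CommitConflict(Exception):
    """Raised when the catalog pointer moved since the writer last read it."""


@dataclass
class InMemoryCatalog:
    """Hypothetical catalog: one metadata-file pointer per table name."""
    pointers: dict = field(default_factory=dict)
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def current(self, table: str) -> str:
        return self.pointers[table]

    def swap(self, table: str, expected: str, new: str) -> None:
        # Atomic compare-and-swap: succeed only if nobody committed in between.
        with self._lock:
            if self.pointers[table] != expected:
                raise CommitConflict(f"{table}: pointer is no longer {expected}")
            self.pointers[table] = new


def commit_with_retry(catalog, table, build_metadata, max_attempts=3):
    """Re-read the pointer, build new metadata on top of it, and try to swap."""
    for _ in range(max_attempts):
        base = catalog.current(table)
        new_location = build_metadata(base)  # assumed to write a new metadata file
        try:
            catalog.swap(table, base, new_location)
            return new_location
        except CommitConflict:
            continue  # another writer won; retry from the current state
    raise CommitConflict(f"{table}: gave up after {max_attempts} attempts")
```

The losing writer never overwrites the winner's commit. It rebuilds its metadata on top of whatever the pointer now references and tries the swap again.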

Catalog implementations include the Iceberg REST Catalog (a vendor-neutral HTTP spec), Apache Polaris (an open-source implementation of the REST spec), AWS Glue, Apache Hive Metastore, Project Nessie (which adds Git-like branch semantics), and JDBC-backed catalogs for Postgres or MySQL.

Layer 2: The Metadata Layer

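The metadata layer has three tiers of files. The table metadata file (JSON) records the schema, the partition spec, and the list of snapshots. Each snapshot points to a manifest list, which names the manifest files that make up that snapshot and summarizes the partition ranges each one covers. Each manifest file lists individual data files together with their partition values and column-level statistics.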

Layer 3: The Data Layer

The data layer consists of immutable data files in Parquet, ORC, or Avro format. These files are never modified after they are written. Updates and deletes are handled by writing new files and recording the changes in the metadata layer. Starting with version 2 of the spec, row-level deletes are stored in separate delete files that the metadata layer tracks alongside the data files. A positional delete file records the path of a data file and the row positions within it that are deleted. An equality delete file identifies deleted rows by the values of one or more columns.
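To make the two delete styles concrete, here is a small sketch of how a reader might apply them while scanning one data file. The class and function names are hypothetical, rows are plain dictionaries, and the spec's rules about which delete files apply to which data files (based on commit ordering) are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PositionDelete:
    file_path: str   # data file the delete applies to
    pos: int         # 0-based row position within that file


@dataclass(frozen=True)
class EqualityDelete:
    values: tuple    # ((column, value), ...) that a deleted row must match


def read_live_rows(file_path, rows, position_deletes, equality_deletes):
    """Return the rows of one data file with both kinds of deletes applied."""
    dead_positions = {d.pos for d in position_deletes if d.file_path == file_path}
    live = []
    for pos, row in enumerate(rows):
        if pos in dead_positions:
            continue  # removed by a positional delete
        if any(all(row.get(c) == v for c, v in d.values) for d in equality_deletes):
            continue  # removed by an equality delete
        live.append(row)
    return live
```

The data file itself is never touched. The deletes are merged in at read time, which is what keeps the data layer immutable.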

Why This Architecture Performs

The key performance insight is that query planning never requires listing object storage directories. A Hive-style table on S3 discovers its files by listing the objects under the table's partition prefixes, which is expensive at scale (S3 LIST operations have meaningful latency and cost). Iceberg's query planner instead reads the metadata hierarchy directly, following the pointer chain from the metadata file to the relevant manifest files and pruning by partition at each level. For a petabyte-scale table with millions of files, this can reduce query planning from minutes to seconds.
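A toy version of that planning walk is sketched below. It assumes a single string-valued partition column (such as an event date) and uses hypothetical in-memory classes in place of real metadata files, so it illustrates only the pruning idea, not Iceberg's actual planner.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    path: str
    partition: str        # e.g. an event date such as "2024-02-02"


@dataclass
class Manifest:
    path: str
    min_partition: str    # range summary, as a manifest list would carry
    max_partition: str
    files: list


@dataclass
class Snapshot:
    manifests: list       # stands in for the snapshot's manifest list


def plan_scan(snapshot, lo, hi):
    """Return (data file paths with partition in [lo, hi], manifests opened)."""
    selected = []
    manifests_opened = 0
    for m in snapshot.manifests:
        # Level 1: skip whole manifests using only their range summary.
        if m.max_partition < lo or m.min_partition > hi:
            continue
        manifests_opened += 1
        # Level 2: prune individual files by their partition value.
        selected += [f.path for f in m.files if lo <= f.partition <= hi]
    return selected, manifests_opened
```

No storage listing happens anywhere in this walk. The planner touches only the metadata that the range summaries say could contain matching files.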
