The Apache Iceberg specification defines a precise, three-layer architecture that organizes data files stored in object storage into a fully ACID-compliant, versionable table structure. Understanding this architecture explains why Iceberg achieves efficient query planning, concurrent read/write safety, and reliable time travel without requiring a stateful server process to manage the table state.
Layer 1: The Catalog
The catalog maintains a single pointer per table: the URI of the table's current metadata file. When a query engine needs to read an Iceberg table, it sends the table name to the catalog and receives the metadata file location in return. Atomically updating this pointer is the operation at the heart of Iceberg's transactional guarantees.
When a writer commits a new snapshot, it creates a new metadata file and attempts to atomically update the catalog's pointer from the old metadata file to the new one. If two writers attempt to commit simultaneously, only one succeeds. The other detects the conflict (because the metadata pointer has changed since it read it) and must retry from the current state. This is Iceberg's Optimistic Concurrency Control mechanism.
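The commit protocol described above amounts to a compare-and-swap on the catalog pointer. The following is a minimal sketch under stated assumptions, not how any particular catalog implements it: `InMemoryCatalog`, `CommitConflict`, `swap`, and `commit_with_retry` are hypothetical names, and a thread lock stands in for whatever atomic primitive a real catalog uses.

```python
import threading


class CommitConflict(Exception):
    """Raised when the catalog pointer moved since the writer read it."""


class InMemoryCatalog:
    """Hypothetical catalog holding one metadata-file pointer per table."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def current(self, table):
        with self._lock:
            return self._pointers.get(table)

    def swap(self, table, expected, new):
        """Set the pointer to `new` only if it still equals `expected`."""
        with self._lock:
            if self._pointers.get(table) != expected:
                raise CommitConflict(table)
            self._pointers[table] = new


def commit_with_retry(catalog, table, write_metadata, max_attempts=5):
    """Re-read the pointer and retry whenever another writer won the race."""
    for _ in range(max_attempts):
        base = catalog.current(table)
        # `write_metadata` (assumed callback) writes a new metadata file
        # on top of `base` and returns its location.
        new_location = write_metadata(base)
        try:
            catalog.swap(table, base, new_location)
            return new_location
        except CommitConflict:
            continue  # pointer changed underneath us; start from the new state
    raise CommitConflict(f"gave up on {table} after {max_attempts} attempts")
```

The losing writer never overwrites the winner's commit: its `swap` fails because the expected pointer is stale, so it rebuilds its metadata file on top of the new current state and tries again.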
Catalog implementations include the Iceberg REST Catalog (a vendor-neutral HTTP spec), Apache Polaris (an open-source implementation of the REST spec), AWS Glue, Apache Hive Metastore, Project Nessie (which adds Git-like branch semantics), and JDBC-backed catalogs for Postgres or MySQL.
Layer 2: The Metadata Layer
The metadata layer has three tiers of files:
- Metadata files (JSON): These contain the table's schema history, partition spec history, and a list of snapshots. Every commit, whether it adds a snapshot or changes the schema, partition spec, or table properties, writes a new metadata file. The metadata file is the starting point for reading a specific version of the table.
- Manifest lists (Avro): Each snapshot has one manifest list that records all the manifest files belonging to that snapshot, along with partition value ranges (lower and upper bounds) summarizing the contents of each manifest. These summaries let the planner skip entire manifests whose partition ranges cannot match the query, before it opens any manifest file.
- Manifest files (Avro): Each manifest file lists a set of data files along with their column-level statistics: lower bound, upper bound, null count, and value count for each column. These statistics enable data file pruning during query planning. A query filtering on dates in 2024 can skip manifest entries for data files whose upper bound is before 2024 or whose lower bound is after 2024, without reading those files at all.
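The data file pruning described in the last bullet is an interval-overlap test against each file's column bounds. Here is a small sketch of that idea, assuming a simplified manifest entry; `ManifestEntry`, `may_contain`, and `plan_files` are hypothetical names, and real manifests store bounds per column ID in a binary encoding.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ManifestEntry:
    """Hypothetical, simplified manifest entry for one data file."""
    path: str
    lower: date  # lower bound of the filter column in this file
    upper: date  # upper bound of the filter column in this file


def may_contain(entry, start, end):
    """True if the file's [lower, upper] range overlaps [start, end]."""
    return entry.lower <= end and entry.upper >= start


def plan_files(entries, start, end):
    """Keep only the data files whose bounds overlap the filter range."""
    return [e.path for e in entries if may_contain(e, start, end)]


manifest = [
    ManifestEntry("data/a.parquet", date(2023, 1, 1), date(2023, 12, 31)),
    ManifestEntry("data/b.parquet", date(2023, 12, 15), date(2024, 1, 10)),
    ManifestEntry("data/c.parquet", date(2025, 1, 1), date(2025, 3, 1)),
]
selected = plan_files(manifest, date(2024, 1, 1), date(2024, 12, 31))
print(selected)  # ['data/b.parquet']
```

Bounds can only prove that a file has no matching rows. A file that survives pruning may still contain none, so the engine applies the filter again when it reads the file.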
Layer 3: The Data Layer
The data layer consists of immutable Parquet (or ORC/Avro) files. These files are never modified after they are written. Updates and deletes are handled by writing new files and recording the changes in the metadata layer. In version 2 of the spec, deletes can be recorded in separate delete files that the metadata layer tracks alongside the data files. Positional delete files record which row positions in a data file are deleted; equality delete files identify deleted rows by the values of one or more columns.
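A reader reconciles delete files with data files at scan time. The sketch below illustrates the two delete styles on rows held in memory; `apply_deletes` is a hypothetical helper, and it deliberately ignores details of the real spec such as sequence numbers, which determine which delete files apply to which data files.

```python
def apply_deletes(rows, deleted_positions, equality_deletes):
    """Yield the rows of one data file that survive both kinds of delete.

    rows: list of dicts in file order (position = list index)
    deleted_positions: set of row positions from positional delete files
    equality_deletes: list of dicts; a row is deleted if it matches every
        column/value pair in any one of them
    """
    for position, row in enumerate(rows):
        if position in deleted_positions:
            continue  # removed by a positional delete
        if any(
            all(row.get(col) == val for col, val in pred.items())
            for pred in equality_deletes
        ):
            continue  # removed by an equality delete
        yield row


rows = [
    {"id": 1, "region": "eu"},
    {"id": 2, "region": "us"},
    {"id": 3, "region": "eu"},
    {"id": 4, "region": "ap"},
]
survivors = list(apply_deletes(rows, {1}, [{"id": 3}]))
print([r["id"] for r in survivors])  # [1, 4]
```

The data file itself is untouched in both cases, which is what keeps the data layer immutable.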
Why This Architecture Performs
The key performance insight is that query planning never requires listing object storage directories. A Hive-style table on S3 discovers its files by listing the objects under the table's partition prefixes, which is expensive at scale (S3 LIST operations have meaningful latency and cost). Iceberg's query planner instead reads the metadata hierarchy directly, following the pointer chain from the metadata file to the snapshot's manifest list and then to the relevant manifest files, pruning at each level. For a petabyte-scale table with millions of files, this can reduce query planning time from minutes to seconds.
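To make the pointer chain concrete, the following sketch plans a scan using only keyed reads of known paths, with no listing step. It is an illustration under simplifying assumptions: `CATALOG`, `FILES`, and `plan_scan` are hypothetical, the dictionaries stand in for the catalog and for files in object storage, and the record layouts are far simpler than the real JSON and Avro schemas.

```python
# Hypothetical in-memory stand-ins for the catalog and object storage.
CATALOG = {"db.events": "meta/v2.metadata.json"}

FILES = {
    "meta/v2.metadata.json": {
        "current-snapshot-id": 2,
        "snapshots": {1: "meta/snap-1.avro", 2: "meta/snap-2.avro"},
    },
    # Manifest lists: one entry per manifest, with partition bounds.
    "meta/snap-1.avro": [
        {"manifest": "meta/m1.avro", "lower": "2023-01", "upper": "2023-12"},
    ],
    "meta/snap-2.avro": [
        {"manifest": "meta/m1.avro", "lower": "2023-01", "upper": "2023-12"},
        {"manifest": "meta/m2.avro", "lower": "2024-01", "upper": "2024-06"},
    ],
    # Manifests: the data files they track.
    "meta/m1.avro": ["data/2023-01/a.parquet", "data/2023-07/b.parquet"],
    "meta/m2.avro": ["data/2024-01/c.parquet", "data/2024-05/d.parquet"],
}


def plan_scan(table, start, end, snapshot_id=None):
    """Return (data files to scan, metadata objects read) without listing."""
    reads = []

    def read(path):
        reads.append(path)
        return FILES[path]

    metadata = read(CATALOG[table])
    if snapshot_id is None:
        snapshot_id = metadata["current-snapshot-id"]
    manifest_list = read(metadata["snapshots"][snapshot_id])

    data_files = []
    for entry in manifest_list:
        # Skip whole manifests whose partition range cannot match the filter.
        if entry["upper"] < start or entry["lower"] > end:
            continue
        data_files.extend(read(entry["manifest"]))
    return data_files, reads


files, reads = plan_scan("db.events", "2024-01", "2024-12")
print(files)  # ['data/2024-01/c.parquet', 'data/2024-05/d.parquet']
print(reads)  # ['meta/v2.metadata.json', 'meta/snap-2.avro', 'meta/m2.avro']
```

Passing `snapshot_id=1` plans against the older snapshot through the same chain, which is the mechanism behind time travel: old snapshots remain readable as long as their metadata and data files are retained.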



