To leverage Apache Iceberg effectively—especially for high-concurrency environments or Agentic Analytics—data engineers must understand the exact mechanics of its architecture. Iceberg is not an execution engine, nor is it a storage engine. It is an open specification for a table format, implemented via a series of strictly governed metadata files.
This architectural deep dive breaks down the metadata tree, the write/commit lifecycle, query planning optimizations, and how table state is safely mutated across distributed systems.
The Metadata Tree
The defining characteristic of Iceberg is that it tracks data at the file level rather than the directory level. This state is managed through a hierarchical tree of metadata. When a query engine reads an Iceberg table, it traverses this tree from the top node down to the physical data files.
1. The Catalog
The Catalog is the entry point for every Iceberg table. Its core responsibility is to store a mapping from a table name (e.g., db.sales) to the URI of the table's current Metadata JSON file. The Catalog is the only component in the Iceberg architecture that must support an atomic operation: swapping that pointer from one metadata file to the next (such as a database compare-and-swap). Commonly used catalogs include Apache Polaris and other implementations of the Iceberg REST Catalog API, as well as AWS Glue.
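The pointer swap described above can be illustrated with a minimal in-memory sketch. The `InMemoryCatalog` class, its method names, and the `s3://bucket/...` paths are hypothetical and only model the compare-and-swap contract; they are not the API of any real catalog.

```python
import threading


class InMemoryCatalog:
    """Toy catalog: maps a table name to its current metadata file location."""

    def __init__(self):
        self._pointers = {}
        # The lock stands in for the atomicity a real backing store provides.
        self._lock = threading.Lock()

    def current_metadata(self, table):
        return self._pointers.get(table)

    def compare_and_swap(self, table, expected, new):
        """Move the pointer to `new` only if it still equals `expected`."""
        with self._lock:
            if self._pointers.get(table) != expected:
                return False  # another writer committed first
            self._pointers[table] = new
            return True


catalog = InMemoryCatalog()
v1 = "s3://bucket/sales/metadata/v1.metadata.json"  # invented path
v2 = "s3://bucket/sales/metadata/v2.metadata.json"  # invented path

# First commit: the table does not exist yet, so the expected pointer is None.
created = catalog.compare_and_swap("db.sales", None, v1)
# A commit based on a stale pointer is rejected and leaves the pointer untouched.
stale = catalog.compare_and_swap("db.sales", None, v2)
# A commit based on the current pointer succeeds.
advanced = catalog.compare_and_swap("db.sales", v1, v2)
```

Everything else in the table (data files, manifests, metadata files) can live on storage without atomic rename or locking, because readers only ever see what the pointer currently references.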
2. Table Metadata JSON
The Metadata JSON file is the source of truth for the table at a specific point in time. It contains:
- Schema: The current schema, which assigns a unique integer ID to every column. Data files are matched to columns by ID rather than by name, so renaming, reordering, or dropping a column never misreads existing data.
- Partition Spec: The rules defining how data is partitioned (enabling Hidden Partitioning).
- Snapshots Array: A history of the table's state.
- Current Snapshot ID: A pointer to the exact snapshot that represents the "live" state of the table.
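To make the layout concrete, here is a heavily abridged metadata document and two lookups against it. The key names are modeled on the Iceberg table spec, but the file is trimmed, and the paths, IDs, and column names are invented for illustration; treat it as a sketch rather than a valid metadata file.

```python
import json

# Abridged metadata file. Paths and values are invented for illustration.
METADATA_JSON = """
{
  "format-version": 2,
  "location": "s3://bucket/sales",
  "current-schema-id": 0,
  "schemas": [
    {"schema-id": 0, "type": "struct", "fields": [
      {"id": 1, "name": "order_id", "required": true, "type": "long"},
      {"id": 2, "name": "order_ts", "required": false, "type": "timestamp"}
    ]}
  ],
  "default-spec-id": 0,
  "partition-specs": [
    {"spec-id": 0, "fields": [
      {"source-id": 2, "field-id": 1000, "name": "order_day", "transform": "day"}
    ]}
  ],
  "current-snapshot-id": 2,
  "snapshots": [
    {"snapshot-id": 1, "timestamp-ms": 1700000000000,
     "manifest-list": "s3://bucket/sales/metadata/snap-1.avro"},
    {"snapshot-id": 2, "timestamp-ms": 1700003600000,
     "manifest-list": "s3://bucket/sales/metadata/snap-2.avro"}
  ]
}
"""

metadata = json.loads(METADATA_JSON)


def current_snapshot(metadata):
    """Return the snapshot entry that current-snapshot-id points at."""
    wanted = metadata["current-snapshot-id"]
    return next(s for s in metadata["snapshots"] if s["snapshot-id"] == wanted)


def partition_source_columns(metadata):
    """Resolve each partition field to its source column through the field ID."""
    schema = next(s for s in metadata["schemas"]
                  if s["schema-id"] == metadata["current-schema-id"])
    names_by_id = {f["id"]: f["name"] for f in schema["fields"]}
    spec = next(p for p in metadata["partition-specs"]
                if p["spec-id"] == metadata["default-spec-id"])
    return [names_by_id[f["source-id"]] for f in spec["fields"]]


live = current_snapshot(metadata)
partition_columns = partition_source_columns(metadata)
```

Because the partition spec refers to column ID 2 rather than the name `order_ts`, renaming that column would change only the schema entry and leave the partitioning rule intact.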
3. Snapshots and Manifest Lists
A Snapshot represents the complete set of data files that belong to a table at the exact moment the snapshot was created. Every Snapshot points to a single Manifest List file (stored in Avro format).
The Manifest List is an index of Manifest Files. It contains metadata about each Manifest File, including summary bounds (lower and upper values) for the partitions of the files it tracks. This allows the query engine to skip any manifest whose partition range cannot match the user's query predicates, drastically reducing the search space before a single manifest is opened.
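A minimal sketch of that pruning step follows, assuming a single date partition column whose values are ISO-8601 strings (so string comparison matches date order). The `ManifestEntry` shape and file names are hypothetical simplifications of what a manifest list row holds.

```python
from dataclasses import dataclass


@dataclass
class ManifestEntry:
    """One row of a manifest list (hypothetical, simplified shape)."""
    manifest_path: str
    partition_lower: str  # smallest partition value among tracked files
    partition_upper: str  # largest partition value among tracked files


def prune_manifests(manifest_list, query_lower, query_upper):
    """Keep only manifests whose partition range overlaps the query range."""
    return [
        m for m in manifest_list
        if m.partition_upper >= query_lower and m.partition_lower <= query_upper
    ]


manifest_list = [
    ManifestEntry("m1.avro", "2024-01-01", "2024-01-31"),
    ManifestEntry("m2.avro", "2024-02-01", "2024-02-29"),
    ManifestEntry("m3.avro", "2024-03-01", "2024-03-31"),
]

# Query: WHERE order_day BETWEEN '2024-02-15' AND '2024-03-05'
survivors = prune_manifests(manifest_list, "2024-02-15", "2024-03-05")
survivor_paths = [m.manifest_path for m in survivors]
```

Here `m1.avro` is never opened, because its upper bound falls before the start of the query range.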
4. Manifest Files
Also stored in Avro format, Manifest Files track the actual underlying data files (Parquet, ORC, or Avro). Each record in a Manifest File contains:
- The absolute URI of the physical data file.
- The partition data for that file.
- Column-level statistics (min/max values, null counts, NaN counts).
These statistics drive much of Iceberg's read performance: an engine can read the Manifest File, determine from the min/max bounds that a specific Parquet file cannot contain rows matching the query filter, and skip that file without ever opening it.
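File skipping on column bounds can be sketched as follows. The `DataFileEntry` class is a hypothetical simplification: real manifests key their bounds by column ID and store serialized values, whereas this sketch uses column names and plain integers for readability.

```python
from dataclasses import dataclass


@dataclass
class DataFileEntry:
    """One manifest record (hypothetical shape; bounds keyed by column name)."""
    file_path: str
    lower_bounds: dict  # column -> minimum value in the file
    upper_bounds: dict  # column -> maximum value in the file


def may_contain(entry, column, value):
    """Return False only when the bounds prove no row has column == value."""
    lo = entry.lower_bounds.get(column)
    hi = entry.upper_bounds.get(column)
    if lo is None or hi is None:
        return True  # no statistics recorded, so the file cannot be skipped
    return lo <= value <= hi


files = [
    DataFileEntry("part-0.parquet", {"amount": 5}, {"amount": 40}),
    DataFileEntry("part-1.parquet", {"amount": 50}, {"amount": 900}),
    DataFileEntry("part-2.parquet", {}, {}),  # written without stats
]

# Query: WHERE amount = 100
to_scan = [f.file_path for f in files if may_contain(f, "amount", 100)]
```

The check is deliberately conservative: bounds can prove that a file is irrelevant, but a file that passes the check may still contain no matching rows.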
The Write Lifecycle and Optimistic Concurrency
One of the most dangerous operations in a traditional data lake is multiple engines attempting to write to the same folder simultaneously. Iceberg solves this using Optimistic Concurrency Control (OCC).
When an ingestion job (e.g., an Apache Spark streaming job) wants to append data to an Iceberg table, the sequence is as follows:
- Read Current State: The writer asks the Catalog for the current Metadata JSON. It notes the current Snapshot ID (e.g., `Snapshot_V1`).
- Write Data: The writer physically writes new Parquet files to object storage. At this point the files are unreferenced: no snapshot points to them, so no reader can see them yet.
- Write Metadata: The writer creates a new Manifest File tracking the new Parquet files, a new Manifest List incorporating the new and old manifests, and a proposed Metadata JSON file (`Metadata_V2`) pointing to a new snapshot (`Snapshot_V2`).
- The Commit: The writer asks the Catalog to atomically swap the table pointer from `Metadata_V1` to `Metadata_V2`.
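If another writer committed first (say, `Metadata_V1_B`, which replaced `Metadata_V1` as the current pointer) while our writer was generating files, the Catalog rejects our writer's commit because the pointer no longer equals `Metadata_V1`. Our writer catches the exception, re-reads the new current state (`Metadata_V1_B`), rebuilds its metadata on top of that state as a new proposal (`Metadata_V3`), and retries the atomic commit. For an append, the Parquet files already written are reused; only the metadata is regenerated. This gives atomic, isolated commits across distributed engines without locking files.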
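The commit-and-retry loop can be sketched as below. This is a toy model under stated assumptions: `Catalog`, `commit_append`, the `before_swap` hook, and the dictionary standing in for metadata files on object storage are all hypothetical, and a metadata file is reduced to a list of manifest names.

```python
import uuid


class CommitFailedError(Exception):
    """Raised when the commit still conflicts after all retries."""


class Catalog:
    """Toy catalog holding one table pointer with an atomic swap."""

    def __init__(self, pointer):
        self.pointer = pointer

    def compare_and_swap(self, expected, new):
        if self.pointer != expected:
            return False
        self.pointer = new
        return True


def commit_append(catalog, metadata_files, new_manifest, max_retries=3, before_swap=None):
    """Append one manifest, rebasing on the latest table state after each conflict."""
    for attempt in range(1, max_retries + 1):
        base = catalog.pointer  # re-read the current metadata location
        manifests = metadata_files[base] + [new_manifest]
        proposed = f"{len(manifests):05d}-{uuid.uuid4()}.metadata.json"
        metadata_files[proposed] = manifests  # written, but not yet visible
        if before_swap:
            before_swap(attempt)  # demo hook: lets another writer sneak in
        if catalog.compare_and_swap(base, proposed):
            return attempt
    raise CommitFailedError(f"gave up after {max_retries} attempts")


metadata_files = {"v1.metadata.json": ["m0.avro"]}
catalog = Catalog("v1.metadata.json")


def competing_writer(attempt):
    # On our first attempt only, a second writer commits ahead of us.
    if attempt == 1:
        metadata_files["v1b.metadata.json"] = ["m0.avro", "other.avro"]
        catalog.compare_and_swap("v1.metadata.json", "v1b.metadata.json")


attempts_used = commit_append(catalog, metadata_files, "ours.avro",
                              before_swap=competing_writer)
final_manifests = metadata_files[catalog.pointer]
```

The first attempt fails because the pointer moved, and the second succeeds with a manifest list that contains both writers' work. The rejected metadata file from the first attempt stays behind in storage, which is one reason orphan cleanup exists.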
Retention and Garbage Collection
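Because Iceberg never modifies data or metadata files in place (each commit only adds new snapshots and new metadata files), the storage footprint of an active table will grow indefinitely. The retained history is what enables powerful features like Time Travel (querying the table as it looked exactly three days ago) and Rollbacks (reverting the table to an earlier snapshot after an accidental DELETE or a bad write). A rollback cannot undo a DROP TABLE, since dropping a table removes the catalog entry itself.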
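Both features reduce to choosing which snapshot to read. A minimal sketch, assuming table metadata shaped as a dictionary with `snapshots` and `current-snapshot-id` keys; the function names, snapshot IDs, and timestamps are hypothetical.

```python
import copy

DAY_MS = 86_400_000  # milliseconds per day


def snapshot_as_of(metadata, timestamp_ms):
    """Time travel: latest snapshot committed at or before the given time."""
    candidates = [s for s in metadata["snapshots"] if s["timestamp-ms"] <= timestamp_ms]
    if not candidates:
        raise ValueError("no snapshot at or before the requested time")
    return max(candidates, key=lambda s: s["timestamp-ms"])


def rollback_to(metadata, snapshot_id):
    """Rollback: new metadata whose current pointer targets an earlier snapshot."""
    if all(s["snapshot-id"] != snapshot_id for s in metadata["snapshots"]):
        raise ValueError(f"unknown snapshot {snapshot_id}")
    rolled_back = copy.deepcopy(metadata)  # the old metadata is never mutated
    rolled_back["current-snapshot-id"] = snapshot_id
    return rolled_back


metadata = {
    "current-snapshot-id": 12,
    "snapshots": [
        {"snapshot-id": 10, "timestamp-ms": 1 * DAY_MS},
        {"snapshot-id": 11, "timestamp-ms": 4 * DAY_MS},
        {"snapshot-id": 12, "timestamp-ms": 6 * DAY_MS},
    ],
}

as_of_day_5 = snapshot_as_of(metadata, 5 * DAY_MS)
restored = rollback_to(metadata, 10)
```

Neither operation touches data files, which is why both only work for snapshots that have not yet been expired.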
However, this requires maintenance. Data engineers must run periodic table services:
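- Expire Snapshots: Removes snapshots older than a specific retention period (e.g., 7 days) from the table metadata, then deletes the data and metadata files that only those expired snapshots referenced.
- Delete Orphan Files: Physically deletes files under the table location that no table metadata references at all, such as files left behind by failed or aborted writes.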
- Compaction: Rewrites thousands of tiny Parquet files into optimally sized larger files (typically 128MB to 512MB) to improve read performance.
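The reachability logic behind snapshot expiration can be sketched as follows. This is a simplification under stated assumptions: each snapshot carries a flat, hypothetical `data-files` set (real snapshots reach their files through manifest lists and manifests), and the function only reports which files would become unreferenced rather than deleting anything.

```python
def expire_snapshots(snapshots, older_than_ms, current_id):
    """Return (kept snapshots, data files no remaining snapshot references)."""
    kept, expired = [], []
    for snap in snapshots:
        # The current snapshot is always retained, regardless of age.
        if snap["timestamp-ms"] >= older_than_ms or snap["snapshot-id"] == current_id:
            kept.append(snap)
        else:
            expired.append(snap)
    still_referenced = set().union(*(s["data-files"] for s in kept))
    candidates = set().union(*(s["data-files"] for s in expired))
    return kept, sorted(candidates - still_referenced)


snapshots = [
    {"snapshot-id": 1, "timestamp-ms": 1_000, "data-files": {"a.parquet", "b.parquet"}},
    {"snapshot-id": 2, "timestamp-ms": 2_000,
     "data-files": {"a.parquet", "b.parquet", "c.parquet"}},
    # Snapshot 3 is the result of compacting a.parquet and b.parquet into d.parquet.
    {"snapshot-id": 3, "timestamp-ms": 3_000, "data-files": {"c.parquet", "d.parquet"}},
]

kept, deletable = expire_snapshots(snapshots, older_than_ms=2_500, current_id=3)
kept_ids = [s["snapshot-id"] for s in kept]
```

The example also shows why compaction alone does not free storage: the small files it replaces remain reachable from older snapshots until those snapshots are expired.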
By mastering the Iceberg metadata tree and its operational lifecycle, engineering teams can build resilient, high-performance foundations that are fully capable of supporting the rigorous demands of Agentic AI.