Iceberg Catalog

At the highest level of the Apache Iceberg architecture sits the Catalog. You can think of the Catalog as the "front desk" of your data lakehouse. When a query engine (like Spark, Flink, or Trino) wants to read from or write to an Iceberg table, it cannot simply scan an object storage bucket. It must first ask the Catalog where the table's current metadata is located.

The Role of the Catalog

The core responsibility of the Iceberg Catalog is to store a mapping between a logical table name (e.g., marketing.campaign_results) and the absolute URI of its most recent JSON Metadata File in object storage. The Catalog does not store the data files, nor does it store the bulk of the metadata (like manifest lists or manifest files). It simply stores the pointer to the root of the metadata tree.
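To make the "pointer" idea concrete, here is a minimal in-memory sketch of that mapping. The class name `PointerCatalog`, its methods, and the bucket path are hypothetical illustrations, not part of any Iceberg library; a real catalog persists this mapping in a durable service.

```python
from typing import Dict, Tuple

# A table identifier is modeled here as (namespace, table name).
TableIdentifier = Tuple[str, str]


class PointerCatalog:
    """Hypothetical catalog that stores only one pointer per table."""

    def __init__(self) -> None:
        # Maps a logical table name to the URI of its current metadata JSON file.
        self._pointers: Dict[TableIdentifier, str] = {}

    def register_table(self, identifier: TableIdentifier, metadata_location: str) -> None:
        """Record the current metadata file for a table."""
        self._pointers[identifier] = metadata_location

    def current_metadata_location(self, identifier: TableIdentifier) -> str:
        """Return the pointer an engine must follow before reading any data."""
        try:
            return self._pointers[identifier]
        except KeyError:
            raise LookupError(f"No such table: {'.'.join(identifier)}") from None


catalog = PointerCatalog()
catalog.register_table(
    ("marketing", "campaign_results"),
    # Assumed example path; the bucket and file name are made up.
    "s3://example-bucket/marketing/campaign_results/metadata/v3.metadata.json",
)
location = catalog.current_metadata_location(("marketing", "campaign_results"))
```

Note that the catalog holds no manifests and no data files. Everything else is discovered by reading the metadata file at `location`.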

Enabling ACID Transactions

Because object storage systems (like Amazon S3 or Google Cloud Storage) do not natively support atomic renames or multi-file transactions, a writer cannot safely publish a commit that spans many files using storage operations alone. Iceberg therefore delegates the atomic commit step to the Catalog layer.

When an engine finishes writing new data, it creates a new metadata JSON file. To finalize the commit, the engine asks the Catalog to swap the pointer from the old metadata file to the new one. The Catalog must provide an atomic Compare-and-Swap (CAS) operation. If the pointer has not changed since the engine started its transaction, the swap succeeds. If another engine changed the pointer first, the Catalog rejects the commit, and the losing engine must refresh to the latest metadata and retry. This optimistic concurrency control is how Iceberg provides ACID guarantees across concurrent writers.
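The commit protocol described above can be sketched in a few lines. This is a simplified, single-process illustration under stated assumptions: `CasCatalog`, `commit_with_retry`, and the `write_metadata` callback are hypothetical names, and a `threading.Lock` stands in for whatever atomicity a real catalog backend provides.

```python
import threading
from typing import Callable


class CommitConflict(Exception):
    """Raised when a commit keeps losing the race to other writers."""


class CasCatalog:
    """Hypothetical catalog holding one table pointer with atomic CAS."""

    def __init__(self, initial_location: str) -> None:
        self._location = initial_location
        self._lock = threading.Lock()

    def current(self) -> str:
        with self._lock:
            return self._location

    def compare_and_swap(self, expected: str, new: str) -> bool:
        """Swap the pointer only if it still equals `expected`."""
        with self._lock:
            if self._location != expected:
                return False  # another writer committed first
            self._location = new
            return True


def commit_with_retry(
    catalog: CasCatalog,
    write_metadata: Callable[[str], str],
    max_attempts: int = 3,
) -> str:
    """Optimistic commit: read the base pointer, write new metadata, then CAS."""
    for _ in range(max_attempts):
        base = catalog.current()
        # The callback produces a new metadata file derived from `base`
        # and returns its location.
        new_location = write_metadata(base)
        if catalog.compare_and_swap(base, new_location):
            return new_location
        # CAS failed: loop again so the next attempt starts from fresh state.
    raise CommitConflict(f"Gave up after {max_attempts} attempts")
```

The key design point is that writers never block each other while producing files. Only the final pointer swap is serialized, so a conflict costs the loser a retry rather than corrupting the table.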

Types of Catalogs

Iceberg's modular design means organizations can plug in different catalog implementations depending on their infrastructure:

REST Catalogs: Catalog services, such as Apache Polaris or managed offerings, that implement the open Iceberg REST Catalog API. Because any compatible engine can speak the same protocol, this reduces vendor lock-in.
Hive Metastore (HMS): A legacy catalog often used by organizations transitioning from Hadoop ecosystems to Iceberg.
Cloud-Native Catalogs: Managed services like the AWS Glue Data Catalog that natively integrate with cloud IAM and analytics ecosystems.
Database/JDBC Catalogs: Using a relational database (like PostgreSQL or MySQL) to store the table pointers, often used for local development or specialized custom deployments.
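As an illustration of the database-backed approach, the sketch below stores table pointers in a relational table and performs the pointer swap as a conditional `UPDATE`. It uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL or MySQL. The table name `table_pointers`, its columns, and the helper functions are assumptions made for this example, not the schema of Iceberg's actual JDBC catalog.

```python
import sqlite3


def open_catalog() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical pointer table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE table_pointers (
            namespace         TEXT NOT NULL,
            name              TEXT NOT NULL,
            metadata_location TEXT NOT NULL,
            PRIMARY KEY (namespace, name)
        )
        """
    )
    return conn


def create_table(conn: sqlite3.Connection, namespace: str, name: str, location: str) -> None:
    """Register a new table and its first metadata file."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO table_pointers VALUES (?, ?, ?)",
            (namespace, name, location),
        )


def swap_pointer(
    conn: sqlite3.Connection, namespace: str, name: str, expected: str, new: str
) -> bool:
    """Compare-and-swap: the row is updated only if it still holds `expected`."""
    with conn:
        cursor = conn.execute(
            "UPDATE table_pointers SET metadata_location = ? "
            "WHERE namespace = ? AND name = ? AND metadata_location = ?",
            (new, namespace, name, expected),
        )
    # Exactly one row changes when the expected pointer matched.
    return cursor.rowcount == 1


conn = open_catalog()
create_table(conn, "marketing", "campaign_results", "v1.metadata.json")
first = swap_pointer(conn, "marketing", "campaign_results", "v1.metadata.json", "v2.metadata.json")
stale = swap_pointer(conn, "marketing", "campaign_results", "v1.metadata.json", "v3.metadata.json")
```

Here `first` succeeds and `stale` is rejected, because the second writer still expects the old pointer. The database's own transactional guarantees supply the atomicity that object storage lacks.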
