Change Data Capture (CDC)

Change Data Capture (CDC) is a design pattern for identifying and tracking changes (inserts, updates, deletes) made to a source database and delivering those changes as a continuous stream to downstream consumers. In the lakehouse context, CDC is the primary mechanism for keeping Iceberg tables synchronized with operational OLTP databases (PostgreSQL, MySQL, Oracle, SQL Server) in near-real-time, without expensive full table extracts.

How CDC Works

The most reliable CDC implementations read from the database's transaction log (the write-ahead log, or WAL, in PostgreSQL; the binary log in MySQL) rather than polling the tables with queries. The transaction log contains an ordered record of every committed change, including the operation type (INSERT, UPDATE, DELETE) and, depending on how the source database is configured, the before and after values for updates.

Debezium, the dominant open-source CDC tool, connects to the database transaction log and publishes change events to Apache Kafka topics, by default as JSON messages. Each event represents a single row change with full context: the source table name, operation type, timestamp, and the row's values before and after the change. A downstream Apache Flink or Spark job consumes these Kafka events and applies them to Iceberg tables with upsert and delete semantics.
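
To make the event shape concrete, here is a minimal Python sketch of applying Debezium-style events to an in-memory replica. The envelope fields (op, before, after, source, ts_ms) follow Debezium's documented event format; the function name apply_change_event, the id key column, and the sample customers rows are assumptions made for illustration, and a real pipeline would write to Iceberg rather than a dictionary.

```python
import json


def apply_change_event(replica, raw_event, key_field="id"):
    """Apply one Debezium-style change event to a dict keyed by primary key.

    Hypothetical helper for illustration; not part of Debezium or Iceberg.
    """
    event = json.loads(raw_event)
    # With JSON schemas enabled, the envelope is nested under "payload".
    payload = event.get("payload", event)
    op = payload["op"]  # "c" = create, "u" = update, "d" = delete, "r" = snapshot read
    if op in ("c", "u", "r"):
        row = payload["after"]
        replica[row[key_field]] = row  # upsert: insert or overwrite by key
    elif op == "d":
        row = payload["before"]  # deletes carry the old row in "before"
        replica.pop(row[key_field], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")
    return replica


# Sample events for a hypothetical "customers" table, in commit order.
events = [
    json.dumps({"op": "c", "before": None,
                "after": {"id": 1, "name": "Ada"},
                "source": {"table": "customers"}, "ts_ms": 1}),
    json.dumps({"op": "u", "before": {"id": 1, "name": "Ada"},
                "after": {"id": 1, "name": "Ada L."},
                "source": {"table": "customers"}, "ts_ms": 2}),
    json.dumps({"op": "c", "before": None,
                "after": {"id": 2, "name": "Bob"},
                "source": {"table": "customers"}, "ts_ms": 3}),
    json.dumps({"op": "d", "before": {"id": 2, "name": "Bob"},
                "after": None,
                "source": {"table": "customers"}, "ts_ms": 4}),
]

replica = {}
for raw in events:
    apply_change_event(replica, raw)
```

After the four events are applied in order, the replica holds only the updated row for customer 1, because customer 2 was inserted and then deleted.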

CDC and Apache Iceberg MERGE

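Iceberg's ACID MERGE INTO statement is the natural target for CDC events. MERGE INTO is available through Spark SQL; Flink's Iceberg sink reaches the same result through its upsert mode rather than a MERGE statement. A Spark job consuming Debezium events can express the upsert logic as: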

When the CDC event is an INSERT or UPDATE: MERGE INTO iceberg_table WHEN MATCHED THEN UPDATE, WHEN NOT MATCHED THEN INSERT
When the CDC event is a DELETE: MERGE INTO iceberg_table WHEN MATCHED THEN DELETE
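
The two cases above can be folded into a single statement. The sketch below assumes Spark SQL with the Iceberg extensions enabled; the table lake.db.customers, the staged view customer_changes, and the id and name columns are hypothetical names chosen for illustration. It also collapses each batch to the latest change per key first, since Iceberg's MERGE raises an error when more than one source row matches the same target row.

```python
def latest_change_per_key(events, key_field="id"):
    """Keep only the last change seen for each key.

    Assumes events arrive in commit order per key, as they do within
    a single Kafka partition. Hypothetical helper for illustration.
    """
    latest = {}
    for ev in events:
        # Deletes carry the row in "before"; inserts and updates in "after".
        row = ev["before"] if ev["op"] == "d" else ev["after"]
        latest[row[key_field]] = {"op": ev["op"], **row}
    return list(latest.values())


# One MERGE covering insert, update, and delete. Table, view, and column
# names are assumptions; the staged view holds the collapsed changes.
MERGE_SQL = """
MERGE INTO lake.db.customers AS t
USING customer_changes AS s
ON t.id = s.id
WHEN MATCHED AND s.op = 'd' THEN DELETE
WHEN MATCHED THEN UPDATE SET t.name = s.name
WHEN NOT MATCHED AND s.op <> 'd' THEN INSERT (id, name) VALUES (s.id, s.name)
"""

batch = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "Ada"}},
    {"op": "u", "before": {"id": 1, "name": "Ada"}, "after": {"id": 1, "name": "Ada L."}},
    {"op": "c", "before": None, "after": {"id": 2, "name": "Bob"}},
    {"op": "d", "before": {"id": 2, "name": "Bob"}, "after": None},
]
changes = latest_change_per_key(batch)
```

In a Spark job, the collapsed changes would be registered as the customer_changes view and the statement run with spark.sql(MERGE_SQL). The extra condition on the NOT MATCHED clause keeps a row that was created and deleted within the same batch from being inserted.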

Iceberg's support for row-level delete files (introduced in the v2 spec) and MERGE INTO means CDC pipelines can maintain a closely synchronized replica of a production database in the lakehouse, lagging the source only by the pipeline's end-to-end latency. This enables analytical queries on nearly current operational data without adding query load to the source database.
