Change Data Capture (CDC) is a design pattern for identifying and tracking changes (inserts, updates, deletes) made to a source database and delivering those changes as a continuous stream to downstream consumers. In the lakehouse context, CDC is the primary mechanism for keeping Iceberg tables synchronized with operational OLTP databases (PostgreSQL, MySQL, Oracle, SQL Server) in near-real-time, without expensive full table extracts.

How CDC Works

The most reliable CDC implementations read from the database's transaction log (the WAL in PostgreSQL, the binary log in MySQL) rather than polling the tables with queries. The transaction log contains an ordered record of every committed change, including the operation type (INSERT, UPDATE, DELETE) and, when the database is configured to log full row images, the before and after values for updates.
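To make the difference between log reading and polling concrete, here is a minimal Python sketch. It does not talk to a real database: the `log` list, the `replay` function, and the `snapshot_diff` function are hypothetical stand-ins for a transaction log, a log-based consumer, and a polling query that compares two snapshots.

```python
from typing import Dict, List, Optional, Tuple

# One log record: (operation, primary key, row image after the change).
# The row image is None for a DELETE. This shape is an assumption for the sketch.
LogRecord = Tuple[str, int, Optional[dict]]


def replay(log: List[LogRecord], state: Optional[Dict[int, dict]] = None) -> Dict[int, dict]:
    """Apply committed changes in log order and return the resulting table state."""
    table = dict(state or {})
    for op, key, row in log:
        if op == "DELETE":
            table.pop(key, None)
        else:
            # INSERT and UPDATE both leave the new row image in place.
            table[key] = row
    return table


def snapshot_diff(old: Dict[int, dict], new: Dict[int, dict]) -> List[Tuple[str, int]]:
    """What a polling job can infer by comparing two table snapshots."""
    changes = []
    for key in sorted(old.keys() | new.keys()):
        if key not in old:
            changes.append(("INSERT", key))
        elif key not in new:
            changes.append(("DELETE", key))
        elif old[key] != new[key]:
            changes.append(("UPDATE", key))
    return changes


log: List[LogRecord] = [
    ("INSERT", 1, {"status": "new"}),
    ("UPDATE", 1, {"status": "paid"}),
    ("INSERT", 2, {"status": "new"}),
    ("DELETE", 2, None),
    ("UPDATE", 1, {"status": "shipped"}),
]

before: Dict[int, dict] = {}
after = replay(log, before)
polled = snapshot_diff(before, after)

print(len(log), "changes in the log;", len(polled), "visible to the poller")
```

In this toy run the log carries five changes, while the poller only sees that row 1 now exists. The intermediate `paid` status and the short-lived row 2 never show up in a snapshot comparison, which is the gap that log-based CDC closes.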

Debezium, the dominant open-source CDC tool, connects to the database transaction log and publishes change events to Apache Kafka topics, commonly serialized as JSON or Avro. Each event represents a single row change with full context: the source table, the operation type, a timestamp, and the row's values before and after the change. A downstream Apache Flink or Spark job consumes these Kafka events and applies them to Iceberg tables with upsert and delete semantics, for example through MERGE operations in Spark SQL.
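As an illustration of what a consumer does with one of these events, the sketch below parses a JSON-encoded Debezium change event in Python. The envelope fields `before`, `after`, `source`, `op`, and `ts_ms` follow Debezium's documented event structure; the `RowChange` type, the `parse_change_event` helper, and the sample `orders` data are hypothetical names introduced for this example.

```python
import json
from typing import Any, Dict, NamedTuple, Optional

# Debezium operation codes: c = create, r = snapshot read, u = update, d = delete.
# Other codes (such as truncate) are not handled in this sketch and raise KeyError.
OP_NAMES = {"c": "INSERT", "r": "INSERT", "u": "UPDATE", "d": "DELETE"}


class RowChange(NamedTuple):
    table: str
    op: str
    ts_ms: int
    before: Optional[Dict[str, Any]]
    after: Optional[Dict[str, Any]]


def parse_change_event(raw: str) -> RowChange:
    """Turn one JSON change event into a RowChange.

    Tombstone messages (null Kafka values) are assumed to be filtered out
    before this function is called.
    """
    event = json.loads(raw)
    # With schemas enabled on the JSON converter the envelope sits under
    # "payload"; without schemas the envelope is the top-level object.
    payload = event.get("payload", event)
    return RowChange(
        table=payload["source"]["table"],
        op=OP_NAMES[payload["op"]],
        ts_ms=payload["ts_ms"],
        before=payload.get("before"),
        after=payload.get("after"),
    )


# Hypothetical sample event for an update to an "orders" row.
sample = json.dumps({
    "payload": {
        "before": {"id": 7, "status": "new"},
        "after": {"id": 7, "status": "paid"},
        "source": {"db": "shop", "table": "orders"},
        "op": "u",
        "ts_ms": 1700000000000,
    }
})

change = parse_change_event(sample)
print(change.table, change.op, change.after)
```

A real pipeline would read `raw` from a Kafka consumer and would likely use a schema registry for Avro payloads; both are left out here to keep the sketch self-contained.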

CDC and Apache Iceberg MERGE

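Iceberg's ACID MERGE INTO statement is the natural target for CDC events. A job consuming Debezium events (in Spark SQL, for example) can express the upsert logic in a single statement: a change whose key matches an existing row either updates that row or, for a delete event, removes it, and a change with no matching row is inserted.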
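The original statement is not reproduced here, so what follows is a reconstruction under assumptions rather than the author's exact code. This Python sketch builds a Spark SQL MERGE INTO statement of the shape Iceberg's Spark integration accepts. The table `lake.db.orders`, the staged view `order_changes`, its `op` column holding the Debezium operation code, and the `build_merge_sql` helper are all hypothetical.

```python
from typing import Sequence


def build_merge_sql(
    target: str,
    source: str,
    key_cols: Sequence[str],
    data_cols: Sequence[str],
) -> str:
    """Build a MERGE INTO statement that applies staged CDC rows to a target table.

    Assumes the source has one row per key, the target's columns, and an
    extra `op` column where 'd' marks a delete.
    """
    all_cols = list(key_cols) + list(data_cols)
    on_clause = " AND ".join(f"t.{c} = s.{c}" for c in key_cols)
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in data_cols)
    insert_cols = ", ".join(all_cols)
    insert_vals = ", ".join(f"s.{c}" for c in all_cols)
    return "\n".join([
        f"MERGE INTO {target} t",
        f"USING {source} s",
        f"ON {on_clause}",
        # Delete events remove the matching row.
        "WHEN MATCHED AND s.op = 'd' THEN DELETE",
        # Any other matching event overwrites the non-key columns.
        f"WHEN MATCHED THEN UPDATE SET {set_clause}",
        # Unmatched non-delete events become new rows.
        f"WHEN NOT MATCHED AND s.op != 'd' THEN INSERT ({insert_cols}) VALUES ({insert_vals})",
    ])


sql = build_merge_sql(
    target="lake.db.orders",
    source="order_changes",
    key_cols=["id"],
    data_cols=["status", "amount"],
)
print(sql)
```

In a Spark session with the Iceberg SQL extensions enabled, the resulting string would be passed to `spark.sql(sql)`. Building the statement from column lists is only a convenience for the sketch; a hand-written statement works just as well.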

Iceberg's support for row-level delete files (introduced in the v2 spec) and MERGE INTO means CDC pipelines can maintain a closely synchronized replica of a production database in the lakehouse, enabling analytical queries on near-current operational data without adding query load to the source database.
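One practical detail when staging events for MERGE INTO: Iceberg's Spark MERGE raises an error if more than one source row matches the same target row, so a micro-batch that contains several changes to one key is usually collapsed to the latest change per key first. A minimal Python sketch of that step follows, assuming events arrive in log order; the `latest_per_key` helper and the event shape are hypothetical.

```python
from typing import Dict, Iterable, List


def latest_per_key(events: Iterable[dict], key: str = "id") -> List[dict]:
    """Keep only the last event seen for each key.

    Assumes `events` is already in transaction-log order, so the last event
    for a key reflects that row's final state within the batch.
    """
    latest: Dict[object, dict] = {}
    for event in events:
        latest[event[key]] = event  # later events overwrite earlier ones
    return list(latest.values())


batch = [
    {"id": 1, "op": "c", "status": "new"},
    {"id": 2, "op": "c", "status": "new"},
    {"id": 1, "op": "u", "status": "paid"},
    {"id": 2, "op": "d", "status": None},
]

collapsed = latest_per_key(batch)
print(collapsed)
```

Here the four events collapse to two: the update for row 1 and the delete for row 2. In a Spark job the same effect is typically achieved with a window function over the key ordered by log position, which this sketch does not show.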
