Apache Kafka is the dominant distributed event streaming platform in enterprise data architectures, serving as the real-time transport layer between operational systems (applications, IoT devices, databases, CDC pipelines) and the data lakehouse. In modern lakehouse architectures, Kafka sits between event sources and Iceberg tables, providing durability, per-partition ordering guarantees, and replay capability for streaming data.
Kafka's Role in the Lakehouse
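Kafka functions as a high-throughput, fault-tolerant, append-only event log rather than a traditional message queue. Events (user clicks, sensor readings, order events, database change events from Debezium CDC) are published to Kafka topics, which are split into partitions and retained for a configurable period. Because each consumer group tracks its own offsets, multiple consumers can read the same topic at different speeds and with different processing logic, and one consumer reading a record does not remove it for the others.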
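The independent-consumer behavior can be illustrated with a small in-memory model. This is a toy sketch, not the Kafka client API: `ToyTopic`, its methods, the CRC32 key hashing, and the group names are all hypothetical, chosen only to show that each group keeps its own read position over the same retained records.

```python
import zlib
from collections import defaultdict


class ToyTopic:
    """Hypothetical in-memory stand-in for a topic: a fixed set of append-only partitions."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions = [[] for _ in range(num_partitions)]
        # Read position per (consumer group, partition); starts at 0 for any new group.
        self.offsets = defaultdict(int)

    def publish(self, key: str, value: str) -> int:
        # The same key always maps to the same partition, so per-key order is preserved.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        return partition

    def poll(self, group: str, max_records: int) -> list:
        # Reading only advances this group's offsets; records stay in the log.
        records = []
        for partition, log in enumerate(self.partitions):
            remaining = max_records - len(records)
            if remaining <= 0:
                break
            start = self.offsets[(group, partition)]
            batch = log[start:start + remaining]
            records.extend(batch)
            self.offsets[(group, partition)] = start + len(batch)
        return records


topic = ToyTopic(num_partitions=3)
for i in range(6):
    topic.publish(key=f"user-{i % 2}", value=f"click-{i}")

fast = topic.poll("analytics", max_records=100)  # reads everything at once
slow = topic.poll("audit", max_records=2)        # reads a small batch
rest = topic.poll("audit", max_records=100)      # catches up later, nothing lost
```

The `audit` group falls behind `analytics` and still receives every record on its next poll, because reading never deletes anything from the partitions.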
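For the Iceberg lakehouse, the key consumers are Apache Flink and Apache Spark Structured Streaming jobs that read from Kafka and write to Iceberg. Spark Structured Streaming commits once per micro-batch trigger, and Flink processes records continuously but commits to Iceberg when a checkpoint completes; in both cases a commit interval of 1-5 minutes is typical. This pattern provides the freshness needed for near-real-time analytics while maintaining Iceberg's ACID guarantees, because each commit becomes visible to readers as a single atomic snapshot.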
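The interval-based commit pattern can be sketched in a few lines. This is a simplified model under stated assumptions, not the Flink or Spark sink implementation: `ToyTable` and `MicroBatchWriter` are hypothetical names, time is passed in explicitly to keep the example deterministic, and storing the last source offset with each snapshot is one illustrative way to make replay after a restart safe.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Hypothetical stand-in for an Iceberg table: each commit adds one immutable snapshot."""
    snapshots: list = field(default_factory=list)

    def commit(self, rows: list, last_offset: int) -> None:
        # A single append models the atomic commit: readers see the whole batch or none of it.
        self.snapshots.append({"rows": tuple(rows), "last_offset": last_offset})

    def committed_offset(self) -> int:
        return self.snapshots[-1]["last_offset"] if self.snapshots else -1


class MicroBatchWriter:
    def __init__(self, table: ToyTable, interval_s: float) -> None:
        self.table = table
        self.interval_s = interval_s
        self.buffer = []
        self.last_offset = table.committed_offset()
        self.last_commit_at = 0.0

    def on_event(self, offset: int, value: str, now: float) -> None:
        if offset <= self.table.committed_offset():
            return  # already committed before a restart; skip the replayed record
        self.buffer.append(value)
        self.last_offset = offset
        if now - self.last_commit_at >= self.interval_s:
            self.flush(now)

    def flush(self, now: float) -> None:
        if self.buffer:
            self.table.commit(self.buffer, self.last_offset)
            self.buffer = []
        self.last_commit_at = now


table = ToyTable()
writer = MicroBatchWriter(table, interval_s=60.0)
for i in range(10):
    writer.on_event(offset=i, value=f"event-{i}", now=i * 30.0)

# Simulate a crash: the in-memory buffer is lost, and a new writer replays from Kafka.
restarted = MicroBatchWriter(table, interval_s=60.0)
for i in range(7, 10):
    restarted.on_event(offset=i, value=f"event-{i}", now=300.0)
restarted.flush(now=300.0)

all_rows = [row for snap in table.snapshots for row in snap["rows"]]
```

Because the buffered but uncommitted record is still available in Kafka, the restarted writer can re-read it, and the offset stored with the last snapshot lets it skip records that were already committed.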
Kafka Connect and Iceberg Sink
Kafka Connect is Kafka's framework for running source and sink connectors through configuration rather than custom code. The Iceberg Sink Connector for Kafka Connect (originally developed by Tabular and since contributed to the Apache Iceberg project) writes Kafka topics directly to Iceberg tables without custom Flink or Spark jobs. The connector maps Avro or JSON record schemas to Iceberg schemas, batches events into Parquet files, and commits Iceberg snapshots on a configurable interval. This low-code approach suits organizations that want Kafka-to-Iceberg pipelines without deep streaming expertise.
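As a rough illustration of what "low-code" means here, the sketch below builds a connector definition as JSON. The property names follow the Apache Iceberg sink connector documentation as best I can tell, but they should be checked against the connector version actually installed. The helper `iceberg_sink_config`, the connector name, topic names, table name, catalog URI, and the REST catalog type are all placeholder assumptions.

```python
import json


def iceberg_sink_config(name, topics, table, catalog_uri, commit_interval_ms=300_000):
    """Hypothetical helper: build a Kafka Connect connector definition for an Iceberg sink."""
    return {
        "name": name,
        "config": {
            # Class name of the Apache Iceberg sink; verify against your installed version.
            "connector.class": "org.apache.iceberg.connect.IcebergSinkConnector",
            "tasks.max": "2",
            "topics": ",".join(topics),
            "iceberg.tables": table,
            # Assumes a REST catalog; other catalog types need different properties.
            "iceberg.catalog.type": "rest",
            "iceberg.catalog.uri": catalog_uri,
            # How often buffered data files are committed as an Iceberg snapshot.
            "iceberg.control.commit.interval-ms": str(commit_interval_ms),
        },
    }


payload = iceberg_sink_config(
    name="orders-iceberg-sink",               # placeholder connector name
    topics=["orders", "clicks"],              # placeholder topics
    table="analytics.events",                 # placeholder Iceberg table
    catalog_uri="https://catalog.example.com",  # placeholder catalog endpoint
    commit_interval_ms=120_000,
)
body = json.dumps(payload, indent=2)
print(body)
```

A definition like this is typically submitted to a Kafka Connect worker by POSTing it to the worker's `/connectors` REST endpoint; no Flink or Spark job is involved.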

