Apache Flink is an open-source, unified stream and batch processing framework. Whereas Apache Spark's Structured Streaming processes a stream as a series of micro-batches by default (a batch-oriented model), Flink was designed from the ground up as a stateful streaming engine in which each event is processed individually as it arrives. As of 2026, the combination of Apache Flink and Apache Iceberg has matured into a widely adopted architecture for building real-time streaming data lakehouses.
Flink's Iceberg Integration
The Flink-Iceberg connector enables Flink jobs to write directly to Iceberg tables on object storage. Key aspects of this integration include:
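- Exactly-Once Guarantees: The Flink sink commits a new Iceberg snapshot only when a Flink checkpoint completes, so snapshot boundaries line up with checkpoint boundaries. If a job fails and recovers from an earlier checkpoint, data files written after that checkpoint were never committed and therefore never became visible to readers; Flink replays the corresponding events from the source. As a result, each event is reflected in the table exactly once, neither lost nor duplicated, even through job failures (provided the source can be replayed).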
- CDC and Upserts: Using Iceberg's row-level delete capabilities (introduced in the v2 table spec), Flink can apply Change Data Capture (CDC) streams and upserts by writing position and equality delete files alongside data files. Other engines apply these delete files at read time, so read cost grows with the number of outstanding delete files until they are compacted away.
- Dynamic Table Routing: Advanced Flink pipelines use the Dynamic Iceberg Sink to automatically route events from a single source to multiple Iceberg tables, handling schema evolution and new table creation without job restarts.
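The exactly-once behavior described in the first bullet can be illustrated with a toy model. This is a minimal sketch in plain Python, not the connector's API: `ToyTable`, `ToyCommitter`, and `run_scenario` are hypothetical names, and the sketch assumes that each commit records its checkpoint ID so that a commit replayed after recovery can be recognized and skipped.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Stand-in for an Iceberg table: an ordered list of committed snapshots."""
    snapshots: list = field(default_factory=list)  # entries: (checkpoint_id, [files])

    def max_committed_checkpoint(self) -> int:
        return max((cp for cp, _ in self.snapshots), default=-1)

    def visible_files(self) -> list:
        return [f for _, files in self.snapshots for f in files]


@dataclass
class ToyCommitter:
    """Stages files per checkpoint; commits them once the checkpoint completes."""
    table: ToyTable
    staged: dict = field(default_factory=dict)  # checkpoint_id -> [files]

    def write(self, checkpoint_id: int, data_file: str) -> None:
        # Staged files exist on storage but are not yet visible to readers.
        self.staged.setdefault(checkpoint_id, []).append(data_file)

    def notify_checkpoint_complete(self, checkpoint_id: int) -> None:
        # Commit staged checkpoints up to this one, skipping any that the
        # table already recorded (this makes a replayed commit a no-op).
        already = self.table.max_committed_checkpoint()
        for cp in sorted(self.staged):
            if already < cp <= checkpoint_id:
                self.table.snapshots.append((cp, self.staged[cp]))
        self.staged = {cp: f for cp, f in self.staged.items() if cp > checkpoint_id}


def run_scenario() -> list:
    table = ToyTable()
    committer = ToyCommitter(table)

    committer.write(1, "a.parquet")
    committer.notify_checkpoint_complete(1)

    committer.write(2, "b.parquet")
    restored_state = dict(committer.staged)  # what checkpoint 2 would persist
    committer.notify_checkpoint_complete(2)

    committer.write(3, "c.parquet")  # checkpoint 3 never completes: job crashes

    # Recovery from checkpoint 2: staged state is restored, commit is retried.
    recovered = ToyCommitter(table, staged=restored_state)
    recovered.notify_checkpoint_complete(2)  # skipped, already committed
    recovered.write(3, "c-retry.parquet")    # source replays events after checkpoint 2
    recovered.notify_checkpoint_complete(3)
    return table.visible_files()


if __name__ == "__main__":
    print(run_scenario())  # ['a.parquet', 'b.parquet', 'c-retry.parquet']
```

In this model the file `c.parquet` is left behind on storage but never appears in a snapshot, which mirrors why uncommitted files from a failed attempt are invisible to readers and why orphan-file cleanup is a separate maintenance concern.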
The Streaming-Compaction Cycle
A critical operational reality of Flink-to-Iceberg streaming is the small file problem. Every checkpoint commit (typically every 1 to 5 minutes) adds new Parquet files to the table, and each writer task that received data contributes at least one. Production deployments should therefore pair the Flink streaming job with a background compaction service that continuously merges these small files into larger files close to the table's target file size. The checkpoint interval itself is a trade-off: committing every 1 to 5 minutes produces 60 to 300 times fewer commits than committing every second, while still providing near-real-time data availability for most dashboard use cases.
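The arithmetic behind that trade-off, and the basic idea of compaction planning, can be sketched as follows. This is an illustrative sketch only: `commits_per_day`, `files_per_day`, and `plan_compaction` are hypothetical helpers, the one-file-per-writer-per-checkpoint figure is a simplifying assumption, and the 512 MB target is assumed here to match Iceberg's default target file size. Real compaction services use more elaborate planning than this greedy grouping.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def commits_per_day(checkpoint_interval_s: int) -> int:
    """Number of snapshot commits per day at a fixed checkpoint interval."""
    return SECONDS_PER_DAY // checkpoint_interval_s


def files_per_day(checkpoint_interval_s: int, writer_parallelism: int) -> int:
    """Lower bound on new data files per day.

    Assumption: every writer subtask emits exactly one file per checkpoint.
    Partitioned tables usually produce more.
    """
    return commits_per_day(checkpoint_interval_s) * writer_parallelism


def plan_compaction(file_sizes_mb, target_mb=512):
    """Greedily group small files into rewrite groups of at most target_mb.

    Files already at or above the target are left alone, and a group is only
    emitted if it merges at least two files.
    """
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        if size >= target_mb:
            continue  # already large enough, nothing to gain from rewriting
        if current and current_size + size > target_mb:
            if len(current) > 1:
                groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if len(current) > 1:
        groups.append(current)
    return groups


if __name__ == "__main__":
    for interval in (1, 60, 300):
        print(interval, commits_per_day(interval), files_per_day(interval, 4))
    # 1 86400 345600
    # 60 1440 5760
    # 300 288 1152
    print(plan_compaction([100] * 6 + [600]))  # [[100, 100, 100, 100, 100]]
```

Even at a 5-minute interval with four writers, this lower bound is over a thousand new files per day, which is why the streaming job and the compaction service are best treated as a single operational unit.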

