Streaming data pipelines process data continuously as events arrive, rather than accumulating data and processing it in discrete batch windows. In the Iceberg lakehouse context, streaming pipelines use Apache Flink or Apache Spark Structured Streaming to consume events from Kafka, apply transformations, and write results to Iceberg tables in near-real-time, enabling analytics that are minutes (rather than hours) behind operational systems.
Micro-Batch vs. True Streaming
- Micro-Batch (Spark Structured Streaming): Processes events in small batches fired on a configurable trigger interval (commonly every 30 seconds to 5 minutes). Simpler to reason about and operate than true streaming. Each micro-batch that writes data typically produces one Iceberg commit, so the commit interval follows the trigger interval. Well-suited for lakehouses where "near-real-time" means minutes, not seconds.
- True Streaming (Apache Flink): Processes each event individually, with per-event processing latency in milliseconds. Flink's Iceberg sink buffers writes and commits a snapshot when a Flink checkpoint completes, so the checkpoint interval (typically 1-5 minutes) sets the commit interval. Data becomes queryable in Iceberg only at commit granularity, even though each event is processed within milliseconds. Required for use cases where seconds of latency matter in the processing path (fraud detection, real-time personalization).
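To make the link between commit interval and data freshness concrete, here is a minimal pure-Python sketch. It does not use the Spark or Flink APIs. The function names (`group_into_commits`, `worst_case_staleness`) and the sample timestamps are hypothetical, and the model assumes every commit lands exactly on an interval boundary, which real engines only approximate.

```python
import math
from collections import defaultdict


def group_into_commits(event_times_s, commit_interval_s):
    """Map each commit time (seconds) to the event times it makes visible.

    Assumption: an event arriving at time t becomes queryable at the next
    interval boundary strictly after t.
    """
    commits = defaultdict(list)
    for t in event_times_s:
        commit_time = (math.floor(t / commit_interval_s) + 1) * commit_interval_s
        commits[commit_time].append(t)
    return dict(sorted(commits.items()))


def worst_case_staleness(commits):
    """Longest time any event waited between arrival and its commit."""
    return max(commit_time - min(events) for commit_time, events in commits.items())


# Five events, committed on a 60-second interval.
events = [0.5, 12.0, 59.0, 61.0, 130.0]
commits = group_into_commits(events, commit_interval_s=60)
staleness = worst_case_staleness(commits)
```

In this model the staleness of any event is bounded by the commit interval, regardless of how quickly the engine processes the event internally. Shortening the interval improves freshness but produces more commits, and therefore more files.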
Streaming Write Challenges with Iceberg
High-frequency streaming writes create challenges for Iceberg: each commit adds new data files, usually far smaller than the table's target file size, so many small Parquet files accumulate quickly and degrade query performance. Best practice is to let the streaming engine write frequently for freshness and to run a separate compaction job on a schedule, using Iceberg's RewriteDataFiles action (exposed in Spark as the rewrite_data_files procedure), to merge the small files into right-sized Parquet files. This separates the freshness concern (handled by the streaming writer) from the query performance concern (handled by compaction).
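A rough sketch of why small files pile up, and of how compaction groups them, is given below in plain Python. It is an illustration under stated assumptions, not Iceberg's actual planning code: the helper names are hypothetical, the file-count estimate assumes every writer task emits one file per touched partition on every commit, and the grouping is a simplified first-fit-decreasing bin-pack against an assumed 512 MB target.

```python
def estimate_files_per_day(commit_interval_s, writer_tasks, partitions_per_commit):
    """Estimate data files produced per day.

    Assumption: each writer task emits one file per touched partition
    on every commit.
    """
    commits_per_day = 86_400 // commit_interval_s
    return commits_per_day * writer_tasks * partitions_per_commit


def plan_compaction(file_sizes_mb, target_mb=512):
    """Group small files into rewrite groups of at most target_mb each.

    Simplified first-fit-decreasing bin-packing; files already at or above
    the target are skipped, and single-file groups are dropped because
    rewriting them would merge nothing.
    """
    groups = []
    for size in sorted(file_sizes_mb, reverse=True):
        if size >= target_mb:
            continue  # already right-sized
        for group in groups:
            if sum(group) + size <= target_mb:
                group.append(size)
                break
        else:
            groups.append([size])
    return [group for group in groups if len(group) > 1]


# One-minute commits, 4 writer tasks, 1 partition touched per commit.
files_per_day = estimate_files_per_day(60, writer_tasks=4, partitions_per_commit=1)

# One hundred 8 MB files collapse into two rewrite groups.
plan = plan_compaction([8] * 100)
```

Under these assumptions a single table gains several thousand files per day, which is why compaction is scheduled independently of the writer. The streaming job keeps its short commit interval, and the compaction job decides how many files to merge per run.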

