Streaming data pipelines process events continuously as they arrive, rather than accumulating data and processing it in discrete batch windows. In the Iceberg lakehouse context, streaming pipelines use Apache Flink or Apache Spark Structured Streaming to consume events from Kafka, apply transformations, and write results to Iceberg tables in near real time, so that analytics lag operational systems by minutes rather than hours.

Micro-Batch vs. True Streaming

Streaming Write Challenges with Iceberg

High-frequency streaming writes create a challenge for Iceberg: every commit adds new data files, so many small Parquet files accumulate quickly. Query performance degrades because each query must plan over and open far more files than the data volume warrants. A common practice is to let the streaming engine commit frequently for freshness while a separate compaction job, using Iceberg's RewriteDataFiles action (exposed in Spark as the rewrite_data_files procedure), runs on a schedule to merge the small files into right-sized Parquet files. This separates the freshness concern (handled by the streaming writer) from the query performance concern (handled by compaction).
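To make the small-file problem concrete, the sketch below estimates how many files a streaming writer could produce per day and then groups small files into compaction bins near a target size. It is a simplified illustration, not Iceberg's actual rewrite planner: the function names `files_per_day` and `plan_compaction`, the assumption of one file per writer task per commit, and the 8 MB and 128 MB sizes are all hypothetical choices made for the example.

```python
def files_per_day(commit_interval_s: int, writer_tasks: int) -> int:
    """Rough upper bound, assuming each writer task emits one file per commit."""
    commits_per_day = 86_400 // commit_interval_s
    return commits_per_day * writer_tasks


def plan_compaction(file_sizes: list[int], target_size: int) -> list[list[int]]:
    """Group files sequentially (next-fit) so each group stays within target_size.

    A file larger than target_size ends up in a group of its own.
    This is a stand-in for a bin-pack rewrite plan, not Iceberg's planner.
    """
    groups: list[list[int]] = []
    current: list[int] = []
    current_bytes = 0
    for size in file_sizes:
        # Close the current group if adding this file would exceed the target.
        if current and current_bytes + size > target_size:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        groups.append(current)
    return groups


MB = 1024 * 1024

# A commit every 60 seconds across 4 writer tasks (hypothetical numbers).
print(files_per_day(commit_interval_s=60, writer_tasks=4))  # 5760

# 100 small files of 8 MB each, merged toward a 128 MB target.
groups = plan_compaction([8 * MB] * 100, target_size=128 * MB)
print(len(groups))  # 7
```

Under these assumptions, a single day of one-minute commits yields thousands of files, while compaction reduces 100 small files to 7 rewritten ones. In a real deployment the grouping is done by Iceberg itself when the compaction job runs; the sketch only shows why the scheduled rewrite pays off.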
