Streaming vs Batch Processing

Every data pipeline ultimately makes a choice between two processing models: process events continuously as they arrive (streaming), or accumulate events over a defined time window and process them all at once (batch). Both models are valid. The right choice depends on the latency requirements, data volume, transformation complexity, and cost tolerance of the specific use case.

Batch Processing

Batch processing accumulates data over a time window (hourly, daily, weekly) and processes the entire accumulated dataset as a single job. Apache Spark is the dominant batch processing engine in the lakehouse ecosystem. Batch jobs are conceptually simpler than streaming pipelines: the input is a bounded, static dataset, transformations run to completion, and the output is written atomically. Because the input does not change and no partial output is ever published, a failed job can simply be rerun from the beginning, so error recovery is straightforward.
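The rerun-on-failure property can be illustrated with a minimal sketch in plain Python. This is not Spark code: the function name `run_batch_job`, the event fields `user` and `amount`, and the JSON output format are all hypothetical, chosen only to show a bounded input being processed to completion and published with an atomic rename.

```python
import json
import os
import tempfile
from collections import defaultdict


def run_batch_job(events, output_path):
    """Aggregate a bounded list of events and publish the result atomically."""
    totals = defaultdict(float)
    for event in events:  # bounded input: one full pass is enough
        totals[event["user"]] += event["amount"]

    # Write to a temporary file in the target directory, then rename it into
    # place. os.replace swaps the file in a single step when both paths are on
    # the same filesystem, so readers see the old output or the new one.
    out_dir = os.path.dirname(os.path.abspath(output_path))
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(dict(totals), f, sort_keys=True)
        os.replace(tmp_path, output_path)
    except BaseException:
        # A failed run leaves no partial output behind, so it is safe to rerun.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return dict(totals)


# Hypothetical sample input for one processing window.
sample_events = [
    {"user": "a", "amount": 1.0},
    {"user": "b", "amount": 2.0},
    {"user": "a", "amount": 3.0},
]
```

Running the job twice on the same input produces the same output file, which is what makes "restart from the beginning" a complete recovery strategy in this sketch.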

Batch is the right model for complex transformations that require access to the full dataset (sort-based operations, window functions over large time ranges, machine learning feature engineering), for large-volume historical backfills, and for workloads where hourly or daily data freshness is sufficient. Many gold-tier lakehouse tables serving BI dashboards and AI agent queries are populated by nightly batch jobs.

Streaming Processing

Streaming processes events as they arrive, maintaining continuous state across an unbounded input stream. Apache Flink is the leading stream processing engine for production lakehouse pipelines. Flink supports event-time processing (using the timestamp embedded in the event rather than the ingestion timestamp), windowed aggregations over sliding and tumbling time windows, and stateful joins between streams.
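As a rough illustration of event-time tumbling windows, the sketch below assigns each event to a window using the timestamp carried by the event itself, and keeps a running count as state. It is plain Python rather than Flink, and the class name `TumblingWindowCounter` and the `(event_time, key)` event shape are assumptions made for the example.

```python
from collections import defaultdict


class TumblingWindowCounter:
    """Keeps per-window counts as state, updated one event at a time."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.counts = defaultdict(int)  # (window_start, key) -> count

    def on_event(self, event_time, key):
        # The window is derived from the event's own timestamp (event time),
        # not from the moment the event happens to be processed.
        window_start = (event_time // self.window_seconds) * self.window_seconds
        self.counts[(window_start, key)] += 1
        return window_start


counter = TumblingWindowCounter(window_seconds=60)
# The event at t=59 arrives after the event at t=61 but still lands in the
# [0, 60) window, because assignment uses event time.
for event_time, key in [(0, "click"), (12, "click"), (61, "view"), (59, "click"), (65, "view")]:
    counter.on_event(event_time, key)
```

A sliding window would differ only in the assignment step: each event would be added to every window whose range covers its timestamp.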

Streaming is appropriate when data freshness requirements are tighter than a batch pipeline can satisfy: fraud detection, real-time recommendation updates, operational dashboards, and live quality monitoring. The cost is higher operational complexity: state management, watermark handling for out-of-order events, and checkpoint tuning require deeper expertise than batch job configuration.
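To make the watermark idea concrete, here is a simplified sketch of one common strategy: the watermark trails the largest event time seen so far by a fixed bound, a window is emitted once the watermark passes its end, and events for already-closed windows are counted as late. The class `WatermarkedWindowSum` and its parameters are hypothetical and stand in for what a stream engine does internally. Real engines offer more options, such as allowed lateness and side outputs.

```python
class WatermarkedWindowSum:
    """Sums values per tumbling window and closes windows using a watermark."""

    def __init__(self, window_seconds, max_out_of_orderness):
        self.window_seconds = window_seconds
        self.max_out_of_orderness = max_out_of_orderness
        self.open_windows = {}  # window_start -> running sum
        self.watermark = float("-inf")
        self.late_events = 0

    def on_event(self, event_time, value):
        """Process one event and return the list of windows it closed."""
        window_start = (event_time // self.window_seconds) * self.window_seconds
        if window_start + self.window_seconds <= self.watermark:
            # The window was already emitted, so this event is too late.
            self.late_events += 1
            return []
        self.open_windows[window_start] = self.open_windows.get(window_start, 0) + value
        # The watermark only moves forward, trailing the largest event time.
        self.watermark = max(self.watermark, event_time - self.max_out_of_orderness)
        return self._close_ready_windows()

    def _close_ready_windows(self):
        ready = sorted(
            start for start in self.open_windows
            if start + self.window_seconds <= self.watermark
        )
        return [(start, self.open_windows.pop(start)) for start in ready]


agg = WatermarkedWindowSum(window_seconds=10, max_out_of_orderness=5)
emitted = []
for event_time in [1, 12, 8, 16, 3, 27]:
    emitted.extend(agg.on_event(event_time, 1))
```

In this run the event at t=8 arrives out of order but is still counted, because the watermark has not yet passed the end of its window. The event at t=3 arrives after that window has been emitted and is dropped. Choosing the bound is the tuning trade-off: a larger value tolerates more disorder but delays results.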

Lambda and Kappa Architectures

The Lambda architecture combines both models: a batch layer periodically (typically nightly) recomputes historically correct results, and a streaming speed layer provides low-latency approximate results for recent data. Queries merge the output of both layers. The operational cost of maintaining two separate systems, with batch and streaming code for the same transformations, has made the Lambda architecture unpopular in practice.
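The duplication is easiest to see in a toy version. In the sketch below, a per-key count is implemented twice: once as a batch recomputation and once as an incremental update, with a query function that merges the two views. All names (`batch_layer`, `SpeedLayer`, `query`, the `cutoff` boundary) are hypothetical, and the sketch assumes the two layers split the data cleanly at the cutoff.

```python
from collections import Counter


def batch_layer(all_events, cutoff):
    """Batch code path: recompute exact counts for events before the cutoff."""
    return Counter(key for event_time, key in all_events if event_time < cutoff)


class SpeedLayer:
    """Streaming code path: the same count, maintained incrementally."""

    def __init__(self, cutoff):
        self.cutoff = cutoff
        self.counts = Counter()

    def on_event(self, event_time, key):
        # Only events not yet covered by the batch view are counted here.
        if event_time >= self.cutoff:
            self.counts[key] += 1


def query(batch_view, speed_view, key):
    """Serving layer: merge both views at read time."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)


events = [(1, "a"), (2, "b"), (5, "a"), (7, "a")]
batch_view = batch_layer(events, cutoff=5)
speed = SpeedLayer(cutoff=5)
for event_time, key in events:
    speed.on_event(event_time, key)
```

Even for a simple count, the logic exists in two places that must be kept in agreement. With real transformations the two implementations tend to drift apart, which is the maintenance cost described above.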

The Kappa architecture simplifies this by using streaming exclusively. Historical reprocessing is handled by replaying the event log through the streaming engine. Apache Iceberg supports streaming writes from Flink and batch writes from Spark on the same table, which enables a practical middle path: use streaming ingestion into bronze-tier tables and batch transformations for silver and gold tiers. Each tier gets the latency its workload needs, and each transformation is implemented only once, in the engine that serves its tier.
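The replay idea behind Kappa can be sketched with a single processing function that serves both live consumption and historical reprocessing. The in-memory list below stands in for a durable event log, and `apply_event` and `consume` are hypothetical names used only for illustration.

```python
def apply_event(state, event):
    """The single code path: fold one event into the running totals."""
    key, amount = event
    state[key] = state.get(key, 0) + amount
    return state


def consume(log, state=None, start_offset=0):
    """Process the log from start_offset; return the state and next offset."""
    state = {} if state is None else state
    for event in log[start_offset:]:
        apply_event(state, event)
    return state, len(log)


log = [("a", 1), ("b", 2)]
live_state, offset = consume(log)  # live processing so far

log.append(("a", 3))  # a new event arrives
live_state, offset = consume(log, live_state, offset)  # continue from the offset

# Historical reprocessing: replay the whole log through the same function.
replayed_state, _ = consume(log)
```

Because live processing and replay share one function, the replayed state matches the live state without a second implementation. The trade-off in practice is that the log must retain enough history to replay from.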
