Apache Spark is one of the most widely adopted distributed processing frameworks in data engineering. Originally developed at UC Berkeley's AMPLab in 2009, Spark improved on Hadoop MapReduce by keeping intermediate results in memory rather than writing each step back to disk, which significantly speeds up iterative computations. In the modern lakehouse, Spark remains the dominant engine for heavy-duty ETL and machine learning pipelines.

Spark and Apache Iceberg

Apache Spark has the deepest and most mature integration with Apache Iceberg of any compute engine. The official Iceberg Spark runtime library provides stored procedures for table maintenance, such as CALL catalog.system.rewrite_data_files() and CALL catalog.system.expire_snapshots(); complete DML support (MERGE INTO, UPDATE, DELETE); and Structured Streaming writes with exactly-once guarantees. Spark is the primary interface for running Iceberg maintenance operations at scale.
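As a minimal sketch of how the maintenance procedures named above are usually invoked, the Python helpers below build the Spark settings that register an Iceberg catalog and the CALL statements for rewrite_data_files and expire_snapshots. The catalog name `lake`, the table `db.events`, the warehouse path, and the helper function names are all hypothetical; the Hadoop-type catalog is just one possible choice, picked here for simplicity.

```python
from datetime import datetime


def iceberg_session_conf(catalog: str, warehouse: str) -> dict:
    """Spark settings that register an Iceberg catalog.

    Assumption: a Hadoop-type catalog; REST, Hive, or other catalog
    types need different properties.
    """
    return {
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        f"spark.sql.catalog.{catalog}": "org.apache.iceberg.spark.SparkCatalog",
        f"spark.sql.catalog.{catalog}.type": "hadoop",
        f"spark.sql.catalog.{catalog}.warehouse": warehouse,
    }


def rewrite_data_files_sql(catalog: str, table: str) -> str:
    """CALL statement that compacts a table's data files."""
    return f"CALL {catalog}.system.rewrite_data_files(table => '{table}')"


def expire_snapshots_sql(catalog: str, table: str,
                         older_than: datetime, retain_last: int) -> str:
    """CALL statement that expires snapshots older than a cutoff,
    while always keeping the most recent `retain_last` snapshots."""
    ts = older_than.strftime("%Y-%m-%d %H:%M:%S.000")
    return (
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', "
        f"older_than => TIMESTAMP '{ts}', "
        f"retain_last => {retain_last})"
    )


# Hypothetical catalog and table names, for illustration only.
statements = [
    rewrite_data_files_sql("lake", "db.events"),
    expire_snapshots_sql("lake", "db.events", datetime(2025, 1, 1), 10),
]
for stmt in statements:
    print(stmt)

# With pyspark installed and the Iceberg Spark runtime jar on the classpath,
# the settings and statements would be applied roughly like this:
#   builder = SparkSession.builder
#   for key, value in iceberg_session_conf("lake", "/tmp/warehouse").items():
#       builder = builder.config(key, value)
#   spark = builder.getOrCreate()
#   for stmt in statements:
#       spark.sql(stmt)
```

Keeping the statements as plain strings makes them easy to schedule from an orchestrator; the cutoff timestamp and retention count should follow whatever snapshot retention policy the table actually needs.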

Spark's Role in the Multi-Engine Lakehouse

In an interoperable lakehouse, Spark handles the data engineering layer: processing raw bronze data through medallion layers, running scheduled machine learning training jobs, and performing large-scale data migrations. Interactive BI queries are typically served by engines tuned for lower latency and higher concurrency, such as Trino, Dremio, or DuckDB, which read the same Iceberg tables that Spark has written.
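One common shape for the bronze-to-silver step is an upsert with MERGE INTO, which Iceberg supports in Spark SQL. The sketch below only builds the statement; the table names `lake.bronze.orders` and `lake.silver.orders`, the key column `order_id`, and the helper name are hypothetical, and it assumes the source and target tables share the same columns so that `UPDATE SET *` and `INSERT *` apply.

```python
def merge_upsert_sql(target: str, source: str, keys: list) -> str:
    """Build a MERGE INTO statement that upserts `source` rows into `target`.

    Rows matching on every key column are updated; the rest are inserted.
    """
    if not keys:
        raise ValueError("at least one key column is required")
    on_clause = " AND ".join(f"t.{k} = s.{k}" for k in keys)
    return "\n".join([
        f"MERGE INTO {target} AS t",
        f"USING {source} AS s",
        f"ON {on_clause}",
        "WHEN MATCHED THEN UPDATE SET *",
        "WHEN NOT MATCHED THEN INSERT *",
    ])


# Hypothetical medallion tables, for illustration only.
stmt = merge_upsert_sql("lake.silver.orders", "lake.bronze.orders", ["order_id"])
print(stmt)

# In a Spark session configured with the Iceberg extensions:
#   spark.sql(stmt)
```

Because the result lands in an ordinary Iceberg table, the query engines mentioned above can read the silver table as soon as the merge commits.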

DataFusion Comet

A notable 2025/2026 development is Apache DataFusion Comet, a Spark plugin that offloads supported query operators from Spark's JVM-based execution engine to the high-performance, Rust-native DataFusion execution engine. Maturing through 2026, Comet can deliver substantial performance improvements for common Spark SQL workloads without requiring changes to existing Spark code; operators it does not yet support fall back to regular Spark execution.
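Because Comet is loaded as a plugin, enabling it is mostly a matter of configuration. The sketch below assembles the settings shown in Comet's installation guide at the time of writing and renders them as spark-submit arguments. Treat the exact keys as an assumption to verify against the Comet release in use; the helper names and the 4g off-heap size are hypothetical, and the Comet jar must also be supplied separately (for example via --jars).

```python
def comet_conf(offheap_size: str = "4g") -> dict:
    """Spark settings that load the Comet plugin.

    Assumption: keys follow the Comet installation guide and may
    differ between Comet releases.
    """
    return {
        # Load Comet as a Spark plugin.
        "spark.plugins": "org.apache.spark.CometPlugin",
        # Use Comet's shuffle manager so shuffles can run natively.
        "spark.shuffle.manager":
            "org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager",
        # Native execution allocates memory off the JVM heap.
        "spark.memory.offHeap.enabled": "true",
        "spark.memory.offHeap.size": offheap_size,
    }


def to_submit_args(conf: dict) -> list:
    """Render a settings dict as spark-submit `--conf key=value` arguments."""
    args = []
    for key, value in conf.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


print(" ".join(to_submit_args(comet_conf())))
```

Existing Spark SQL and DataFrame code is left unchanged under this setup; inspecting the physical plan with EXPLAIN is a reasonable way to check which operators actually ran natively.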
