Apache DataFusion is an open-source, embeddable query execution framework written in Rust. Unlike monolithic query engines such as Spark or Trino that run as separate services you connect to, DataFusion is a library that developers embed directly into their own applications. It provides a complete SQL query planning and execution pipeline that any program can use internally, built on Apache Arrow's in-memory columnar format.
Why DataFusion Matters
DataFusion graduated as an Apache Software Foundation Top-Level Project and has rapidly grown into a foundational building block used by dozens of prominent open-source and commercial data tools. Its key advantages are:
- JVM-Free Performance: Written entirely in Rust, DataFusion eliminates JVM garbage collection pauses and achieves excellent single-machine performance. Benchmarks (including ClickBench) show it achieves top-tier performance for Parquet file querying.
- Extensibility: DataFusion exposes clean interfaces for adding custom functions, data sources, logical plan nodes, and physical execution strategies. This makes it the preferred foundation for building specialized analytical databases and data frameworks.
- Arrow Native: Because DataFusion uses Apache Arrow as its in-memory format natively, data never needs to be serialized or deserialized when integrating with other Arrow-native tools like Polars or PyArrow.
DataFusion in the Lakehouse Ecosystem
DataFusion plays two important roles in the 2026 lakehouse. First, as a standalone engine, Python developers can use the datafusion Python bindings to query Iceberg tables and Parquet files locally with sub-second performance on datasets that fit within a single machine's memory. Second, as an embedded component, DataFusion powers Apache Spark's Comet acceleration plugin and is embedded within tools like InfluxDB IOx and Delta Lake's native Rust library, providing the query execution layer inside systems that need it.

