Apache DataFusion is an open-source, embeddable query execution framework written in Rust. Unlike monolithic query engines such as Spark or Trino that run as separate services you connect to, DataFusion is a library that developers embed directly into their own applications. It provides a complete SQL query planning and execution pipeline that any program can use internally, built on Apache Arrow's in-memory columnar format.

Why DataFusion Matters

DataFusion graduated as an Apache Software Foundation Top-Level Project and has rapidly grown into a foundational building block used by dozens of prominent open-source and commercial data tools. Its key advantages are:

DataFusion in the Lakehouse Ecosystem

DataFusion plays two important roles in the 2026 lakehouse. First, as a standalone engine, Python developers can use the datafusion Python bindings to query Iceberg tables and Parquet files locally with sub-second performance on datasets that fit within a single machine's memory. Second, as an embedded component, DataFusion powers Apache Spark's Comet acceleration plugin and is embedded within tools like InfluxDB IOx and Delta Lake's native Rust library, providing the query execution layer inside systems that need it.

Master the Agentic Lakehouse

Architecting an Apache Iceberg Lakehouse

Architecting an Apache Iceberg Lakehouse

Buy on Manning
The AI Lakehouse

The AI Lakehouse

Buy on Amazon