Apache Arrow is an open-source, cross-language specification for a columnar in-memory data format. Co-founded by engineers at Dremio and other organizations, Arrow was created to solve a pervasive and expensive problem: every data tool had its own internal data representation. Passing data between pandas, Spark, a C++ query engine, and a database meant serializing and deserializing it at every boundary, and that conversion work consumed a large share of CPU time without doing anything useful to the data.

The Zero-Copy Promise

Arrow defines a universal memory layout for columnar data. When two Arrow-native tools share data within the same process or through shared memory, the receiving tool can read it directly from the memory the first tool allocated, without copying any bytes. For example, a Polars DataFrame built on Arrow buffers can be handed to a DataFusion query without copying or re-serializing the underlying column data. This "zero-copy" interoperability is the primary technical contribution Arrow makes to the data ecosystem.
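
The idea of sharing a buffer rather than copying it can be sketched with nothing but the Python standard library. This is not Arrow's API: it assumes a hypothetical "producer" and "consumer" living in the same process, and uses array and memoryview as stand-ins for an Arrow buffer and a reader of that buffer.

```python
from array import array

# Producer: allocates one contiguous buffer of 64-bit floats.
producer_values = array("d", [1.5, 2.5, 3.5, 4.5])

# Consumer: wraps the same memory in a view instead of copying it.
consumer_view = memoryview(producer_values)

# Slicing a memoryview yields another view over the same bytes, not a copy.
tail = consumer_view[2:]

# A write through the producer's handle is visible through both views,
# which shows that all three names refer to the same underlying memory.
producer_values[3] = 10.0
shared_total = sum(consumer_view)
tail_values = tail.tolist()

# By contrast, bytes(...) makes an independent copy that stops tracking the producer.
copied = bytes(consumer_view)
producer_values[0] = 100.0
copy_is_stale = memoryview(copied).cast("d")[0] == 1.5
```

The view costs the same to create whether the buffer holds four values or four billion, which is the property zero-copy sharing relies on. The explicit copy, in contrast, grows with the data and immediately drifts out of sync with its source.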

Arrow's Columnar Layout

Arrow stores data in a columnar format in memory. All values for a column are stored in contiguous memory arrays, typed by the column's data type. This layout is ideal for analytical operations (SELECT a, SUM(b) FROM ...) because the CPU can stream through a column's contiguous values with good cache locality and process them with SIMD vectorized instructions. If a query uses only a subset of columns, the buffers of the remaining columns are never read at all.
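
A rough sketch of what "columnar" means for a nullable 64-bit integer column follows. It is a simplified illustration in plain Python, not the Arrow library: the sample rows and the helper names build_int64_column, is_valid, and column_sum are hypothetical. It does mirror two features of the Arrow format: a contiguous value buffer per column, and a validity bitmap in which a set bit (least-significant bit first) marks a non-null slot.

```python
from array import array

# Row-oriented input, as an application might hold it (hypothetical sample data).
rows = [
    {"a": 1, "b": 10},
    {"a": 2, "b": None},  # null in column b
    {"a": 3, "b": 30},
    {"a": 4, "b": 40},
]

def build_int64_column(values):
    """Return (validity_bitmap, value_buffer) for a nullable int64 column."""
    validity = bytearray((len(values) + 7) // 8)  # one bit per slot
    data = array("q", [0] * len(values))          # contiguous 8-byte integers
    for i, v in enumerate(values):
        if v is not None:
            validity[i // 8] |= 1 << (i % 8)      # least-significant bit first
            data[i] = v
    return validity, data

def is_valid(validity, i):
    """True when slot i holds a non-null value."""
    return (validity[i // 8] >> (i % 8)) & 1 == 1

def column_sum(validity, data):
    """Sum the non-null slots of one column."""
    return sum(data[i] for i in range(len(data)) if is_valid(validity, i))

# Pivot the rows into one pair of buffers per column.
columns = {name: build_int64_column([r[name] for r in rows]) for name in ("a", "b")}

# SUM(b) reads only column b's two buffers; column a is never touched.
b_validity, b_values = columns["b"]
sum_b = column_sum(b_validity, b_values)
```

A real engine would replace the Python loop in column_sum with vectorized machine code over the same buffers. The sketch only shows why the layout makes that possible: the values of one column sit next to each other in memory, with nulls tracked separately.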

The Arrow Ecosystem

Arrow has spawned a rich family of complementary projects:

Arrow is the invisible substrate underlying the performance and interoperability of Dremio, DuckDB, Polars, DataFusion, pandas 2.0 (through its optional Arrow-backed data types), and dozens of other modern data tools.
