Columnar Data Formats

A columnar data format is a file format that organizes data by storing all values for a given column contiguously, rather than storing all values for a given row together. This seemingly simple difference in physical layout has profound performance implications for analytical workloads.

Row-Based vs. Columnar Storage

Consider a table with 100 columns and 100 million rows. To compute a simple SELECT country, SUM(revenue) FROM sales GROUP BY country, only two of the 100 columns are needed. In a row-based format (like a traditional RDBMS heap file or CSV), the values of each row are stored together, so the database must read every row in its entirety and discard the 98 irrelevant column values to extract the 2 it needs. In a columnar format, the database reads only the contiguous blocks containing the country and revenue columns and never touches the other 98. For a query like this one, that cuts the data read from storage to roughly the share of the table those two columns occupy.
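The difference can be sketched with a toy in-memory table. This is only an illustration under simplifying assumptions: three columns stand in for the 100 in the example, the helper names (sum_by_country_rows, sum_by_country_columns) are hypothetical, and "values touched" is a rough stand-in for I/O rather than a measurement of any real engine.

```python
from collections import defaultdict

# Toy table: "notes" stands in for the 98 columns the query does not need.
rows = [
    {"country": "US", "revenue": 100, "notes": "a"},
    {"country": "DE", "revenue": 50, "notes": "b"},
    {"country": "US", "revenue": 25, "notes": "c"},
]

# Row layout: all values of one record are stored together.
row_store = [tuple(r.values()) for r in rows]

# Columnar layout: all values of one column are stored together.
column_store = {name: [r[name] for r in rows] for name in rows[0]}


def sum_by_country_rows(store):
    """GROUP BY country over the row layout; every value of every row is read."""
    totals = defaultdict(int)
    touched = 0
    for record in store:
        touched += len(record)  # the whole record is read, needed or not
        country, revenue, _ = record
        totals[country] += revenue
    return dict(totals), touched


def sum_by_country_columns(store):
    """GROUP BY country over the columnar layout; only two columns are read."""
    totals = defaultdict(int)
    countries, revenues = store["country"], store["revenue"]
    touched = len(countries) + len(revenues)  # "notes" is never touched
    for country, revenue in zip(countries, revenues):
        totals[country] += revenue
    return dict(totals), touched


row_totals, row_touched = sum_by_country_rows(row_store)
col_totals, col_touched = sum_by_country_columns(column_store)
print(row_totals, row_touched)
print(col_totals, col_touched)
```

Both functions return the same totals, but the columnar version touches only the values in the two columns it needs. With 100 columns instead of 3, the gap grows accordingly.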

Apache Parquet: The Standard

Apache Parquet is the dominant columnar format in the data lakehouse ecosystem, used as the default storage format by Apache Iceberg, Delta Lake, and Apache Hudi. Parquet achieves its efficiency through:

Columnar Layout: Data is stored column-by-column within row groups of configurable size.
Encoding: Columns use value-specific encodings (dictionary encoding for low-cardinality strings, run-length encoding for repetitive values, bit-packing for integers) to significantly reduce file sizes.
Compression: After encoding, the pages within each column chunk are compressed using codecs like Snappy, Zstd, or LZ4, further reducing storage footprint.
Embedded Statistics: Column chunk metadata can carry min/max value statistics, enabling query engines to skip entire Parquet row groups whose value range cannot match a filter, without reading a single row of their data.
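The ideas behind the encoding and statistics steps can be sketched in plain Python. This is a simplified illustration, not Parquet's actual on-disk encodings or reader logic: the function names, the dictionary-based row group structure, and the sample data are all assumptions made for the example.

```python
def dictionary_encode(values):
    """Replace each value with a small integer index into a dictionary."""
    dictionary, indices, seen = [], [], {}
    for value in values:
        if value not in seen:
            seen[value] = len(dictionary)
            dictionary.append(value)
        indices.append(seen[value])
    return dictionary, indices


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [tuple(run) for run in runs]


def build_row_groups(values, group_size):
    """Split a column into row groups and record min/max for each one."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append({"min": min(chunk), "max": max(chunk), "values": chunk})
    return groups


def scan_greater_than(groups, threshold):
    """Return values > threshold, skipping groups whose max rules them out."""
    matches, groups_read = [], 0
    for group in groups:
        if group["max"] <= threshold:
            continue  # statistics prove no value can match; skip the data
        groups_read += 1
        matches.extend(v for v in group["values"] if v > threshold)
    return matches, groups_read


# Low-cardinality strings become small integers, then short runs.
countries = ["US", "US", "US", "DE", "DE", "US"]
dictionary, indices = dictionary_encode(countries)
runs = run_length_encode(indices)

# A filter on revenue reads only the row group whose range can match.
revenue = [5, 7, 9, 120, 150, 130, 8, 6, 4]
groups = build_row_groups(revenue, group_size=3)
matches, groups_read = scan_greater_than(groups, threshold=100)
print(dictionary, runs, matches, groups_read)
```

In this sketch, six strings shrink to a two-entry dictionary plus three runs, and the revenue filter reads one of three row groups. Skipping works best when data is sorted or clustered on the filtered column, since that keeps each group's min/max range narrow.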

Apache ORC

Apache ORC (Optimized Row Columnar) is an alternative columnar format that originated in Apache Hive and has historically been popular across the Hive and wider Hadoop ecosystem. While Parquet has become the standard for open lakehouse architectures, ORC remains widely used in Hive-based workloads and by some BI tools that have legacy ORC reader implementations. Both formats offer similar fundamental performance characteristics.
