A columnar data format is a file format that organizes data by storing all values for a given column contiguously, rather than storing all values for a given row together. This seemingly simple difference in physical layout has profound performance implications for analytical workloads.

Row-Based vs. Columnar Storage

Consider a table with 100 columns and 100 million rows. To compute a simple SELECT country, SUM(revenue) FROM sales GROUP BY country, only two of the 100 columns are needed. In a row-based format (like a traditional RDBMS heap file or CSV), the database must read every row in its entirety and then discard the 98 irrelevant column values to extract the 2 it needs. In a columnar format, the database reads only the contiguous blocks containing the country and revenue columns and never touches the other 98. For this query, that cuts the values read from 100 per row to 2 per row, which dramatically reduces I/O.
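The difference can be sketched in a few lines of Python. This is a toy in-memory model, not how any real engine stores data: the table, the column names other than country and revenue, and the "values touched" counter are all assumptions made for illustration.

```python
from collections import defaultdict

# Row layout: each record keeps all of its fields together.
rows = [
    {"order_id": 1, "country": "US", "revenue": 120.0, "notes": "gift"},
    {"order_id": 2, "country": "DE", "revenue": 80.0, "notes": ""},
    {"order_id": 3, "country": "US", "revenue": 30.0, "notes": "rush"},
]

# Columnar layout: one contiguous list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def revenue_by_country_rows(rows):
    """Row layout: every full record is visited, including unused fields."""
    totals = defaultdict(float)
    values_touched = 0
    for row in rows:
        values_touched += len(row)  # the whole record is read
        totals[row["country"]] += row["revenue"]
    return dict(totals), values_touched


def revenue_by_country_columns(columns):
    """Columnar layout: only the two needed columns are visited."""
    totals = defaultdict(float)
    country, revenue = columns["country"], columns["revenue"]
    values_touched = len(country) + len(revenue)
    for c, r in zip(country, revenue):
        totals[c] += r
    return dict(totals), values_touched


row_totals, row_touched = revenue_by_country_rows(rows)
col_totals, col_touched = revenue_by_country_columns(columns)
print(row_totals == col_totals, row_touched, col_touched)
```

Both functions return the same totals, but the row version touches every field of every record while the columnar version touches only two values per row. With 100 columns instead of 4, the gap grows accordingly.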

Apache Parquet: The Standard

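Apache Parquet is the dominant columnar format in the data lakehouse ecosystem, used as the default storage format by Apache Iceberg, Delta Lake, and Apache Hudi. Parquet achieves its efficiency through column-level encodings such as dictionary and run-length encoding, per-column compression, and column statistics (for example, minimum and maximum values) that let query engines skip data that cannot match a filter.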
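One reason a columnar layout encodes compactly is that the values of a single column tend to repeat. The sketch below combines dictionary encoding with run-length encoding on a hypothetical country column. It is a simplified illustration of the idea only: the function names are invented here, and Parquet's actual on-disk encodings (including its hybrid run-length and bit-packing scheme) are more involved.

```python
def dictionary_encode(values):
    """Replace each value with a small integer code into a dictionary."""
    dictionary = []
    index_of = {}
    codes = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index_of[v])
    return dictionary, codes


def run_length_encode(codes):
    """Collapse consecutive repeats into (code, count) pairs."""
    runs = []
    for code in codes:
        if runs and runs[-1][0] == code:
            runs[-1][1] += 1
        else:
            runs.append([code, 1])
    return [tuple(run) for run in runs]


def decode(dictionary, runs):
    """Invert both steps to recover the original column values."""
    return [dictionary[code] for code, count in runs for _ in range(count)]


country = ["US", "US", "US", "DE", "DE", "US"]
dictionary, codes = dictionary_encode(country)
runs = run_length_encode(codes)
print(dictionary, codes, runs)
```

Six strings become a two-entry dictionary plus three (code, count) pairs, and decode(dictionary, runs) returns the original list. In a row layout the same values would be interleaved with other fields, so runs like these would rarely form.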

Apache ORC

Apache ORC (Optimized Row Columnar) is an alternative columnar format that originated in Apache Hive and has historically been popular across the Hive and broader Hadoop ecosystem. While Parquet has become the standard for open lakehouse architectures, ORC remains widely used in Hive-based workloads and by some BI tools that have legacy ORC reader implementations. Both formats offer similar fundamental performance characteristics.
