A common concern about the Data Lakehouse is whether querying Parquet files in S3 can match the sub-second performance of a purpose-built cloud data warehouse. The answer depends largely on which performance techniques are applied. An unoptimized Lakehouse (raw S3 files, no partitioning, no caching, no acceleration) is slow. A properly tuned Lakehouse running on a modern execution engine can match, and for some workloads exceed, data warehouse performance at a fraction of the storage cost.

Partition Pruning

Partition pruning is the single most impactful performance technique in the Lakehouse. When an Iceberg table is partitioned by date (for example, one partition per day), a query filtering for last week's data only needs to scan seven partitions out of potentially thousands. Iceberg's metadata makes this possible without listing directories: the manifest list records partition value ranges for each manifest, and each manifest file records the partition values and column statistics of every data file it tracks. The query engine can therefore identify exactly which files to read without opening any of them. Well-chosen partition columns can reduce scan volumes by 99% or more on common analytical queries.
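To make the pruning step concrete, here is a minimal sketch in plain Python. It does not use any Iceberg library: `DataFileEntry`, `prune_by_day`, and the `s3://example-bucket/...` paths are hypothetical stand-ins for manifest entries, assuming one data file per daily partition.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class DataFileEntry:
    """Hypothetical, simplified stand-in for one manifest entry."""
    path: str
    partition_day: date  # partition value recorded in table metadata
    record_count: int


def prune_by_day(entries, start, end):
    """Keep only files whose partition day falls in [start, end] (inclusive).

    Only metadata is inspected; no data file is opened.
    """
    return [e for e in entries if start <= e.partition_day <= end]


# Metadata for 1,000 daily partitions, one file per day (made-up numbers).
first_day = date(2022, 1, 1)
manifest = [
    DataFileEntry(
        path=f"s3://example-bucket/sales/day={first_day + timedelta(days=i)}/part-0.parquet",
        partition_day=first_day + timedelta(days=i),
        record_count=1_000_000,
    )
    for i in range(1000)
]

# "Last week" relative to the newest partition: 7 days, inclusive.
last_day = manifest[-1].partition_day
selected = prune_by_day(manifest, last_day - timedelta(days=6), last_day)

scan_fraction = len(selected) / len(manifest)
print(f"files to read: {len(selected)} of {len(manifest)} ({scan_fraction:.1%})")
```

In this toy table the filter keeps 7 of 1,000 files, so more than 99% of the data is never read. A real engine applies the same idea against manifest metadata rather than an in-memory list.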

Query Acceleration with Dremio Reflections

Dremio Reflections are pre-computed, materialized representations of source data (raw or aggregated) that Dremio persists in its own storage as Parquet files, separate from the source tables. When Dremio detects that a new query matches (or partially matches) an existing reflection, the planner rewrites the query to read from the reflection rather than re-scanning the source Iceberg files. A reflection over a commonly used daily revenue aggregate might reduce a query that scans 500 GB of raw data to a sub-second read from a 10 MB pre-computed result. Reflections are created declaratively and maintained automatically; Dremio refreshes them according to their refresh policy as the underlying Iceberg data changes.
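The idea of answering from a pre-computed aggregate can be sketched as a toy model. This is not Dremio's API or its matching algorithm: `DailyRevenueReflection`, `daily_revenue`, and the integer snapshot ids are hypothetical, and the fallback to a raw scan on a stale aggregate is a simplifying assumption (a real system would typically refresh the reflection instead).

```python
from collections import defaultdict

# Raw "fact table" as (day, amount) rows. Values are made up for illustration.
raw_orders = [
    ("2024-03-01", 120.0),
    ("2024-03-01", 80.0),
    ("2024-03-02", 200.0),
    ("2024-03-03", 50.0),
    ("2024-03-03", 25.0),
]


class DailyRevenueReflection:
    """Toy pre-computed aggregate, tagged with the snapshot it was built from."""

    def __init__(self, rows, snapshot_id):
        self.snapshot_id = snapshot_id
        totals = defaultdict(float)
        for day, amount in rows:
            totals[day] += amount
        self.totals = dict(totals)


def daily_revenue(day, rows, current_snapshot_id, reflection):
    """Answer from the reflection when it is fresh, otherwise rescan raw rows."""
    if reflection is not None and reflection.snapshot_id == current_snapshot_id:
        return reflection.totals.get(day, 0.0), "reflection"
    return sum(amount for d, amount in rows if d == day), "raw scan"


reflection = DailyRevenueReflection(raw_orders, snapshot_id=1)
fresh_answer = daily_revenue("2024-03-01", raw_orders, 1, reflection)

# The source table changes, which we model as a new snapshot id.
raw_orders.append(("2024-03-01", 10.0))
stale_answer = daily_revenue("2024-03-01", raw_orders, 2, reflection)

print(fresh_answer)
print(stale_answer)
```

The first call reads a single dictionary entry instead of scanning every row, which is the same trade the 500 GB versus 10 MB example describes. The second call shows why freshness tracking matters: once the source moves to a new snapshot, the old aggregate no longer reflects the data.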

Columnar Vectorized Execution

Dremio's execution engine processes data in Apache Arrow's columnar in-memory format and uses Gandiva, an LLVM-based expression compiler from the Arrow project, to generate vectorized code for filters and projections that takes advantage of SIMD CPU instructions. Instead of evaluating predicates row-by-row, the engine evaluates them on batches of thousands of values at once using the CPU's vector registers. Because each batch holds contiguous values from a single column, CPU caches are filled with the data the query actually touches rather than with unrelated fields from whole rows, which is the access pattern that modern CPUs handle most efficiently. The result can be an order-of-magnitude throughput improvement over interpreted row-at-a-time engines for aggregation-heavy analytical queries.
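The difference in access pattern can be illustrated with a small sketch. It only contrasts the two data layouts and control flows; plain Python does not emit SIMD instructions, and `BATCH_SIZE`, `sum_row_at_a_time`, and `sum_columnar_batches` are hypothetical names, not part of Arrow or Gandiva.

```python
from array import array

BATCH_SIZE = 4096  # illustrative batch size, not taken from any engine

# Row layout: one tuple per record, fields interleaved in memory.
rows = [(i, float(i % 100)) for i in range(10_000)]

# Columnar layout: one contiguous typed buffer per column.
order_id = array("q", (r[0] for r in rows))
amount = array("d", (r[1] for r in rows))


def sum_row_at_a_time(rows, threshold):
    """Evaluate the predicate once per row, unpacking every field of the row."""
    total = 0.0
    for _order_id, amt in rows:
        if amt > threshold:
            total += amt
    return total


def sum_columnar_batches(amount_col, threshold, batch_size=BATCH_SIZE):
    """Process a single column in fixed-size batches.

    Each batch first produces a selection mask for the predicate,
    then aggregates only the selected values.
    """
    total = 0.0
    for start in range(0, len(amount_col), batch_size):
        batch = amount_col[start:start + batch_size]
        mask = [v > threshold for v in batch]  # predicate over the whole batch
        total += sum(v for v, keep in zip(batch, mask) if keep)
    return total


row_total = sum_row_at_a_time(rows, 90.0)
col_total = sum_columnar_batches(amount, 90.0)
print(row_total, col_total)
```

Both functions return the same total, but the columnar version never touches `order_id`. In a compiled engine, the per-batch predicate loop is the part that maps onto vector registers.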

File Compaction and Z-Ordering

Small Parquet files are the enemy of Lakehouse performance. Each file requires at least one S3 GET request (in practice several, since the footer and the column chunks are read separately), and thousands of tiny files generated by streaming ingestion pipelines produce thousands of requests for what might be a small amount of actual data. Iceberg's compaction operation merges small files into optimally sized larger files (typically 256 MB to 1 GB). Z-Ordering (also called clustering in some tools) physically co-locates related rows within files so that each file covers a narrow range of values in several columns at once. This makes min/max-based file skipping far more effective for multi-column filter patterns that do not align with the primary partition key.
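Both ideas can be sketched in a few lines. The following is an illustration under stated assumptions, not Iceberg's implementation: `plan_compaction` uses a simple first-fit-decreasing bin packing, `z_value` is a textbook two-column Morton code for non-negative integers, and the 512 MB target is an assumed value within the range mentioned above.

```python
TARGET_BYTES = 512 * 1024 * 1024  # assumed target file size for this sketch
MB = 1024 * 1024


def plan_compaction(file_sizes, target=TARGET_BYTES):
    """Group small files into rewrite groups of at most `target` bytes.

    Uses greedy first-fit-decreasing bin packing; each returned group
    would be rewritten as one larger file.
    """
    groups = []  # each item is [total_bytes, [sizes...]]
    for size in sorted(file_sizes, reverse=True):
        for group in groups:
            if group[0] + size <= target:
                group[0] += size
                group[1].append(size)
                break
        else:
            groups.append([size, [size]])
    return [sizes for _total, sizes in groups]


def z_value(x, y, bits=16):
    """Interleave the bits of two non-negative integers into one Morton code."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


# 1,000 files of 4 MB each, e.g. the output of a streaming job.
small_files = [4 * MB] * 1000
groups = plan_compaction(small_files)
print(len(small_files), "files ->", len(groups), "compacted files")

# Sorting rows by Z-value keeps rows that are close in BOTH columns together.
points = [(x, y) for x in range(4) for y in range(4)]
z_sorted = sorted(points, key=lambda p: z_value(*p))
print(z_sorted[:4])
```

Here 1,000 small files collapse into 8 rewrite groups, cutting the number of objects a query has to request. In the Z-ordered output, the first four rows all have `x` and `y` in the range 0 to 1, so a file holding them would have tight min/max bounds on both columns.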
