A Cost-Based Optimizer (CBO) is the component within a query engine that analyzes multiple possible execution strategies for a SQL query and selects the one with the lowest estimated resource cost. Without a CBO, a query engine would rely on simple rule-based heuristics (for example, always joining the first table to the second, or always using a hash join), which often produce dramatically suboptimal execution plans for complex analytical workloads.

How a CBO Works

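When a SQL query arrives, the CBO works in phases: it enumerates candidate execution plans, estimates the cost of each one from the available statistics, and selects the plan with the lowest estimated cost.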
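
As a rough illustration of comparing candidate plans by estimated cost, here is a minimal Python sketch. It is not how any particular engine implements its optimizer: the table names, row counts, selectivities, and the cost model (the sum of intermediate result sizes in a left-deep join plan) are all assumptions chosen for the example.

```python
from itertools import permutations

# Hypothetical table sizes and join selectivities, for illustration only.
ROWS = {"orders": 1_000_000, "customers": 50_000, "regions": 10}
SELECTIVITY = {
    frozenset({"orders", "customers"}): 1 / 50_000,
    frozenset({"customers", "regions"}): 1 / 10,
}

def join_selectivity(joined, new_table):
    """Combined selectivity of joining new_table to the tables already joined.

    Pairs without a join predicate default to 1.0 (a cross product).
    """
    sel = 1.0
    for table in joined:
        sel *= SELECTIVITY.get(frozenset({table, new_table}), 1.0)
    return sel

def plan_cost(order):
    """Estimated cost of a left-deep plan: the sum of intermediate row counts."""
    joined = [order[0]]
    rows = ROWS[order[0]]
    cost = 0.0
    for table in order[1:]:
        rows = rows * ROWS[table] * join_selectivity(joined, table)
        cost += rows
        joined.append(table)
    return cost

def best_plan(tables):
    """Enumerate every join order and keep the cheapest one."""
    return min(permutations(tables), key=plan_cost)

if __name__ == "__main__":
    for order in permutations(ROWS):
        print(order, plan_cost(order))
    print("chosen:", best_plan(list(ROWS)))
```

Under these assumed numbers, the plans that join the two small tables first come out cheaper than the plan that starts from orders. Real optimizers avoid exhaustive enumeration for large joins (dynamic programming or heuristics are common) and use richer cost models.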

Iceberg Statistics and the CBO

Apache Iceberg's rich metadata layer is a significant asset for CBOs. Min/max statistics are tracked at the column level for each Parquet data file, so the optimizer can estimate, for example, that a filter like WHERE country = 'US' will eliminate 80% of a table's files before scanning begins. With that estimate, the CBO can model the data volume each join operator will actually receive and make better join-ordering decisions on large, skewed datasets than a simple rule-based approach would.
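
To make the pruning estimate concrete, here is a small Python sketch of how per-file min/max bounds can be used to decide which files might contain a value. It does not use the Iceberg API; the FileStats class, the file names, and the bounds are hypothetical, arranged so that the filter prunes 4 of 5 files, matching the 80% figure above.

```python
from dataclasses import dataclass

@dataclass
class FileStats:
    """Hypothetical per-file statistics for one column (not an Iceberg class)."""
    path: str
    row_count: int
    lower: str  # minimum value of the column in this file
    upper: str  # maximum value of the column in this file

def may_contain(stats, value):
    """A file can only hold the value if it falls within [lower, upper]."""
    return stats.lower <= value <= stats.upper

def estimate_scan(files, value):
    """Return the surviving files, the fraction pruned, and estimated rows scanned."""
    kept = [f for f in files if may_contain(f, value)]
    pruned_fraction = 1 - len(kept) / len(files)
    estimated_rows = sum(f.row_count for f in kept)
    return kept, pruned_fraction, estimated_rows

# Made-up bounds for a country column, assuming the data is clustered by country.
FILES = [
    FileStats("f1.parquet", 1000, "AR", "CA"),
    FileStats("f2.parquet", 1000, "DE", "FR"),
    FileStats("f3.parquet", 1000, "IN", "JP"),
    FileStats("f4.parquet", 1000, "MX", "NL"),
    FileStats("f5.parquet", 1000, "UK", "US"),
]

if __name__ == "__main__":
    kept, pruned, rows = estimate_scan(FILES, "US")
    print([f.path for f in kept], pruned, rows)
```

The estimated row count from the surviving files is the kind of input a cost model can use when sizing the join operators downstream of the scan. Bounds only rule files out: a file whose range includes the value may still contain no matching rows, so the estimate is an upper bound on rows scanned.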
