A Cost-Based Optimizer (CBO) is the component within a query engine that analyzes multiple possible execution strategies for a SQL query and selects the one with the lowest estimated resource cost. Without a CBO, a query engine would rely on simple rule-based heuristics (always join the first table to the second, always use a hash join), often producing dramatically suboptimal execution plans for complex analytical workloads.
How a CBO Works
When a SQL query arrives, the CBO goes through several phases:
- Parse and Bind: The query is parsed into a logical representation and table/column references are resolved against the catalog.
- Statistics Collection: The CBO reads table statistics (row counts, column cardinality, value distributions) from the catalog. For Apache Iceberg tables, this includes the per-column min/max statistics embedded in every manifest file.
- Plan Enumeration: The CBO generates dozens or hundreds of alternative logical plan trees (different join orders, different join algorithms, different scan approaches).
- Cost Estimation: For each candidate plan, the CBO estimates the total cost in terms of I/O, network transfer, and CPU operations, using the table statistics to estimate intermediate result set sizes.
- Plan Selection: The plan with the lowest estimated cost is selected for physical execution.
Iceberg Statistics and the CBO
Apache Iceberg's rich metadata layer is a significant asset for CBOs. The min/max statistics stored at the column level within each Parquet file enable the optimizer to estimate that a filter like WHERE country = 'US' will eliminate 80% of a table's files before scanning begins. This allows the CBO to accurately model the actual data volume each join operator will receive, enabling far more accurate join ordering decisions on large, skewed datasets that a simple rule-based approach would get wrong.

