In Apache Iceberg, data skipping is powered by the per-column min/max statistics (lower and upper bounds) that the manifest files record for each data file. If data is randomly scattered across thousands of Parquet files, a query engine has to open and scan almost all of them, because the min/max range of nearly every file will overlap with the query's filter. To fix this, data engineers use compaction to sort the data, grouping similar values into the same physical files so that each file covers a narrow range of values.
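To make the pruning logic concrete, here is a minimal Python sketch. It is a toy model, not Iceberg's implementation: each "file" is just a list of values for one column, and the helpers `chunk` and `files_to_scan` are hypothetical names introduced for illustration. It compares how many files overlap a range filter when the same values are scattered versus sorted.

```python
import random

def chunk(values, rows_per_file):
    """Split a list of values into consecutive 'files' of equal size."""
    return [values[i:i + rows_per_file] for i in range(0, len(values), rows_per_file)]

def files_to_scan(files, lo, hi):
    """Count files whose [min, max] range overlaps the filter range [lo, hi].

    This mirrors min/max pruning: a file can be skipped only when its
    value range cannot contain any row matching the filter.
    """
    return sum(1 for f in files if min(f) <= hi and max(f) >= lo)

random.seed(0)
values = list(range(10_000))
random.shuffle(values)

scattered = chunk(values, 100)             # 100 files, values in random order
sorted_files = chunk(sorted(values), 100)  # 100 files, values sorted first

# Filter: value BETWEEN 4200 AND 4250
scattered_hits = files_to_scan(scattered, 4_200, 4_250)
sorted_hits = files_to_scan(sorted_files, 4_200, 4_250)

print("scattered layout, files to scan:", scattered_hits)
print("sorted layout, files to scan:", sorted_hits)
```

In the sorted layout the filter range falls inside a single file, while in the scattered layout nearly every file has a min below and a max above the filter, so almost nothing can be skipped.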
The Limitation of Linear Sorting
Traditional sorting is hierarchical and linear. If you sort a table by country, and then by city, the data is perfectly clustered for a query filtering by country. However, if a user queries the table filtering only by city = 'Paris', the query engine will still have to scan many files: cities are only ordered within each country's block, so the city min/max range of most files is wide enough to include "Paris" and those files cannot be skipped. Linear sorting strongly biases performance toward the first column in the sort key.
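The bias toward the first sort key can be seen in a small simulation. The sketch below is illustrative only: it uses a synthetic table of integer pairs `(a, b)`, where `a` stands in for the first sort column and `b` for the second, and the helpers `split_into_files` and `files_matching` are hypothetical.

```python
ROWS_PER_FILE = 64

def split_into_files(rows, rows_per_file=ROWS_PER_FILE):
    """Cut an ordered list of rows into consecutive fixed-size 'files'."""
    return [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]

def files_matching(files, column, value):
    """Count files whose min/max range for `column` contains `value`."""
    count = 0
    for rows in files:
        col_values = [row[column] for row in rows]
        if min(col_values) <= value <= max(col_values):
            count += 1
    return count

# Synthetic table: every combination of a in 0..63 and b in 0..63 (4096 rows).
rows = [(a, b) for a in range(64) for b in range(64)]

# Linear sort: tuples compare by a first, then by b.
linear_files = split_into_files(sorted(rows))

first_key_hits = files_matching(linear_files, 0, 10)   # filter: a = 10
second_key_hits = files_matching(linear_files, 1, 10)  # filter: b = 10

print("files to scan for a = 10:", first_key_hits)   # 1
print("files to scan for b = 10:", second_key_hits)  # 64
```

Each file ends up holding a single value of `a` but the full range of `b`, so a filter on the second column cannot prune any of the 64 files.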
What is Z-Ordering?
Z-Ordering (also known as Morton ordering) is a space-filling curve technique that can be applied during Iceberg table compaction. Instead of sorting hierarchically, Z-Ordering maps multi-dimensional data into a single dimension by interleaving the bits of the binary representations of the values from multiple columns, and then sorts rows by the resulting value.
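This creates a layout where rows that are close in several dimensions at once (e.g., both country and city) tend to land in the same data files, so each file's min/max range stays narrow for every Z-Ordered column. Z-Ordering removes the first-column bias of hierarchical sorting by giving roughly equal sorting weight to all columns included in the Z-Order expression. The trade-off is that no single column is clustered as tightly as it would be as the leading key of a linear sort.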
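Bit interleaving can be sketched in a few lines of Python. This is a simplified illustration for two small non-negative integers, not the encoding Iceberg itself uses for arbitrary column types. The names `interleave_bits`, `split_into_files`, and `files_matching` are hypothetical, and the table is a synthetic grid of integer pairs `(a, b)`.

```python
def interleave_bits(x, y, bits=16):
    """Build a Z-value by alternating the bits of x (even positions) and y (odd positions)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def split_into_files(rows, rows_per_file=64):
    """Cut an ordered list of rows into consecutive fixed-size 'files'."""
    return [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]

def files_matching(files, column, value):
    """Count files whose min/max range for `column` contains `value`."""
    count = 0
    for rows in files:
        col_values = [row[column] for row in rows]
        if min(col_values) <= value <= max(col_values):
            count += 1
    return count

# x = 3 (011) and y = 5 (101) interleave to 100111 in binary.
print(interleave_bits(3, 5))  # 39

# Synthetic table: every combination of a in 0..63 and b in 0..63 (4096 rows),
# ordered by Z-value instead of by (a, b).
rows = [(a, b) for a in range(64) for b in range(64)]
z_files = split_into_files(sorted(rows, key=lambda r: interleave_bits(r[0], r[1])))

a_hits = files_matching(z_files, 0, 10)  # filter: a = 10
b_hits = files_matching(z_files, 1, 10)  # filter: b = 10

print("files to scan for a = 10:", a_hits)  # 8
print("files to scan for b = 10:", b_hits)  # 8
```

On this grid every file covers an 8 x 8 block of `(a, b)` values, so a filter on either column touches 8 of the 64 files. A linear sort by `(a, b)` on the same data would touch 1 file for the first column and all 64 for the second, which is the trade-off Z-Ordering makes.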
When to Use Z-Ordering
Z-Ordering is computationally expensive to execute during a compaction job. Therefore, it is best applied under specific conditions:
- Multiple Uncorrelated Filters: When business users frequently query a table using different combinations of filters (e.g., querying by customer_id in one query, and product_id in another).
- High Cardinality: Z-Ordering works best on columns with many unique values. It is less effective for low-cardinality columns (like boolean flags or gender), which are better suited for traditional Iceberg partitioning.
By applying Z-Ordering to the most frequently filtered columns, an organization can significantly improve Iceberg's data skipping: more files are pruned by their min/max statistics, which means faster query execution and lower compute costs.
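In Spark, a Z-Order compaction is typically triggered through Iceberg's rewrite_data_files procedure with a sort strategy and a zorder(...) sort order. The Python sketch below only builds the SQL text. The helper `build_zorder_rewrite_sql`, the catalog `my_catalog`, and the table `db.sales` are hypothetical, and the exact procedure arguments should be checked against the documentation for your Iceberg version.

```python
def build_zorder_rewrite_sql(catalog, table, columns):
    """Return a Spark SQL CALL statement that requests a Z-Order rewrite.

    The two-column minimum is a guard chosen for this sketch: interleaving
    a single column yields the same order as a plain linear sort.
    """
    if len(columns) < 2:
        raise ValueError("Z-Ordering is only useful with two or more columns")
    sort_order = "zorder(" + ",".join(columns) + ")"
    return (
        f"CALL {catalog}.system.rewrite_data_files("
        f"table => '{table}', strategy => 'sort', sort_order => '{sort_order}')"
    )

# Hypothetical catalog, table, and column names.
sql = build_zorder_rewrite_sql("my_catalog", "db.sales", ["customer_id", "product_id"])
print(sql)

# Assuming an active SparkSession named `spark` that is configured with the
# Iceberg runtime and SQL extensions, the statement would be run with:
#   spark.sql(sql)
```

Building the statement separately from executing it keeps the sketch runnable without a Spark cluster and makes the chosen columns easy to review before an expensive compaction job is launched.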



