Iceberg Compaction

Compaction is an essential table maintenance operation in Apache Iceberg. Over time, as data is ingested (especially via streaming or micro-batches) and updated, the physical layout of an Iceberg table can degrade, leading to slower query times. Compaction restructures this data in the background to restore optimal performance.

Why is Compaction Necessary?

There are two primary reasons a table requires compaction:

The Small File Problem: Query engines reading from cloud object storage (like Amazon S3) perform best with a modest number of large files (e.g., 256MB to 512MB) rather than thousands of tiny KB-sized files. Streaming ingestion often produces many small files, and each one costs at least one network request and adds entries to the table metadata that the engine must parse during planning.
Delete File Bloat: In tables using the Merge-on-Read (MoR) strategy, updates and deletes generate "delete files." At read time, the query engine must scan the base data and filter out rows matched by these delete files. As delete files accumulate, read performance degrades rapidly.

Compaction Strategies

Iceberg supports different strategies for rewriting data files during compaction, allowing engineers to balance maintenance cost against read performance:

1. Bin-pack Compaction

The simplest and fastest strategy. Bin-packing takes multiple small data files and combines them into fewer, target-sized data files (e.g., merging ten 25MB files into one 250MB file). It does not change the order of the data within the files. It is the cheapest way to solve the small file problem.
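
The grouping step can be illustrated with a minimal Python sketch. It uses a greedy first-fit-decreasing heuristic as a stand-in; this is an assumption for illustration, not Iceberg's actual planner, and the `bin_pack` function and its parameters are hypothetical names.

```python
from typing import List

MB = 1024 * 1024


def bin_pack(file_sizes: List[int], target_size: int) -> List[List[int]]:
    """Group file sizes so that each group totals at most target_size.

    Greedy first-fit-decreasing: take files from largest to smallest and
    put each one into the first group that still has room.
    """
    groups: List[List[int]] = []
    for size in sorted(file_sizes, reverse=True):
        for group in groups:
            if sum(group) + size <= target_size:
                group.append(size)
                break
        else:
            # No existing group has room, so start a new output file.
            groups.append([size])
    return groups


# The example from the text: ten 25MB files, 250MB target.
small_files = [25 * MB] * 10
plan = bin_pack(small_files, target_size=250 * MB)
print(len(small_files), "input files ->", len(plan), "output file(s)")
# prints: 10 input files -> 1 output file(s)
```

Each inner list represents one output file. Rows are copied as-is within a group, which is why bin-packing never changes the order of the data.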

2. Sort Compaction

A more expensive but highly effective strategy. Sort compaction physically reorganizes the data within the files based on specific columns (e.g., sorting a sales table by customer_id), so that rows with similar values end up in the same files. As a result, the column-level min/max statistics stored in the Iceberg manifest files describe narrow, largely non-overlapping ranges. When a query filters on that column, the engine can skip every file whose range cannot contain a match, drastically reducing query times. Multi-column clustering techniques such as Z-ordering can be used to optimize for queries that filter on several columns at once.
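
The effect of sorting on min/max pruning can be seen in a small, self-contained Python simulation. It is only a sketch under simplifying assumptions: each "file" is a list of customer_id values, the helper names (`file_stats`, `files_to_scan`) are hypothetical, and a deterministic scramble stands in for arrival order.

```python
from typing import List, Tuple


def file_stats(values: List[int], rows_per_file: int) -> List[Tuple[int, int]]:
    """Split values into fixed-size files and return (min, max) per file."""
    files = [values[i:i + rows_per_file]
             for i in range(0, len(values), rows_per_file)]
    return [(min(f), max(f)) for f in files]


def files_to_scan(stats: List[Tuple[int, int]], wanted: int) -> int:
    """Count files whose [min, max] range could contain the wanted value."""
    return sum(1 for lo, hi in stats if lo <= wanted <= hi)


# 10,000 rows over 1,000 customer ids, in scrambled "arrival" order.
customer_ids = [(i * 7919) % 1000 for i in range(10_000)]

unsorted_stats = file_stats(customer_ids, rows_per_file=1000)
sorted_stats = file_stats(sorted(customer_ids), rows_per_file=1000)

print("unsorted:", files_to_scan(unsorted_stats, 42), "of", len(unsorted_stats))
print("sorted:  ", files_to_scan(sorted_stats, 42), "of", len(sorted_stats))
# prints: unsorted: 10 of 10
#         sorted:   1 of 10
```

In the unsorted layout every file spans the full id range, so a filter on customer_id = 42 cannot rule any file out. After sorting, each file covers a disjoint slice of ids and only one file has to be read.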

Applying Deletes

Whichever strategy is used, compaction reads the data files selected for rewriting, applies any pending equality or position delete files to them, and writes out fresh, clean data files (typically Parquet). After the compaction job commits, the query engine no longer has to apply those deletes on the fly for the rewritten files.
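
As a rough illustration of what "applying deletes" means, the following Python sketch filters rows using both delete types. It is a simplified model, not Iceberg's implementation: files are in-memory lists of dicts, the function name `compact_with_deletes` is hypothetical, and it ignores the sequence-number rules that decide which data files an equality delete applies to.

```python
from typing import Dict, List, Set, Tuple

Row = Dict[str, object]


def compact_with_deletes(
    data_files: Dict[str, List[Row]],
    position_deletes: Set[Tuple[str, int]],
    equality_deletes: List[Row],
) -> List[Row]:
    """Return the rows that survive once all deletes are applied.

    position_deletes holds (file path, row position) pairs.
    equality_deletes holds column/value dicts; a row is deleted when it
    matches every column of at least one of them.
    """
    surviving: List[Row] = []
    for path, rows in data_files.items():
        for pos, row in enumerate(rows):
            if (path, pos) in position_deletes:
                continue  # removed by a position delete
            if any(all(row.get(col) == val for col, val in d.items())
                   for d in equality_deletes):
                continue  # removed by an equality delete
            surviving.append(row)
    return surviving


data_files = {
    "a.parquet": [{"id": 1, "v": "x"}, {"id": 2, "v": "y"}],
    "b.parquet": [{"id": 3, "v": "z"}, {"id": 4, "v": "w"}],
}
clean_rows = compact_with_deletes(
    data_files,
    position_deletes={("a.parquet", 1)},  # second row of a.parquet
    equality_deletes=[{"id": 3}],          # every row where id = 3
)
print([row["id"] for row in clean_rows])
# prints: [1, 4]
```

The surviving rows are what a compaction job would write into the new data files. A merge-on-read query performs the same filtering on every read until compaction does it once and persists the result.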

In modern lakehouse architectures, compaction is often handled automatically by managed services or automated control planes (like Dremio's autonomous table optimization or AWS Glue), freeing data engineers from writing manual maintenance scripts.
