Unlike traditional data warehouses that manage their own internal storage invisibly, a Data Lakehouse gives you full control over your data in object storage. With this control comes the responsibility of routine table maintenance. In Apache Iceberg, every write, update, or delete operation creates a new snapshot and new metadata files. Without maintenance, tables will suffer from metadata bloat and degraded query performance.
A healthy Iceberg maintenance lifecycle consists of three operations, typically run in this order:
1. Expire Snapshots
Because Iceberg retains historical data to enable "Time Travel" queries, it never overwrites old data by default. Over months of ingestion, a table might accumulate thousands of snapshots, causing the metadata files to become massive and slowing down query planning.
The Expire Snapshots operation removes snapshots older than a chosen cutoff (e.g., older than 7 days) from the table metadata while retaining the current active state. Time travel to an expired snapshot is no longer possible. Crucially, this operation also identifies which underlying Parquet data files are no longer referenced by any of the surviving snapshots and deletes them from storage.
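The selection logic described above can be sketched in plain Python. This is a simplified model, not Iceberg's implementation: the `Snapshot` class, the `expire_snapshots` function, and the file names are hypothetical and exist only to show why a file is deleted only when no surviving snapshot references it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Snapshot:
    # Hypothetical, simplified stand-in for an Iceberg snapshot.
    snapshot_id: int
    committed_at: datetime
    data_files: frozenset  # paths of the data files this snapshot references


def expire_snapshots(snapshots, older_than, retain_last=1):
    """Return (surviving snapshots, data files that are safe to delete)."""
    if retain_last < 1:
        raise ValueError("retain_last must be at least 1")
    ordered = sorted(snapshots, key=lambda s: s.committed_at)
    # The newest `retain_last` snapshots survive regardless of age.
    protected = {s.snapshot_id for s in ordered[-retain_last:]}
    kept, expired = [], []
    for snap in ordered:
        survives = snap.snapshot_id in protected or snap.committed_at >= older_than
        (kept if survives else expired).append(snap)
    # A file is deletable only if no surviving snapshot still references it.
    live = set().union(*(s.data_files for s in kept))
    candidates = set().union(*(s.data_files for s in expired))
    return kept, candidates - live


now = datetime(2024, 1, 31)
snaps = [
    Snapshot(1, now - timedelta(days=30), frozenset({"a.parquet", "b.parquet"})),
    Snapshot(2, now - timedelta(days=10), frozenset({"b.parquet", "c.parquet"})),
    Snapshot(3, now - timedelta(days=1), frozenset({"c.parquet", "d.parquet"})),
]
kept, deletable = expire_snapshots(snaps, older_than=now - timedelta(days=7))
print([s.snapshot_id for s in kept], sorted(deletable))
```

Note that `c.parquet` is referenced by an expired snapshot but is kept, because the surviving snapshot still needs it.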
2. Remove Orphan Files
Distributed computing environments are prone to failure. If a Spark job crashes halfway through writing a new batch of data, those Parquet files might be left sitting in your S3 bucket without ever being successfully committed to the Iceberg catalog. These are called "orphan files."
The Remove Orphan Files operation scans the table's physical storage location, compares the files it finds against those referenced by the table's Iceberg metadata, and permanently deletes any unreferenced files. Warning: You should always use a safety buffer (such as only deleting files older than 3 days) to ensure you don't accidentally delete files that are still being written by a slow, currently running ingestion job and have not yet been committed.
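The comparison and the safety buffer can be illustrated with a small Python sketch. It is a conceptual model under stated assumptions: `find_orphans`, the storage listing, and the file paths are hypothetical, and a real implementation would read the listing from object storage and the referenced set from table metadata.

```python
from datetime import datetime, timedelta


def find_orphans(storage_files, referenced_paths, now, min_age=timedelta(days=3)):
    """Return paths that are unreferenced and older than the safety buffer.

    storage_files: dict mapping path -> last-modified time (from a storage listing).
    referenced_paths: set of paths reachable from the table metadata.
    """
    cutoff = now - min_age
    return sorted(
        path
        for path, modified in storage_files.items()
        # Skip recent files: they may belong to a job that has not committed yet.
        if path not in referenced_paths and modified < cutoff
    )


now = datetime(2024, 1, 31, 12, 0)
listing = {
    "data/committed.parquet": now - timedelta(days=20),
    "data/crashed-job.parquet": now - timedelta(days=9),
    "data/in-flight.parquet": now - timedelta(hours=2),
}
referenced = {"data/committed.parquet"}
print(find_orphans(listing, referenced, now))
```

Here `data/in-flight.parquet` is also unreferenced, but it falls inside the 3-day buffer and is left alone.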
3. Rewrite Data Files (Compaction)
Streaming ingestion or frequent small updates will result in thousands of tiny Parquet files (the "Small File Problem"). Query engines struggle with small files because the network I/O overhead of opening them outweighs the time spent reading the actual data.
The Rewrite Data Files operation (commonly known as Compaction) reads these fragmented small files and rewrites them into fewer, larger files of a target size (typically 128 MB to 512 MB). Readers are not blocked: compaction commits its result as a new snapshot, and queries already in flight keep reading the snapshot they started with. The replaced small files remain in storage until the snapshots that reference them are expired. During compaction, you can also apply optimization strategies like Z-Ordering or sorting to drastically improve future data-skipping performance.
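One way to picture how small files are grouped into larger ones is a bin-packing plan. The sketch below uses a first-fit-decreasing heuristic, which is an assumption chosen for illustration and not necessarily how any engine plans its rewrite; `plan_compaction` and the file names are hypothetical.

```python
MB = 1024 * 1024


def plan_compaction(file_sizes, target_size=256 * MB):
    """Group files smaller than target_size into rewrite groups.

    file_sizes: dict mapping path -> size in bytes.
    Returns a list of groups; each group is a list of paths to merge into one file.
    """
    small = sorted(
        ((size, path) for path, size in file_sizes.items() if size < target_size),
        reverse=True,  # first-fit decreasing: place the largest files first
    )
    bins = []  # each bin is [total_bytes, [paths]]
    for size, path in small:
        for current in bins:
            if current[0] + size <= target_size:
                current[0] += size
                current[1].append(path)
                break
        else:
            bins.append([size, [path]])
    # Rewriting a single file on its own gains nothing, so drop one-file groups.
    return [paths for _, paths in bins if len(paths) > 1]


files = {
    "a.parquet": 200 * MB,
    "b.parquet": 100 * MB,
    "c.parquet": 60 * MB,
    "d.parquet": 50 * MB,
    "e.parquet": 40 * MB,
    "big.parquet": 300 * MB,  # already above the target, so it is left untouched
}
for group in plan_compaction(files):
    print(group)
```

With these sizes, five small files become two output files, each at or below the 256 MB target used in the example.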
Automation
In modern Agentic Lakehouse architectures, these three maintenance tasks are rarely executed manually. Platforms like Databricks, Snowflake, and specialized catalogs like Tabular/Polaris provide managed services that automatically trigger these procedures in the background based on predefined table policies.
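For self-managed tables, a scheduled job can issue the same three operations itself. The sketch below only builds the SQL text for Iceberg's Spark procedures (`expire_snapshots`, `remove_orphan_files`, `rewrite_data_files`); it assumes a Spark session configured with the Iceberg SQL extensions, and the catalog name `my_catalog`, the table `db.events`, and the helper `maintenance_statements` are hypothetical. Managed platforms expose their own policy mechanisms, which are not shown here.

```python
from datetime import datetime, timedelta


def maintenance_statements(catalog, table, now,
                           snapshot_retention=timedelta(days=7),
                           orphan_buffer=timedelta(days=3)):
    """Build the three maintenance CALL statements in the order used in this article."""
    fmt = "%Y-%m-%d %H:%M:%S"
    expire_before = (now - snapshot_retention).strftime(fmt)
    orphan_before = (now - orphan_buffer).strftime(fmt)
    return [
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{expire_before}')",
        f"CALL {catalog}.system.remove_orphan_files("
        f"table => '{table}', older_than => TIMESTAMP '{orphan_before}')",
        f"CALL {catalog}.system.rewrite_data_files(table => '{table}')",
    ]


for statement in maintenance_statements("my_catalog", "db.events", datetime(2024, 1, 31)):
    # With a configured SparkSession named `spark`, this would be spark.sql(statement).
    print(statement)
```

Keeping the statement builder separate from execution makes the retention policy easy to review and unit test before it is wired into a scheduler.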



