Copy-on-Write (CoW) is the default data modification strategy in Apache Iceberg. Unlike Merge-on-Read (MoR), which records changes in separate delete files and defers reconciling them with the data until read time or compaction, Copy-on-Write pays the full cost of updates and deletes up front, during the write.
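As a rough illustration of how a table is pinned to this strategy, the sketch below renders a statement that sets Iceberg's three write-mode table properties (`write.delete.mode`, `write.update.mode`, `write.merge.mode`) to `copy-on-write`. The table name `db.events` and the helper `alter_table_sql` are hypothetical, and the `ALTER TABLE ... SET TBLPROPERTIES` form assumes Spark SQL syntax; other engines may expose these properties differently.

```python
# Iceberg table properties that select the row-level write strategy.
# The table name and helper function below are hypothetical illustrations.
COW_PROPERTIES = {
    "write.delete.mode": "copy-on-write",
    "write.update.mode": "copy-on-write",
    "write.merge.mode": "copy-on-write",
}


def alter_table_sql(table, properties):
    """Render an ALTER TABLE ... SET TBLPROPERTIES statement (Spark SQL style)."""
    pairs = ", ".join(f"'{key}' = '{value}'" for key, value in properties.items())
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({pairs})"


statement = alter_table_sql("db.events", COW_PROPERTIES)
print(statement)
```

Each operation type has its own property, so a table can in principle use different strategies for `DELETE`, `UPDATE`, and `MERGE`.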
How Copy-on-Write Works
When an `UPDATE`, `DELETE`, or `MERGE` statement is executed against an Iceberg table configured for CoW, the query engine must perform the following steps:
- Read: The engine reads each data file (e.g., a Parquet file) that contains at least one row to be modified.
- Modify: The engine applies the update or deletion to the affected rows in memory.
- Write (Copy): The engine writes an entirely new data file containing the unchanged rows of the original file plus the modified ones (deleted rows are simply omitted).
- Commit: The engine commits a new snapshot whose metadata drops the reference to the old file and adds a reference to the new file. The old file remains in storage until the snapshots that reference it are expired.
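The four steps above can be sketched with a small in-memory model. This is a toy under stated assumptions, not Iceberg's implementation: the `Table` class, the `cow_update` function, and the file names are all hypothetical, data files are modeled as immutable tuples of rows, and a snapshot is modeled as the set of file names that are live at that point.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    # Hypothetical model: file name -> immutable tuple of rows.
    files: dict = field(default_factory=dict)
    # Each snapshot is the frozenset of file names that are live at that point.
    snapshots: list = field(default_factory=list)

    def current_files(self):
        return self.snapshots[-1] if self.snapshots else frozenset()


def cow_update(table, file_name, new_file_name, predicate, transform):
    # Read: load every row of the affected file.
    rows = table.files[file_name]
    # Modify: apply the change in memory to the matching rows only.
    updated = tuple(transform(row) if predicate(row) else row for row in rows)
    # Write (copy): store the result as a brand-new file; the old one is untouched.
    table.files[new_file_name] = updated
    # Commit: append a snapshot that swaps the old reference for the new one.
    live = (table.current_files() - {file_name}) | {new_file_name}
    table.snapshots.append(frozenset(live))


table = Table()
table.files["data-0001.parquet"] = (
    {"id": 1, "status": "open"},
    {"id": 2, "status": "open"},
)
table.snapshots.append(frozenset({"data-0001.parquet"}))

cow_update(
    table,
    "data-0001.parquet",
    "data-0002.parquet",
    predicate=lambda row: row["id"] == 2,
    transform=lambda row: {**row, "status": "closed"},
)
print(sorted(table.current_files()))
```

Note that the row with `id` 1 is copied into the new file even though it never changed, and that the first snapshot still points at the original file.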
Because objects in stores such as S3 or ADLS are immutable, a data file cannot be modified in place. Changing a single row in a 500 MB Parquet file therefore requires writing a new file of roughly 500 MB. This overhead is known as write amplification.
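To put a number on this, one common way to express write amplification is the ratio of bytes physically written to bytes logically changed. The sketch below assumes the rewritten file is the same size as the original and uses an assumed row size of 200 bytes; both figures, and the function name, are illustrative only.

```python
def write_amplification(file_size_bytes, changed_bytes):
    """Ratio of bytes physically rewritten to bytes logically changed."""
    if changed_bytes <= 0:
        raise ValueError("changed_bytes must be positive")
    return file_size_bytes / changed_bytes


MB = 1_000_000
# Assumption: a single 200-byte row changes inside a 500 MB data file.
factor = write_amplification(500 * MB, 200)
print(f"{factor:,.0f}x")
```

Under these assumptions, every logical byte changed costs about 2.5 million bytes of physical writes.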
When to Use Copy-on-Write
CoW optimizes entirely for read performance. Because every data file already reflects all committed changes, with no separate delete files to merge at query time, the query engine can scan the files directly without any reconciliation work. CoW is the recommended strategy for:
- Read-Heavy Analytical Tables: Dashboards and reports where sub-second query performance is critical.
- Append-Only or Bulk-Update Workloads: Tables where data is primarily appended in large batches, or where updates occur infrequently in massive sweeps (e.g., a nightly batch job correcting yesterday's data).
- Low-Maintenance Environments: Because there are no delete files, CoW tables do not suffer from the read degradation that affects uncompacted Merge-on-Read tables.
When Not to Use Copy-on-Write
CoW becomes a major bottleneck in environments with high-frequency, small updates, such as streaming ingestion pipelines or real-time Change Data Capture (CDC) from a transactional database. If a pipeline applies thousands of individual row updates per minute, the engine spends most of its compute rewriting the same large data files over and over, which leads to high write latency and high cloud compute bills. In those scenarios, Merge-on-Read is generally the better choice.
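The effect of commit granularity can be seen in a toy count of file rewrites. The model below is a simplification under stated assumptions: each commit rewrites every distinct file it touches exactly once, and a row's file is derived as `row_id // rows_per_file`, which is a made-up layout for illustration. The function name and all figures are hypothetical.

```python
def files_rewritten(updates, rows_per_file, batch_size):
    """Count file rewrites when updates are committed in batches of batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    total = 0
    for start in range(0, len(updates), batch_size):
        batch = updates[start:start + batch_size]
        # Assumed layout: consecutive row ids are stored in the same file.
        touched = {row_id // rows_per_file for row_id in batch}
        # Each commit rewrites every distinct file it touches once.
        total += len(touched)
    return total


# 1,000 single-row updates spread evenly over 10 files of 100 rows each.
updates = list(range(1000))
per_row_commits = files_rewritten(updates, rows_per_file=100, batch_size=1)
single_batch = files_rewritten(updates, rows_per_file=100, batch_size=1000)
print(per_row_commits, single_batch)
```

In this model, committing each update on its own rewrites a file 1,000 times, while applying the same updates in one batch rewrites each of the 10 files once. That gap is why CoW suits infrequent bulk sweeps far better than a steady stream of small changes.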



