Project Nessie is an advanced, open-source transactional catalog designed specifically for modern data lakehouses built on Apache Iceberg. While standard catalogs simply track the current state of a table, Nessie maintains a versioned history of the entire catalog state, introducing software engineering principles to data management in a paradigm known as Data-as-Code.
Git-Like Capabilities for Data
Nessie allows data engineers to interact with their data lake using operations familiar to any software developer who uses Git:
- Branching: Users can create isolated branches of the catalog instantly. Because Iceberg separates metadata from data, branching is a lightweight metadata operation. No underlying Parquet files are duplicated. A data scientist can create an `experiment` branch, run destructive transformations, and train models without ever affecting the `main` production branch.
- Commits: Every change made to tables within Nessie is recorded as an immutable commit. This provides an exact, auditable history of how the lakehouse evolved over time.
- Merging: Once a data engineering pipeline finishes running tests on a staging branch, the changes can be merged back into the `main` branch. The merge process checks for conflicts and safely applies the updates.
- Tagging: Organizations can tag specific commits (e.g., `Q1_Earnings_Release`) to create permanent, reproducible snapshots of the entire lakehouse at a specific point in time, essential for compliance and auditing.
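To make the idea of branching as a metadata-only operation concrete, here is a minimal toy model in Python. It is not the Nessie API: `ToyCatalog`, `Commit`, and the `s3://lake/...` metadata paths are hypothetical names invented for illustration. The sketch assumes only what the list above describes: a commit is an immutable snapshot of table-to-metadata pointers, and branches and tags are named pointers to commits.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Commit:
    """Immutable snapshot of the whole catalog (hypothetical toy type)."""
    parent: Optional[str]
    message: str
    tables: tuple  # sorted (table_name, metadata_location) pairs

    @property
    def id(self) -> str:
        # Content-derived identifier, similar in spirit to a Git hash.
        payload = json.dumps([self.parent, self.message, list(self.tables)])
        return hashlib.sha256(payload.encode()).hexdigest()[:12]


class ToyCatalog:
    """Toy versioned catalog: branches and tags are just named commit ids."""

    def __init__(self):
        root = Commit(None, "init", ())
        self.commits = {root.id: root}
        self.branches = {"main": root.id}
        self.tags = {}

    def create_branch(self, name, source="main"):
        # Branching copies one pointer; no table data or metadata is copied.
        self.branches[name] = self.branches[source]

    def commit(self, branch, message, changes):
        head = self.commits[self.branches[branch]]
        tables = dict(head.tables)
        tables.update(changes)
        new = Commit(head.id, message, tuple(sorted(tables.items())))
        self.commits[new.id] = new
        self.branches[branch] = new.id  # only this branch moves
        return new.id

    def tag(self, name, branch="main"):
        # A tag is a named pointer that is never moved by later commits.
        self.tags[name] = self.branches[branch]

    def tables_at(self, ref):
        commit_id = self.branches.get(ref) or self.tags[ref]
        return dict(self.commits[commit_id].tables)


catalog = ToyCatalog()
catalog.commit("main", "load sales", {"sales": "s3://lake/sales/v1.metadata.json"})
catalog.tag("Q1_Earnings_Release")
catalog.create_branch("experiment")
shared_head = catalog.branches["experiment"] == catalog.branches["main"]
catalog.commit("experiment", "rewrite sales", {"sales": "s3://lake/sales/v2.metadata.json"})
```

After the last commit, `experiment` points at the rewritten table while both `main` and the `Q1_Earnings_Release` tag still resolve to the original metadata file, which is the isolation the branching bullet describes.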
Multi-Table Transactions
Standard Apache Iceberg guarantees ACID transactions at the individual table level. However, modern ETL jobs often span multiple tables: for instance, inserting data into a fact table while simultaneously updating three dimension tables. If the pipeline fails halfway through, some tables reflect the new data while others still hold the old state, leaving them out of sync.
Nessie solves this by enabling cross-table atomic commits. When the multi-table ETL process runs on an isolated branch, none of its changes are visible to production. Once the entire pipeline succeeds, the branch is merged into production in a single, atomic operation. Consumers querying the `main` branch see either all of the updates or none of them, never a partially updated set of tables.
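The all-or-nothing behavior can be sketched with another toy model. This is an illustration of the pattern, not Nessie's implementation: `ToyRefStore`, its methods, and the `v1`/`v2` pointers are hypothetical, and the merge shown is the simplest possible case (it refuses to publish if `main` has changed since the branch was created, rather than resolving conflicts).

```python
import threading


class ToyRefStore:
    """Hypothetical stand-in for a catalog: each branch name maps to one
    snapshot of the catalog (table name -> metadata pointer)."""

    def __init__(self, main_tables):
        self._lock = threading.Lock()
        self._refs = {"main": dict(main_tables)}

    def read(self, branch):
        with self._lock:
            return dict(self._refs[branch])

    def create_branch(self, name, source="main"):
        with self._lock:
            self._refs[name] = dict(self._refs[source])

    def write_table(self, branch, table, pointer):
        # Each write replaces only this branch's snapshot.
        with self._lock:
            snapshot = dict(self._refs[branch])
            snapshot[table] = pointer
            self._refs[branch] = snapshot

    def merge(self, source, target, expected_target):
        """Publish every change on `source` in one step, or none at all."""
        with self._lock:
            if self._refs[target] != expected_target:
                raise RuntimeError("target changed since branching; retry")
            # A single reference swap: readers see the old or the new snapshot.
            self._refs[target] = dict(self._refs[source])


tables = ["fact_sales", "dim_customer", "dim_product", "dim_store"]
store = ToyRefStore({name: "v1" for name in tables})

base = store.read("main")
store.create_branch("etl")
store.write_table("etl", "fact_sales", "v2")
store.write_table("etl", "dim_customer", "v2")
mid_run = store.read("main")  # what production sees while the job is half done
store.write_table("etl", "dim_product", "v2")
store.write_table("etl", "dim_store", "v2")
store.merge("etl", "main", expected_target=base)
after = store.read("main")
```

Here `mid_run` still contains only `v1` pointers even though two tables have already been rewritten on the `etl` branch, and `after` contains only `v2` pointers. If the job had failed before `merge`, `main` would simply never have moved.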
Ecosystem Integration
Nessie implements the Iceberg REST Catalog API, so it integrates with query engines that support that API, such as Spark, Flink, and Trino. For organizations managing large, complex data pipelines where environment isolation (dev, stage, prod) and strict data quality gates are required, Project Nessie provides the foundational infrastructure to treat data as rigorously as code.
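As a rough sketch of what that integration looks like from the Spark side, the helper below builds the Spark properties that register an Iceberg REST catalog. The property keys (`spark.sql.catalog.<name>`, `.type`, `.uri`) are Iceberg's standard Spark catalog settings; the helper name `rest_catalog_conf`, the catalog name `nessie`, and the URI are placeholders assumed for illustration, so check the endpoint path against the documentation for your Nessie version.

```python
def rest_catalog_conf(catalog_name, rest_uri):
    """Build Spark properties for an Iceberg REST catalog.

    `catalog_name` and `rest_uri` are caller-supplied; the values used
    below are placeholders, not a verified Nessie endpoint.
    """
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": rest_uri,
    }


conf = rest_catalog_conf("nessie", "http://localhost:19120/iceberg")
for key, value in sorted(conf.items()):
    print(f"--conf {key}={value}")
```

The printed lines can be passed to `spark-submit` or set on a `SparkSession` builder, provided the matching Iceberg Spark runtime package is on the classpath.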



