Data provenance is the documented history of a dataset's origin and all transformations applied to it from creation to its current state. In a regulatory context, provenance answers: where did this specific data point come from, was it collected with the appropriate consent, which pipelines processed it, and has it ever been shared with a third party? In an ML context, provenance answers: what exactly was in the training dataset that produced this model, and could that dataset be reproduced exactly if needed?
Provenance in the Iceberg Lakehouse
Apache Iceberg's snapshot metadata provides the technical foundation for data provenance. Each snapshot records the operation type and a summary of the source query or job. OpenLineage events from Spark, Flink, and Airflow jobs add job-level provenance: which input datasets fed which output datasets, with full transformation code references. When combined with a metadata platform like DataHub, the complete provenance graph for any Iceberg table is queryable: trace from a Gold-layer analytics table back through dbt transformations, through Silver-layer cleansing, through Bronze ingestion, all the way to the source system and original event timestamp.

