Data lineage tracking is the practice of recording and visualizing the full provenance of data: where it came from, what transformations it passed through, and where it went. In regulated industries (finance, healthcare, government), lineage documentation is a compliance requirement. For AI and ML applications, lineage is essential for reproducibility: knowing exactly which version of which training data produced a model's current behavior.

Iceberg's Native Lineage Metadata

Apache Iceberg's architecture is inherently metadata-rich, providing built-in lineage signals at multiple granularities:

Iceberg V3 Row Lineage

The Apache Iceberg V3 specification, introduced in 2025, adds row-level lineage tracking through two new system columns:

OpenLineage and External Platforms

For cross-system visual lineage graphs, organizations integrate Iceberg metadata with OpenLineage, an open standard that captures lineage events from Spark, Flink, dbt, and Airflow jobs. When Iceberg tables are used as inputs or outputs in these jobs, OpenLineage records the full dependency graph. Governance platforms like DataHub, Atlan, and Apache Atlas then ingest these events to provide visual lineage maps showing how data flows from source systems through transformations to downstream analytics and ML models.

Master the Agentic Lakehouse

Architecting an Apache Iceberg Lakehouse

Architecting an Apache Iceberg Lakehouse

Buy on Manning
The AI Lakehouse

The AI Lakehouse

Buy on Amazon