Data lineage tracking is the practice of recording and visualizing the full provenance of data: where it came from, what transformations it passed through, and where it went. In regulated industries (finance, healthcare, government), lineage documentation is often a compliance requirement. For AI and ML applications, lineage is essential for reproducibility: knowing exactly which version of which training data produced a model's current behavior.
Iceberg's Native Lineage Metadata
Apache Iceberg's architecture is inherently metadata-rich, providing built-in lineage signals at multiple granularities:
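- Snapshot History: Every committed write (for example an INSERT, INSERT OVERWRITE, MERGE INTO, or DELETE) creates an immutable snapshot that records the commit timestamp, an operation type (append, replace, overwrite, or delete), and a summary of what changed. A MERGE is typically recorded as an overwrite. The history of changes to a table is preserved in the metadata layer until old snapshots are expired.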
- Schema Evolution History: The metadata log records every schema change (column additions, renames, type promotions) with timestamps, providing a complete audit trail of how the table structure has evolved.
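As a rough illustration of the two signals above, the sketch below reads a hand-written excerpt of a table metadata file and lists its snapshot history and schema changes. The key names (snapshots, sequence-number, timestamp-ms, summary, schemas, schema-id) follow the Iceberg table spec; the IDs, timestamps, column names, and the helper functions snapshot_history and schema_changes are invented for this example. In practice an engine or client library would expose the same information without parsing the file by hand.

```python
from datetime import datetime, timezone

# Illustrative excerpt of a table metadata file. Key names follow the Iceberg
# table spec; all IDs, timestamps, and columns are made up for this sketch.
table_metadata = {
    "current-schema-id": 1,
    "schemas": [
        {"schema-id": 0, "fields": [
            {"id": 1, "name": "order_id", "type": "long"},
            {"id": 2, "name": "amount", "type": "float"},
        ]},
        {"schema-id": 1, "fields": [
            {"id": 1, "name": "order_id", "type": "long"},
            {"id": 2, "name": "amount", "type": "double"},
            {"id": 3, "name": "currency", "type": "string"},
        ]},
    ],
    "snapshots": [
        {"snapshot-id": 101, "sequence-number": 1,
         "timestamp-ms": 1735689600000, "schema-id": 0,
         "summary": {"operation": "append"}},
        {"snapshot-id": 102, "parent-snapshot-id": 101, "sequence-number": 2,
         "timestamp-ms": 1735776000000, "schema-id": 1,
         "summary": {"operation": "overwrite"}},
        {"snapshot-id": 103, "parent-snapshot-id": 102, "sequence-number": 3,
         "timestamp-ms": 1735862400000, "schema-id": 1,
         "summary": {"operation": "delete"}},
    ],
}


def snapshot_history(metadata):
    """Return one record per snapshot, ordered by sequence number."""
    rows = []
    for snap in sorted(metadata["snapshots"], key=lambda s: s["sequence-number"]):
        committed = datetime.fromtimestamp(snap["timestamp-ms"] / 1000, tz=timezone.utc)
        rows.append({
            "snapshot_id": snap["snapshot-id"],
            "parent_id": snap.get("parent-snapshot-id"),
            "committed_at": committed.isoformat(),
            "operation": snap["summary"]["operation"],
            "schema_id": snap.get("schema-id"),
        })
    return rows


def schema_changes(metadata):
    """Diff consecutive schema versions by field ID.

    Assumes schema IDs reflect the order in which schemas were created.
    """
    schemas = sorted(metadata["schemas"], key=lambda s: s["schema-id"])
    changes = []
    for old, new in zip(schemas, schemas[1:]):
        old_fields = {f["id"]: f for f in old["fields"]}
        new_fields = {f["id"]: f for f in new["fields"]}
        for fid, fld in new_fields.items():
            if fid not in old_fields:
                changes.append((new["schema-id"], "added", fld["name"]))
                continue
            prev = old_fields[fid]
            if prev["name"] != fld["name"]:
                changes.append((new["schema-id"], "renamed",
                                f'{prev["name"]} -> {fld["name"]}'))
            if prev["type"] != fld["type"]:
                changes.append((new["schema-id"], "type changed",
                                f'{fld["name"]}: {prev["type"]} -> {fld["type"]}'))
        for fid, fld in old_fields.items():
            if fid not in new_fields:
                changes.append((new["schema-id"], "dropped", fld["name"]))
    return changes


for row in snapshot_history(table_metadata):
    print(row)
for change in schema_changes(table_metadata):
    print(change)
```

Matching fields by ID rather than by name is what lets the diff tell a rename apart from a drop followed by an add, which mirrors how Iceberg itself tracks columns.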
Iceberg V3 Row Lineage
The Apache Iceberg V3 specification, introduced in 2025, adds row-level lineage tracking through two new system columns:
- _row_id: A unique, persistent identifier assigned to each row at insertion that remains constant across updates and merges. This enables tracking a specific row's journey across multiple snapshots.
- _last_updated_sequence_number: Records the Iceberg snapshot sequence number of the operation that most recently modified a row, enabling precise identification of which job last touched any given record.
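To make the behavior of these two columns concrete, here is a toy in-memory model. It is not an Iceberg API: the ToyTable class and its insert, update, and changed_since methods are hypothetical, and a real engine assigns and maintains these values during commits. The sketch only shows the intended semantics: the row ID is assigned once and never changes, while the last-updated sequence number moves forward each time the row is modified.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """In-memory stand-in for a table with row lineage (illustrative only)."""
    next_row_id: int = 0          # next unassigned row ID
    sequence_number: int = 0      # sequence number of the latest commit
    rows: dict = field(default_factory=dict)  # _row_id -> row

    def insert(self, records):
        """Commit new rows; each gets a fresh, permanent _row_id."""
        self.sequence_number += 1
        for rec in records:
            self.rows[self.next_row_id] = {
                **rec,
                "_row_id": self.next_row_id,
                "_last_updated_sequence_number": self.sequence_number,
            }
            self.next_row_id += 1
        return self.sequence_number

    def update(self, predicate, changes):
        """Commit an update; matching rows keep _row_id, get a new sequence number."""
        self.sequence_number += 1
        for row in self.rows.values():
            if predicate(row):
                row.update(changes)
                row["_last_updated_sequence_number"] = self.sequence_number
        return self.sequence_number

    def changed_since(self, sequence_number):
        """Row IDs modified by any commit after the given sequence number."""
        return sorted(
            r["_row_id"] for r in self.rows.values()
            if r["_last_updated_sequence_number"] > sequence_number
        )


table = ToyTable()
seq_insert = table.insert([
    {"order_id": 1, "status": "new"},
    {"order_id": 2, "status": "new"},
])
seq_update = table.update(lambda r: r["order_id"] == 2, {"status": "shipped"})

# Only the second row was touched by the update commit.
print(table.changed_since(seq_insert))
```

The changed_since lookup is the kind of question row lineage is meant to answer cheaply: which rows changed after a known point, without diffing full snapshots.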
OpenLineage and External Platforms
For cross-system visual lineage graphs, organizations integrate Iceberg metadata with OpenLineage, an open standard that captures lineage events from Spark, Flink, dbt, and Airflow jobs. When Iceberg tables are used as inputs or outputs in these jobs, OpenLineage records the full dependency graph. Governance platforms like DataHub, Atlan, and Apache Atlas then ingest these events to provide visual lineage maps showing how data flows from source systems through transformations to downstream analytics and ML models.
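The shape of such an event, and how a governance platform might turn a stream of events into a dependency graph, can be sketched with the standard library alone. The top-level fields (eventType, eventTime, run, job, inputs, outputs, producer, schemaURL) follow the OpenLineage run event model. Everything else is an assumption made for illustration: the helper names run_event, dataset, and lineage_edges, the job and table names, the warehouse and producer URIs, and the spec version in the schema URL, which should be replaced with the version your tooling targets. Real integrations normally emit these events automatically rather than building them by hand.

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical values for this sketch; substitute your own.
WAREHOUSE = "s3://example-warehouse"
PRODUCER = "https://example.com/lineage-sketch"
# Assumed spec version; check the version your OpenLineage tooling uses.
SCHEMA_URL = "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent"


def dataset(namespace, name):
    """Identify a dataset by namespace and name, as OpenLineage does."""
    return {"namespace": namespace, "name": name}


def run_event(event_type, job_name, inputs, outputs, job_namespace="example-pipelines"):
    """Build a minimal run event as a plain dict."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": inputs,
        "outputs": outputs,
        "producer": PRODUCER,
        "schemaURL": SCHEMA_URL,
    }


def lineage_edges(events):
    """Collapse COMPLETE events into sorted (source, job, target) edges."""
    edges = set()
    for ev in events:
        if ev["eventType"] != "COMPLETE":
            continue
        for src in ev["inputs"]:
            for dst in ev["outputs"]:
                edges.add((src["name"], ev["job"]["name"], dst["name"]))
    return sorted(edges)


events = [
    run_event("COMPLETE", "clean_orders",
              [dataset(WAREHOUSE, "raw.orders")],
              [dataset(WAREHOUSE, "analytics.orders_clean")]),
    run_event("COMPLETE", "build_features",
              [dataset(WAREHOUSE, "analytics.orders_clean")],
              [dataset(WAREHOUSE, "ml.order_features")]),
]

print(json.dumps(events[0], indent=2))
for source, job, target in lineage_edges(events):
    print(f"{source} --[{job}]--> {target}")
```

Because each event names its inputs and outputs explicitly, chaining edges that share a dataset name is enough to trace a table from its raw source to the features a model was trained on.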

