Data Lineage is the recorded history of a dataset's origin, transformation steps, and downstream consumers. A lineage graph answers two fundamental questions: "where did this data come from?" (upstream lineage) and "what will break if I change this?" (downstream impact analysis). Without lineage, data estates are fragile: a schema change in one upstream table propagates silently through dozens of dependent transformations, producing incorrect results that no one detects until an analyst or an AI agent generates an obviously wrong number.
Lineage exists at two granularities. Table-level lineage tracks which source tables contributed to which output tables through which pipeline jobs. Column-level lineage goes further, tracking how each output column was derived: "revenue in this gold table comes from the sum of the order_amount column in the silver orders table, filtered to status = 'SHIPPED'." Column-level lineage is dramatically more useful for debugging but harder to collect automatically.
OpenLineage: The Open Standard
OpenLineage is an open standard specification for lineage metadata that enables different tools in the data stack to emit lineage events in a common format. Apache Airflow, dbt, Apache Spark, and Apache Flink all support OpenLineage event emission natively or through plugins. When these tools emit lineage events to a compatible backend (Marquez is the reference implementation), the lineage graph builds automatically as pipelines run, without requiring manual documentation.
Dremio captures query-level lineage for SQL views and virtual datasets, recording which physical tables contribute to each derived view. This lineage data is exposed through the Dremio catalog, giving catalog consumers a complete picture of how Dremio's virtual datasets relate to their underlying Iceberg tables.
Lineage as AI Trust Infrastructure
Data lineage is governance infrastructure for AI agents as much as for human analysts. When an AI agent evaluates whether to trust a dataset for a sensitive analysis, lineage provides the provenance evidence: this gold-tier aggregation table was derived from a certified, quality-checked silver table, which in turn came from the company's authoritative ERP system, not from a spreadsheet upload of unknown origin. Lineage transforms "I trust this data" from an assumption into a verifiable chain of custody.
Lineage is also essential for compliance. GDPR's right to erasure requires knowing which downstream tables contain data derived from a specific subject's records. Without column-level lineage, this analysis requires manually tracing every transformation in the pipeline stack. With lineage, it is a graph query.



