Data Discovery

A data estate of ten tables is navigable by memory. A data estate of ten thousand tables is navigable only with deliberate data discovery infrastructure. Data discovery is the collection of tools, practices, and search mechanisms that make datasets findable by anyone who needs them, regardless of whether they know the table's exact name, the team that created it, or the technical system it lives in.

For AI agents, discovery is even more critical. An agent cannot ask a colleague, "Do we have data on customer returns?" It discovers relevant datasets entirely through programmatic interfaces to the catalog and its search capabilities. If those interfaces are poor, the agent either guesses (and hallucinates table names) or fails to locate the data it needs.

Keyword Search

The most basic discovery mechanism is full-text search across catalog metadata. A user types "customer lifetime value" and the catalog returns tables and columns whose names, descriptions, or tags contain those terms. The quality of this approach is directly tied to how thoroughly the catalog is populated: a table described as "raw order data including shipment and return events from the ERP system, used to calculate customer revenue contributions" is highly searchable. A table named "tbl_raw_7" with no description is effectively invisible to keyword search, however valuable its contents.
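The dependence on populated metadata can be illustrated with a minimal sketch. Everything here is hypothetical: the CatalogEntry record, the tokenize and keyword_search helpers, and the table name orders_raw are invented for illustration and do not correspond to any particular catalog's API. Real catalogs typically use an inverted index with stemming and relevance scoring rather than a simple term overlap.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical catalog record; the field names are illustrative."""
    name: str
    description: str = ""
    tags: list[str] = field(default_factory=list)


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it on non-alphanumeric characters."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return set(cleaned.split())


def keyword_search(entries, query):
    """Rank entries by how many query terms appear in their metadata."""
    terms = tokenize(query)
    scored = []
    for entry in entries:
        searchable = tokenize(" ".join([entry.name, entry.description, *entry.tags]))
        hits = len(terms & searchable)
        if hits:
            scored.append((hits, entry.name))
    # Most matching terms first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


catalog = [
    CatalogEntry(
        name="orders_raw",  # assumed name for the well-described table
        description=(
            "raw order data including shipment and return events from the ERP "
            "system, used to calculate customer revenue contributions"
        ),
        tags=["orders", "erp"],
    ),
    CatalogEntry(name="tbl_raw_7"),  # no description, no tags
]

results = keyword_search(catalog, "customer return events")
print(results)
```

In this sketch the undocumented table can only match a query that happens to contain a fragment of its name, which is the practical meaning of being invisible to search.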

Dremio includes AI-enabled semantic search in its platform that goes beyond keyword matching. This feature uses metadata, column names, wikis, and tags to interpret the intent of a search query rather than matching exact terms, surfacing relevant data assets even when the user's language does not exactly match the documentation language.

Lineage-Based Discovery

Lineage graphs are a discovery tool as much as an audit tool. An analyst who sees a suspicious number in a dashboard can navigate backward through the lineage to identify every contributing table and transformation step. This backward tracing is particularly useful for understanding data trust levels: a table that originates from a source with known data quality issues is less trustworthy than one from a verified, well-maintained source, and lineage makes that provenance visible.
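Navigating backward through lineage amounts to a graph traversal. The following is a minimal sketch, assuming lineage is available as a mapping from each asset to its direct upstream sources. The asset names, the UPSTREAM mapping, and the KNOWN_QUALITY_ISSUES set are hypothetical; an actual catalog would expose lineage through its own API.

```python
from collections import deque

# Hypothetical lineage: each asset maps to the assets it is built from.
UPSTREAM = {
    "revenue_dashboard": ["monthly_revenue"],
    "monthly_revenue": ["orders_clean", "fx_rates"],
    "orders_clean": ["orders_raw"],
    "orders_raw": [],
    "fx_rates": [],
}

# Hypothetical set of sources flagged for data quality problems.
KNOWN_QUALITY_ISSUES = {"orders_raw"}


def upstream_assets(lineage, start):
    """Breadth-first walk from `start` back to every contributing asset."""
    seen = set()
    order = []
    queue = deque(lineage.get(start, []))
    while queue:
        asset = queue.popleft()
        if asset in seen:
            continue  # an asset reachable by several paths is reported once
        seen.add(asset)
        order.append(asset)
        queue.extend(lineage.get(asset, []))
    return order


contributors = upstream_assets(UPSTREAM, "revenue_dashboard")
suspect = [asset for asset in contributors if asset in KNOWN_QUALITY_ISSUES]
print(contributors)
print(suspect)
```

Intersecting the traversal result with a list of flagged sources is one simple way to surface provenance concerns for a downstream asset.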

Usage-Based Recommendations

Modern data catalogs track which tables analysts access, which columns they query, and which combinations they join. This usage metadata can be analyzed to surface recommendations: "analysts who query this customer table also frequently join it with the subscription events table." For a new team member or a first-time AI agent session, usage-based recommendations provide a shortcut to the established data patterns the organization already relies on, reducing the time spent searching for relevant data before starting an analysis.
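One simple way to derive such recommendations is to count how often tables appear together in the same query. The sketch below assumes a query log that has already been reduced to the set of tables each query touched. The QUERY_LOG contents and the build_cooccurrence and recommend helpers are hypothetical, and production systems would likely weight by recency, user, or query success as well.

```python
from collections import Counter
from itertools import combinations

# Hypothetical query log: each entry is the set of tables one query joined.
QUERY_LOG = [
    {"customers", "subscription_events"},
    {"customers", "subscription_events", "invoices"},
    {"customers", "invoices"},
    {"customers", "subscription_events"},
    {"orders", "invoices"},
]


def build_cooccurrence(log):
    """Count how many queries used each unordered pair of tables together."""
    counts = Counter()
    for tables in log:
        for a, b in combinations(sorted(tables), 2):
            counts[(a, b)] += 1
    return counts


def recommend(counts, table, top_n=3):
    """Return the tables most often queried together with `table`."""
    related = Counter()
    for (a, b), n in counts.items():
        if a == table:
            related[b] += n
        elif b == table:
            related[a] += n
    # Highest co-occurrence first; ties broken alphabetically.
    ranked = sorted(related.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]


pair_counts = build_cooccurrence(QUERY_LOG)
suggestions = recommend(pair_counts, "customers")
print(suggestions)
```

A new analyst or an agent starting from the customers table would be pointed first at the table it is most frequently joined with in this log.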
