Federated Querying

Federated querying allows a distributed SQL engine to execute a single query across multiple, physically separate data sources and return a unified result set to the caller. The data is not copied or replicated beforehand. Instead, the query engine translates the logical SQL plan into source-native sub-queries, routes each fragment to the appropriate backend, retrieves only the minimally scoped result from each, and merges those results in the engine before returning them to the analyst or AI agent.

Dremio's architecture is built around this pattern. Each connected source (Apache Iceberg tables in S3, live PostgreSQL, Elasticsearch, MongoDB, Snowflake, or other systems) appears as a first-class schema namespace in the unified query space. A user can write a single JOIN between iceberg.gold.customers and postgres.crm.accounts and receive a result as if both tables lived in the same database.
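To make the namespace idea concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in. It is not Dremio: two in-memory SQLite databases are attached under the hypothetical schema names lake and crm, and the tables, columns, and rows are invented for illustration. The point is only that one JOIN can address both namespaces.

```python
import sqlite3

# Toy stand-in: two separate in-memory SQLite databases play the roles of
# the Iceberg and PostgreSQL sources. Schema names and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS lake")
conn.execute("ATTACH DATABASE ':memory:' AS crm")

conn.execute("CREATE TABLE lake.customers (customer_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE crm.accounts (customer_id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO lake.customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")]
)
conn.executemany(
    "INSERT INTO crm.accounts VALUES (?, ?)", [(1, "active"), (2, "suspended")]
)

# A single JOIN spans both namespaces, as if they were one database.
rows = conn.execute(
    """
    SELECT c.name, a.status
    FROM lake.customers AS c
    JOIN crm.accounts AS a ON a.customer_id = c.customer_id
    ORDER BY c.customer_id
    """
).fetchall()
print(rows)
```

In a real deployment the two sources would be different systems on different hosts; the sketch only mirrors how the query looks to the person writing it.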

How the Engine Routes the Work

When Dremio receives a federated query, the optimizer identifies which parts of the query can be pushed down to each source. For an Iceberg source, the engine pushes column projections and row filters (predicate pushdown) directly into the Iceberg scan, so that data files and Parquet row groups whose statistics rule out a match are skipped rather than read. For a relational source like PostgreSQL, it translates the relevant SQL fragment into a PostgreSQL-compatible query and submits it over JDBC or the Arrow Flight SQL protocol. The engine then performs any cross-source joins or aggregations locally using its vectorized execution engine.
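The routing described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Dremio's planner: two isolated sqlite3 connections stand in for the two backends, the helper names make_source and federated_join are hypothetical, and the local merge is a plain dictionary-based hash join rather than a vectorized one.

```python
import sqlite3

def make_source(ddl, insert_sql, rows):
    """Create an isolated in-memory database standing in for one backend."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn

# Hypothetical data: an order fact table and a CRM accounts table.
lake = make_source(
    "CREATE TABLE orders (customer_id INTEGER, amount REAL, note TEXT)",
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 750.0, "x"), (2, 120.0, "y"), (3, 980.0, "z")],
)
crm = make_source(
    "CREATE TABLE accounts (customer_id INTEGER, region TEXT, owner TEXT)",
    "INSERT INTO accounts VALUES (?, ?, ?)",
    [(1, "Northeast", "kim"), (2, "Northeast", "lee"), (3, "West", "raj")],
)

def federated_join(min_amount, region):
    # Fragment 1: projection and filter are pushed to the "lake" source.
    orders = lake.execute(
        "SELECT customer_id, amount FROM orders WHERE amount > ?",
        (min_amount,),
    ).fetchall()
    # Fragment 2: projection and filter are pushed to the "crm" source.
    accounts = crm.execute(
        "SELECT customer_id, region FROM accounts WHERE region = ?",
        (region,),
    ).fetchall()
    # Local hash join on customer_id over the already-reduced fragments.
    region_by_customer = dict(accounts)
    return [
        (cid, amount, region_by_customer[cid])
        for cid, amount in orders
        if cid in region_by_customer
    ]

result = federated_join(500, "Northeast")
print(result)
```

Each source sees only the fragment it can answer, returns only the columns and rows the join needs, and never learns about the other source. The cross-source step happens entirely in the calling process.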

The key cost-reduction mechanism here is filter pushdown. If an AI agent asks for customers in the Northeast region who placed orders over $500 last quarter, the engine applies those filters at each source before transferring data. Only the rows that satisfy the filters cross the network, which is typically a small fraction of what a full table scan would transfer.
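A rough way to see the effect is to count how many rows leave the source with and without pushdown. The sketch below uses a synthetic in-memory list as a hypothetical source table; the Order type, the region and quarter labels, and the value distribution are all assumptions made for illustration, so the resulting ratio says nothing about real workloads.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Order:
    customer_id: int
    region: str
    amount: float
    quarter: str

# Hypothetical source table: 1,000 orders with deterministic synthetic values.
REGIONS = ["Northeast", "South", "Midwest", "West"]
QUARTERS = ["2024Q1", "2024Q2", "2024Q3", "2024Q4"]
orders = [
    Order(
        customer_id=i,
        region=REGIONS[i % 4],
        amount=float((i * 37) % 1000),
        quarter=QUARTERS[(i // 4) % 4],
    )
    for i in range(1000)
]

def matches(order):
    """The agent's filter: Northeast, over $500, last quarter."""
    return (
        order.region == "Northeast"
        and order.amount > 500
        and order.quarter == "2024Q4"
    )

# Without pushdown: every row is shipped, then filtered in the engine.
shipped_without = len(orders)

# With pushdown: the source applies the filter and ships only the survivors.
shipped_with = sum(1 for order in orders if matches(order))

print(f"rows shipped without pushdown: {shipped_without}")
print(f"rows shipped with pushdown:    {shipped_with}")
```

Both strategies produce the same final answer; the difference is where the filter runs and therefore how many rows have to travel.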

When Federated Querying Is the Right Tool

Federated querying is the best fit when the data lives in a source system and does not need to be replicated into the lakehouse. Point lookups against low-volume operational tables, joins between a small reference dataset from a SaaS application and a large Iceberg fact table, and ad-hoc exploration of a new data source before building a pipeline are all good candidates.

It is less appropriate for large-scale aggregations that require full table scans on heavily loaded OLTP systems. Scanning 100 million rows from a production PostgreSQL instance to answer an analytical question will degrade that database's response time for its operational workload. In those cases, a Change Data Capture pipeline that writes the data to an Iceberg table is the better pattern.

What This Means for AI Agents

Federated querying significantly expands the data landscape available to AI agents without requiring data engineering work first. An agent investigating a customer behavior anomaly can pull historical transaction data from Iceberg and the customer's current account status from the CRM in a single query, getting a complete picture immediately. Once the investigation shows which data sources are consistently valuable, a data engineer can build formal ingestion pipelines for them. The workflow starts exploratory and becomes more structured through iteration.
