Federated Analytics

Federated Analytics is the ability of a query engine to execute a single SQL query that spans multiple, physically separate data sources and returns a unified result set, as if all the data lived in one database. It differs from data consolidation approaches (such as ETL into a central warehouse) in that data stays in place at each source and is queried directly.

The Federated Query Lifecycle

When a federated query arrives at an engine like Dremio or Trino, the optimizer identifies which portions of the query reference which data sources. It then:

Pushes source-specific filter predicates down to each individual source (a SQL WHERE clause sent to PostgreSQL, a manifest scan that prunes data files in Iceberg, a filtered REST call to a SaaS API)
Retrieves only the filtered result sets from each source across the network
Performs the join, aggregation, or union in the central query engine's memory using the results from all sources
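
The three steps above can be sketched in Python. This is a simplified illustration, not how Dremio or Trino are implemented: an in-memory SQLite database stands in for an operational source such as PostgreSQL, and a plain list of rows stands in for a second source. All names (make_inventory_source, scan_inventory, scan_sales, federated_join, SALES_ROWS) and the sample data are hypothetical.

```python
import sqlite3


def make_inventory_source():
    """Hypothetical stand-in for an operational database such as PostgreSQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE inventory (sku TEXT, on_hand INTEGER)")
    conn.executemany(
        "INSERT INTO inventory VALUES (?, ?)",
        [("A", 5), ("B", 0), ("C", 12)],
    )
    return conn


# Hypothetical stand-in for a second source (for example, lakehouse files).
SALES_ROWS = [
    {"sku": "A", "year": 2023, "units": 100},
    {"sku": "A", "year": 2023, "units": 50},
    {"sku": "B", "year": 2023, "units": 70},
    {"sku": "C", "year": 2022, "units": 30},
    {"sku": "C", "year": 2023, "units": 20},
]


def scan_inventory(conn, min_on_hand):
    # Step 1: the predicate is pushed down as a WHERE clause,
    # so the source itself does the filtering.
    cur = conn.execute(
        "SELECT sku, on_hand FROM inventory WHERE on_hand >= ? ORDER BY sku",
        (min_on_hand,),
    )
    return cur.fetchall()


def scan_sales(rows, year):
    # Step 1 for the second source: filter where the data lives.
    return [r for r in rows if r["year"] == year]


def federated_join(conn, sales_rows, min_on_hand, year):
    # Step 2: only the filtered rows leave each source.
    inventory = scan_inventory(conn, min_on_hand)
    sales = scan_sales(sales_rows, year)

    # Step 3: aggregate and join in the "engine's" memory.
    units_by_sku = {}
    for row in sales:
        units_by_sku[row["sku"]] = units_by_sku.get(row["sku"], 0) + row["units"]

    return [
        (sku, on_hand, units_by_sku[sku])
        for sku, on_hand in inventory
        if sku in units_by_sku
    ]
```

Calling federated_join(make_inventory_source(), SALES_ROWS, 1, 2023) joins in-stock items with their 2023 sales. Rows that fail either filter never reach the join step, which is the point of pushdown.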

Use Cases

Federated analytics enables several patterns that would otherwise require substantial ETL investment:

Operational + Historical Joins: Joining live OLTP data (current inventory from PostgreSQL) with historical Iceberg lakehouse data (sales trends from the past 3 years) in a single query.
Cross-Cloud Analytics: Querying data that lives in Amazon S3, Azure Data Lake Storage, and Google Cloud Storage simultaneously without copying data between clouds.
SaaS Data Integration: Joining Salesforce CRM data with your internal Iceberg revenue tables without building a nightly ETL pipeline.
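
As a rough illustration of the first pattern, the sketch below uses SQLite's ATTACH DATABASE to put two separate in-memory databases behind one connection, so a single SQL statement can join across them. SQLite is only a stand-in here: in a real federated engine the two sides would be distinct systems (such as PostgreSQL and an Iceberg catalog) addressed by the engine's own catalog names. The schema names main and lake, the tables, the function run_demo, and the sample data are all hypothetical.

```python
import sqlite3


def run_demo(min_year):
    # "main" plays the operational source; "lake" plays the historical one.
    conn = sqlite3.connect(":memory:")
    conn.execute("ATTACH DATABASE ':memory:' AS lake")

    conn.execute("CREATE TABLE main.inventory (sku TEXT PRIMARY KEY, on_hand INTEGER)")
    conn.execute("CREATE TABLE lake.sales (sku TEXT, sale_year INTEGER, units INTEGER)")

    conn.executemany(
        "INSERT INTO main.inventory VALUES (?, ?)",
        [("widget", 40), ("gadget", 3)],
    )
    conn.executemany(
        "INSERT INTO lake.sales VALUES (?, ?, ?)",
        [
            ("widget", 2021, 500),
            ("widget", 2022, 300),
            ("widget", 2023, 350),
            ("gadget", 2023, 90),
            ("gizmo", 2023, 10),
        ],
    )

    # One query spans both databases via schema-qualified table names.
    query = """
        SELECT i.sku, i.on_hand, SUM(s.units) AS units_sold
        FROM main.inventory AS i
        JOIN lake.sales AS s ON s.sku = i.sku
        WHERE s.sale_year >= ?
        GROUP BY i.sku, i.on_hand
        ORDER BY i.sku
    """
    rows = conn.execute(query, (min_year,)).fetchall()
    conn.close()
    return rows
```

The query text names a table in each database and leaves the cross-source join to the engine. That is the shape a federated query takes from the user's side, whatever the sources are.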