Distributed Query Processing is the set of techniques and coordination mechanisms that allow a cluster of networked computers to cooperate in executing a single SQL query. It is the foundation of all modern cloud-scale data lakehouses, enabling analytical workloads that would be physically impossible on a single machine.
The Coordinator-Worker Model
Most distributed query engines follow a coordinator-worker model. A single coordinator node (sometimes called the master or driver) is responsible for:
- Parsing the incoming SQL query
- Consulting the catalog (e.g., the Iceberg REST Catalog) for table metadata and statistics
- Generating a distributed execution plan via the Cost-Based Optimizer
- Breaking the plan into stages and distributing tasks to worker nodes
- Collecting and merging results from worker nodes to return to the client
Worker nodes execute the actual data processing: reading Parquet files from S3, applying filters, evaluating expressions, joining partial results, and computing partial aggregations.
Data Shuffling
The most expensive operation in distributed query processing is the "shuffle," where data must be redistributed across the network between stages. For a distributed GROUP BY operation, each worker first computes local partial aggregations, then data with the same grouping key must be shuffled to the same worker node for final aggregation. Minimizing unnecessary shuffles is a primary focus of distributed query optimizer design.
Fault Tolerance
Unlike single-machine databases, distributed query engines must handle worker node failures gracefully. Apache Spark achieves fault tolerance through RDD lineage (re-computing lost partitions from the original data). Trino and Dremio use a different approach: spilling intermediate results to resilient storage or simply re-submitting failed tasks. The choice between fault tolerance strategies significantly impacts the performance profile and operational complexity of the query engine.

