Data orchestration is the process of coordinating, scheduling, and monitoring the execution of multi-step data pipelines. In a data lakehouse, a typical pipeline might extract data from source systems via change data capture (CDC), load it into Bronze Iceberg tables, run dbt transformations to produce Silver and Gold tables, execute data quality checks, trigger compaction on hot tables, and notify downstream consumers that fresh data is available. An orchestration platform manages the dependencies between these steps, ensures they execute in the correct order, handles failures, and provides visibility into pipeline health.
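To make the dependency-management idea concrete, here is a minimal sketch in plain Python using only the standard library. It is not how Airflow, Dagster, or Prefect are implemented; the step names, the `PIPELINE` mapping, and the `run_pipeline` helper are all hypothetical, and the choice to make both compaction and notification depend on the quality checks is an assumption made for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical step names mirroring the pipeline described above.
# Each key maps to the set of steps that must succeed before it can start.
# Assumption: compaction and notification both wait on the quality checks.
PIPELINE = {
    "extract_cdc": set(),
    "load_bronze": {"extract_cdc"},
    "dbt_silver_gold": {"load_bronze"},
    "quality_checks": {"dbt_silver_gold"},
    "compact_hot_tables": {"quality_checks"},
    "notify_consumers": {"quality_checks"},
}


def run_pipeline(dependencies, tasks, max_retries=1):
    """Run tasks in dependency order and return a status per step.

    A step is retried up to max_retries times after its first failure.
    Any step with an upstream that did not succeed is marked "skipped".
    """
    status = {}
    # static_order() yields every step after all of its upstream steps.
    for step in TopologicalSorter(dependencies).static_order():
        upstream = dependencies.get(step, ())
        if any(status.get(dep) != "success" for dep in upstream):
            status[step] = "skipped"
            continue
        for _attempt in range(max_retries + 1):
            try:
                tasks[step]()
                status[step] = "success"
                break
            except Exception:
                status[step] = "failed"
    return status


def failing_quality_check():
    # Stand-in for a data quality check that rejects the batch.
    raise RuntimeError("quality check failed")


if __name__ == "__main__":
    tasks = {name: (lambda: None) for name in PIPELINE}
    tasks["quality_checks"] = failing_quality_check
    for step, state in run_pipeline(PIPELINE, tasks).items():
        print(f"{step}: {state}")
```

In this sketch a failed quality check prevents compaction and consumer notification from running, which is the kind of ordering and failure handling an orchestration platform provides. Real platforms add scheduling, persistent state, parallel execution, and monitoring on top of this.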

Core Orchestration Concepts

Popular Orchestration Tools

Apache Airflow is the most widely deployed orchestration platform, with a large ecosystem of providers for Spark, dbt, Iceberg, S3, and most cloud services. Dagster and Prefect are newer, Python-native alternatives with stronger support for data-centric concepts such as software-defined assets. For teams that use dbt as their primary transformation layer, dbt Cloud's built-in scheduling and orchestration features may be sufficient without a separate orchestration platform.
