Data orchestration is the process of coordinating, scheduling, and monitoring the execution of multi-step data pipelines. In a data lakehouse, a typical pipeline might: extract data from source systems via CDC, load it into Bronze Iceberg tables, run dbt transformations to produce Silver and Gold tables, execute data quality checks, trigger compaction on hot tables, and notify downstream consumers that fresh data is available. An orchestration platform manages the dependencies between these steps, ensures they execute in the correct order, handles failures, and provides visibility into pipeline health.
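The dependency handling described above can be sketched with Python's standard-library `graphlib` module. This is a minimal illustration, not how any particular orchestrator is implemented: the task names are hypothetical labels for the steps of the example pipeline, and the exact edges between them (for instance, that compaction runs after the dbt step) are assumptions.

```python
from graphlib import TopologicalSorter

# Hypothetical task names. Each key maps to the set of tasks it depends on.
# The edges are assumed for illustration; a real pipeline may differ.
PIPELINE = {
    "extract_cdc": set(),
    "load_bronze": {"extract_cdc"},
    "dbt_silver_gold": {"load_bronze"},
    "quality_checks": {"dbt_silver_gold"},
    "compact_hot_tables": {"dbt_silver_gold"},
    "notify_consumers": {"quality_checks"},
}


def execution_order(dag):
    """Return one valid run order for the given dependency mapping.

    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    return list(TopologicalSorter(dag).static_order())


order = execution_order(PIPELINE)
for step in order:
    print(step)
```

Every task appears after all of its dependencies in the returned order. Tasks with no path between them, such as `quality_checks` and `compact_hot_tables` here, have no fixed relative order and could run in parallel.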
Core Orchestration Concepts
- DAG (Directed Acyclic Graph): The data structure used by orchestrators to model pipeline dependencies. Each node is a task; edges represent dependencies. Because the graph contains no cycles, a valid execution order always exists, and no task ever waits, directly or indirectly, on itself.
- Scheduling: Triggering pipelines on a time-based schedule (daily, hourly, every 15 minutes) or in response to events (for example, when new data arrives in S3 or when a sensor detects a new Iceberg snapshot).
- Retries and Alerting: Automatically retrying failed tasks with configurable backoff, and alerting data engineers via Slack, PagerDuty, or email when pipelines fail or take longer than expected.
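The retry-and-alert behavior in the last item can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the API of any orchestrator: `run_with_retries`, its parameters, and the `alert` callback are hypothetical names, and a real deployment would send the alert to Slack, PagerDuty, or email rather than print it.

```python
import time


def run_with_retries(task, max_retries=3, base_delay=1.0,
                     alert=print, sleep=time.sleep):
    """Run task() and retry on failure with exponential backoff.

    The wait before retry number n (counting from 0) is base_delay * 2**n
    seconds. Once max_retries retries have also failed, alert(message) is
    called and the last exception is re-raised.
    All names here are hypothetical, chosen for illustration.
    """
    for attempt in range(max_retries + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == max_retries:
                # Retries exhausted: notify a human, then surface the error.
                alert(f"task failed after {attempt + 1} attempts: {exc}")
                raise
            # Back off before the next attempt: 1x, 2x, 4x, ... base_delay.
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` and `alert` as parameters keeps the function easy to test: a test can record the requested delays instead of actually waiting, and capture alerts in a list instead of sending them.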
Popular Orchestration Tools
Apache Airflow is the most widely deployed orchestration platform, with a large ecosystem of providers for Spark, dbt, Iceberg, S3, and most cloud services. Dagster and Prefect are newer, Python-native alternatives; Dagster in particular is built around data-centric concepts such as software-defined assets. For teams using dbt as their primary transformation layer, dbt Cloud's scheduling and orchestration features may be sufficient without a separate orchestration platform.

