Apache Airflow (originally developed at Airbnb and donated to the Apache Software Foundation in 2016) is the industry-leading open-source workflow orchestration platform. It allows data engineers to define pipelines as Python code (DAGs), schedule them on time-based or event-driven triggers, monitor execution in a rich web UI, and receive alerts on failures. Airflow is the most commonly used orchestrator in enterprise lakehouse architectures.

Airflow in the Lakehouse Context

A typical Airflow DAG orchestrating an Iceberg lakehouse might sequence: a sensor that waits for new files to arrive in S3, a SparkSubmitOperator that runs the ingestion job writing to a Bronze Iceberg table, a BashOperator running dbt transformations to produce Silver and Gold Iceberg tables, a PythonOperator executing Great Expectations quality checks, a SparkSubmitOperator triggering Iceberg compaction, and finally a notification task alerting downstream teams that fresh data is available.

Airflow Providers for Iceberg

The Apache Airflow Provider ecosystem includes integrations for all major cloud platforms and data tools. For Iceberg lakehouses, the relevant providers include: the Apache Spark provider (SparkSubmitOperator, SparkKubernetesOperator), the Amazon AWS provider (EMR operators, Glue operators, S3 sensors), the dbt provider (DbtRunOperator, DbtTestOperator), and the Apache Hive provider (for catalog interactions). Managed Airflow services like Amazon MWAA (Managed Workflows for Apache Airflow) and Google Cloud Composer eliminate the operational burden of self-hosting Airflow, allowing teams to focus on pipeline logic.

Master the Agentic Lakehouse

Architecting an Apache Iceberg Lakehouse

Architecting an Apache Iceberg Lakehouse

Buy on Manning
The AI Lakehouse

The AI Lakehouse

Buy on Amazon