Lakehouse Architecture

The Lakehouse Architecture is not a single product. It is a design pattern that specifies how four distinct layers of an enterprise data platform should relate to each other. Each layer has a clear responsibility, and the interfaces between them are defined by open standards rather than proprietary APIs. This open-standards approach is what distinguishes the Lakehouse from vendor-managed cloud data warehouses and what gives organizations the flexibility to evolve individual layers independently.

The Four Layers

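Layer 1: Cloud Object Storage. The physical foundation. Data files live here, typically in Parquet format (Iceberg also supports ORC and Avro). This layer is largely fungible: the layers above operate the same way whether the files sit in AWS S3, Azure ADLS, or Google Cloud Storage, because engines reach storage through a file I/O abstraction rather than a provider-specific interface. The three services do not share a single API (ADLS in particular is not S3-compatible), so a move between them means copying the files and updating the file locations recorded in table metadata, but table definitions and queries stay the same. Storage cost is low, roughly $0.02 per GB per month for standard tiers.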

Layer 2: Open Table Format (Apache Iceberg). This layer adds reliability to the raw file layer. The Iceberg metadata tree (a table metadata file that records snapshots, one manifest list per snapshot, and the manifest files each list points to) tracks which data files constitute the current committed state of each table. It provides ACID transactions, schema evolution, partition evolution, and time travel. This layer is also open: any engine that implements the Iceberg spec can read and write tables.
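The way a metadata tree yields atomic commits and time travel can be illustrated with a toy model. This is a minimal sketch, not Iceberg's actual file layout or API: the class names `Manifest`, `Snapshot`, and `TableMetadata` and the method names are hypothetical, and real manifests also carry per-file statistics and partition data.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Manifest:
    """Hypothetical stand-in for a manifest: a list of data file paths."""
    data_files: tuple[str, ...]


@dataclass(frozen=True)
class Snapshot:
    """One committed table state; its manifest list points at manifests."""
    snapshot_id: int
    manifest_list: tuple[Manifest, ...]

    def data_files(self) -> list[str]:
        return [path for m in self.manifest_list for path in m.data_files]


@dataclass
class TableMetadata:
    """Toy table metadata: a snapshot log plus a pointer to the current one."""
    snapshots: dict[int, Snapshot] = field(default_factory=dict)
    current_snapshot_id: int | None = None

    def commit_append(self, new_files: list[str]) -> int:
        """Append files by writing a new snapshot; older snapshots stay intact."""
        parent = self.snapshots.get(self.current_snapshot_id)
        inherited = parent.manifest_list if parent else ()
        snapshot_id = len(self.snapshots) + 1
        self.snapshots[snapshot_id] = Snapshot(
            snapshot_id, inherited + (Manifest(tuple(new_files)),)
        )
        # The commit itself is this single pointer swap.
        self.current_snapshot_id = snapshot_id
        return snapshot_id

    def scan(self, snapshot_id: int | None = None) -> list[str]:
        """Files in the current state, or in an older snapshot (time travel)."""
        sid = self.current_snapshot_id if snapshot_id is None else snapshot_id
        return self.snapshots[sid].data_files() if sid is not None else []
```

Because an append only adds a snapshot and moves one pointer, readers see either the old state or the new one, and scanning with an earlier snapshot ID returns the table as it was at that commit.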

Layer 3: Open Catalog (Apache Polaris). The catalog tracks every table's location, schema, access policy, and governance tags. It implements the Iceberg REST Catalog API, which means any engine that supports that spec can discover and access tables through a standardized interface. This is where data stewards manage column classifications, row-level security rules, and role assignments for both human users and AI agents.
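The standardized interface is, concretely, a set of HTTP routes. As a hedged illustration, the helper below builds the URL an engine would request to load a table, following the route shape in the Iceberg REST Catalog spec (`/v1/{prefix}/namespaces/{namespace}/tables/{table}`, with multi-level namespaces joined by the unit separator character). The function name `load_table_url`, the host, and the base path are hypothetical; a real deployment's base URL and prefix come from its own configuration.

```python
from urllib.parse import quote

# The REST spec joins the levels of a multi-part namespace with the
# unit separator (0x1F), which is percent-encoded in the URL path.
NAMESPACE_SEPARATOR = "\x1f"


def load_table_url(base_url: str, namespace: list[str], table: str,
                   prefix: str = "") -> str:
    """Build the 'load table' URL for an Iceberg REST catalog (sketch)."""
    encoded_namespace = quote(NAMESPACE_SEPARATOR.join(namespace), safe="")
    parts = [base_url.rstrip("/"), "v1"]
    if prefix:
        # The prefix is optional and is supplied by the catalog's config.
        parts.append(quote(prefix, safe=""))
    parts += ["namespaces", encoded_namespace, "tables", quote(table, safe="")]
    return "/".join(parts)


# Hypothetical host and prefix, for illustration only.
url = load_table_url(
    "https://catalog.example.com/api/catalog/",
    ["sales", "eu"],
    "orders",
    prefix="prod",
)
```

Any engine that can issue this request and parse the response can locate the table's current metadata, which is why the catalog implementation can be swapped without touching the engines.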

Layer 4: Query and Execution Engine (Dremio). The engine reads table metadata from the catalog, scans the appropriate Parquet files from object storage, executes the analytical computation, and returns results. It handles predicate pushdown (filtering at the storage level to minimize data transfer), reflection-based query acceleration, and enforcement of the governance policies defined in Layer 3. Multiple engines can sit at this layer simultaneously.
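One common form of predicate pushdown is file pruning: the engine compares a filter against per-file minimum and maximum column values and skips files that cannot contain matching rows. The sketch below shows that idea in isolation. It is not Dremio's implementation; `FileStats` and `prune_files` are hypothetical names, and the statistics are assumed to be available from table metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file statistics for one integer column."""
    path: str
    min_value: int
    max_value: int


def prune_files(files: list[FileStats], lower: int, upper: int) -> list[str]:
    """Return paths of files whose [min, max] range overlaps [lower, upper].

    Files outside the range are never read, which is where the saving
    in data transfer comes from.
    """
    return [
        f.path
        for f in files
        if f.max_value >= lower and f.min_value <= upper
    ]


files = [
    FileStats("orders/part-0.parquet", 1, 10),
    FileStats("orders/part-1.parquet", 11, 20),
    FileStats("orders/part-2.parquet", 21, 30),
]

# A filter such as `order_id BETWEEN 15 AND 22` only needs two of three files.
to_scan = prune_files(files, 15, 22)
```

Pruning is conservative: a file that overlaps the range may still contain no matching rows, so the engine applies the filter again while reading the files that remain.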

The Fifth Layer: Semantic and AI

A fifth layer has emerged in modern Lakehouse deployments. Sitting above the query engine, the Semantic and AI layer encodes business logic (metric definitions, table descriptions, contextual metadata) and hosts the orchestration framework for AI agents. This layer reads from the catalog and execution engine layers but does not touch the storage layers directly. It is the layer most directly responsible for making the Lakehouse "agentic."

Substitutability as a Design Principle

The defining architectural principle of the Lakehouse is that any layer can be substituted without affecting the others. An organization can replace Dremio with Spark for a specific workload without migrating data. It can replace Apache Polaris with a different Iceberg REST Catalog implementation without changing the table format. This substitutability is the source of the Lakehouse's long-term cost advantage over vendor-locked platforms.
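Substitutability depends on each layer relying only on its neighbor's interface, not its implementation. The sketch below expresses that with structural typing. Every name in it (`Catalog`, `Engine`, `InMemoryCatalog`, `BatchEngine`, `StreamingEngine`, `run_count`) is hypothetical, and the in-memory dictionary merely stands in for object storage.

```python
from typing import Protocol


class Catalog(Protocol):
    def table_files(self, table: str) -> list[str]: ...


class Engine(Protocol):
    def count_rows(self, files: list[str]) -> int: ...


# Stand-in for object storage: file path -> rows in that file.
STORAGE = {
    "orders/f1.parquet": [1, 2, 3],
    "orders/f2.parquet": [4, 5],
}


class InMemoryCatalog:
    """Maps table names to the files that make up their current state."""

    def __init__(self, tables: dict[str, list[str]]) -> None:
        self._tables = tables

    def table_files(self, table: str) -> list[str]:
        return list(self._tables[table])


class BatchEngine:
    """Counts rows one whole file at a time."""

    def count_rows(self, files: list[str]) -> int:
        return sum(len(STORAGE[f]) for f in files)


class StreamingEngine:
    """Counts rows one at a time; a different strategy, the same contract."""

    def count_rows(self, files: list[str]) -> int:
        total = 0
        for f in files:
            for _ in STORAGE[f]:
                total += 1
        return total


def run_count(catalog: Catalog, engine: Engine, table: str) -> int:
    """The workload depends only on the two interfaces above."""
    return engine.count_rows(catalog.table_files(table))


catalog = InMemoryCatalog({"orders": ["orders/f1.parquet", "orders/f2.parquet"]})
batch_result = run_count(catalog, BatchEngine(), "orders")
streaming_result = run_count(catalog, StreamingEngine(), "orders")
```

Swapping the engine changes nothing about the catalog or the stored files, which mirrors replacing one query engine with another for a workload without migrating data.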
