Compute-Storage Separation

Compute-storage separation is the architectural decision to decouple the compute resources that execute queries from the storage system where data persists. The idea is not new: database researchers proposed it in the 1980s. Cloud object storage is what made it economically viable at enterprise scale for the first time, and the Data Lakehouse is built on this principle as its foundation.

In the traditional on-premises warehouse model (Teradata, early Hadoop clusters), compute and storage were physically co-located. Adding more query capacity meant buying servers with both CPUs and disks. Adding more storage meant buying those same servers, CPUs included. You could not scale one without scaling the other. This coupling made on-premises data warehouses expensive, inflexible, and prone to capacity planning errors in either direction.
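The cost of that coupling can be made concrete with a little arithmetic. The sketch below is illustrative only: the node specification (32 cores and 20 TB per node) and the workload figures are assumed numbers, not vendor data. It shows how the larger of the two requirements dictates the purchase and leaves the other resource idle.

```python
import math

# Hypothetical node spec for a coupled appliance: every node
# bundles a fixed number of CPU cores with a fixed amount of disk.
CORES_PER_NODE = 32
TB_PER_NODE = 20


def coupled_nodes_needed(cores_required: int, tb_required: int) -> int:
    """Nodes to buy when CPU and disk are only sold together."""
    by_compute = math.ceil(cores_required / CORES_PER_NODE)
    by_storage = math.ceil(tb_required / TB_PER_NODE)
    # Whichever requirement is larger dictates the purchase.
    return max(by_compute, by_storage)


def idle_resources(cores_required: int, tb_required: int) -> dict:
    """Report how much purchased capacity sits unused."""
    nodes = coupled_nodes_needed(cores_required, tb_required)
    return {
        "nodes": nodes,
        "idle_cores": nodes * CORES_PER_NODE - cores_required,
        "idle_tb": nodes * TB_PER_NODE - tb_required,
    }


# A storage-heavy workload (assumed figures): 64 cores, 400 TB.
result = idle_resources(cores_required=64, tb_required=400)
print(result)
```

With these assumed figures, storage alone forces a 20-node purchase even though two nodes would cover the compute requirement, so most of the purchased cores sit idle.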

Cloud Object Storage Changes the Economics

Cloud object storage (S3, ADLS Gen2, GCS) is an independent, durable service with effectively unlimited capacity. Data stored in S3 persists regardless of whether any compute resources are attached. The storage price per gigabyte is the same whether zero or one thousand query engines are reading the data, although request and data-transfer charges are billed separately. This permanence and independence are what make compute-storage separation practical.

Query engines in this model are stateless. A Dremio cluster has no data of its own. It reads from S3, executes queries against the data in memory, returns results, and releases its resources. The cluster can be sized up, scaled down, or shut down entirely without affecting the durability or accessibility of the underlying data. A new Dremio cluster started the next day reads from the same S3 data with full continuity.
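The lifecycle described above can be sketched in a few lines. This is a toy model, not Dremio's or S3's API: `ObjectStore` and `QueryEngine` are hypothetical classes, and an in-memory dictionary stands in for the durable object store. The point it illustrates is that the engine holds only a reference to storage, so discarding one engine and starting a differently sized one yields the same answer from the same data.

```python
class ObjectStore:
    """In-memory stand-in for a durable object store (hypothetical)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, rows):
        self._objects[key] = list(rows)

    def get(self, key):
        # Return a copy so readers cannot mutate stored data.
        return list(self._objects[key])


class QueryEngine:
    """Toy stateless engine: references storage but owns no data."""

    def __init__(self, store, workers):
        self.store = store
        self.workers = workers  # cluster size; irrelevant to the data

    def total(self, key, column):
        # Read from storage, compute in memory, return the result.
        return sum(row[column] for row in self.store.get(key))


store = ObjectStore()
store.put("sales/part-0", [{"amount": 10}, {"amount": 32}])

monday = QueryEngine(store, workers=4)
monday_total = monday.total("sales/part-0", "amount")
del monday  # the cluster is shut down; the data is untouched

tuesday = QueryEngine(store, workers=16)  # a new, larger cluster
tuesday_total = tuesday.total("sales/part-0", "amount")
print(monday_total, tuesday_total)
```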

Multi-Engine Access

Compute-storage separation enables the multi-engine architecture that defines the open lakehouse. Multiple query engines can read the same physical Iceberg tables simultaneously without any data copying or replication. Dremio handles interactive BI and AI agent queries. Apache Spark runs batch transformation jobs. Apache Flink writes streaming records. All three access the same Parquet files in S3, coordinated by the Iceberg catalog, which manages concurrent write visibility through optimistic concurrency control: each writer commits by atomically swapping the table's current metadata pointer, and a writer that loses the race retries against the new table state.
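Optimistic concurrency control is easier to follow in code. The sketch below is a deliberately simplified model and not the Iceberg API: the `Catalog` class, the integer version counter, and the file names are all hypothetical. It shows the general pattern of reading the current table state, attempting a compare-and-swap commit, and retrying if another writer committed first.

```python
import threading


class CommitConflict(Exception):
    """Raised when the table changed since the writer last read it."""


class Catalog:
    """Toy catalog tracking one table's current snapshot (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._version = 0
        self._files = ()

    def load(self):
        with self._lock:
            return self._version, self._files

    def commit(self, expected_version, new_files):
        # Atomic compare-and-swap: succeed only if nobody else
        # has committed since this writer read the table.
        with self._lock:
            if self._version != expected_version:
                raise CommitConflict("table changed; reload and retry")
            self._version += 1
            self._files = tuple(new_files)
            return self._version


def append_file(catalog, file_name, max_retries=5):
    """Append a data file, retrying on conflict."""
    for _ in range(max_retries):
        version, files = catalog.load()
        try:
            return catalog.commit(version, files + (file_name,))
        except CommitConflict:
            continue  # re-read the latest state and try again
    raise CommitConflict("gave up after retries")


catalog = Catalog()
stale_version, stale_files = catalog.load()   # writer A reads version 0
append_file(catalog, "spark-0001.parquet")    # writer B commits first

try:
    # Writer A commits against its stale view and is rejected.
    catalog.commit(stale_version, stale_files + ("flink-0001.parquet",))
    conflict_detected = False
except CommitConflict:
    conflict_detected = True

append_file(catalog, "flink-0001.parquet")    # writer A retries
final_version, final_files = catalog.load()
print(conflict_detected, final_version, final_files)
```

Readers are never blocked in this pattern: they see either the old snapshot or the new one, and only conflicting writers pay the cost of a retry.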

This is fundamentally different from a proprietary cloud data warehouse, where data can only be accessed through that vendor's own engine. Compute-storage separation gives organizations the freedom to use the best engine for each workload type without duplicating data between platforms.

The Practical Impact

For an organization running a mix of BI, ML, and AI agent workloads, compute-storage separation means each workload type can have its own appropriately sized and priced compute cluster. An AI agent investigation cluster running on spot instances during business hours costs a fraction of a permanently provisioned warehouse cluster. Data engineering batch jobs can run on large Spark clusters that scale down to zero between runs. Each cluster is optimized for its specific workload while all clusters share the same authoritative data in S3.
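A back-of-the-envelope calculation shows where the savings come from. All figures below are assumptions chosen for illustration (an always-on cluster at 10.00 per hour, and a spot-priced cluster at 3.00 per hour running 8 hours a day on 22 working days); real prices vary by provider, region, and instance type.

```python
def monthly_cost(hourly_rate: float, hours_per_day: int, days_per_month: int) -> float:
    """Compute cost for a cluster that only bills while it runs."""
    return hourly_rate * hours_per_day * days_per_month


# Hypothetical always-on warehouse cluster: 24 hours a day, 30 days.
always_on = monthly_cost(hourly_rate=10.0, hours_per_day=24, days_per_month=30)

# Hypothetical spot-priced cluster: business hours on working days only.
business_hours_spot = monthly_cost(hourly_rate=3.0, hours_per_day=8, days_per_month=22)

ratio = business_hours_spot / always_on
print(f"always-on: {always_on:.2f}")
print(f"business-hours spot: {business_hours_spot:.2f}")
print(f"ratio: {ratio:.1%}")
```

Under these assumptions the business-hours cluster costs well under a tenth of the always-on one. Storage charges are unchanged in both cases, because the data stays in S3 whether or not any cluster is running.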
