Data engine scaling refers to the strategies and mechanisms used to adjust the compute capacity of a query engine in response to changing workload demands. Getting scaling right is critical for lakehouse economics: over-provisioning wastes money on idle compute, while under-provisioning causes slow queries and degraded user experience.
Scale-Up vs. Scale-Out
- Scale-Up (Vertical Scaling): Replacing worker nodes with larger, more powerful machines (more CPU cores, more RAM, faster NVMe SSDs). Effective for query patterns that benefit from having more CPU and memory per node, such as complex join operations that spill to disk on smaller nodes. Limited by the maximum available instance size in the cloud provider's catalog.
- Scale-Out (Horizontal Scaling): Adding more worker nodes of the same size to the cluster. Effective for increasing parallel scan throughput on large Iceberg tables, since more nodes can read more Parquet files simultaneously. Scales linearly for I/O-bound workloads.
Auto-Scaling
Modern lakehouse platforms implement auto-scaling that continuously monitors query queue depth, CPU utilization, and memory pressure across the cluster, automatically adding or removing nodes to match current demand. Dremio Cloud, Databricks SQL Serverless, and Snowflake Virtual Warehouses all support auto-scaling policies. The key metric is "time to first byte": auto-scaling must provision new nodes faster than query queueing time becomes perceptible to users.
Iceberg's Role in Scaling Efficiency
Iceberg's partition metadata significantly improves the efficiency of horizontal scaling. The engine can distribute file-level scan tasks evenly across worker nodes using partition and file statistics, ensuring all nodes are equally busy during the scan phase. Without this metadata, naive file-level parallelism would produce skewed task distributions where some workers finish early while others are overloaded with large files.

