Lakehouse Cost Optimization

A data lakehouse can be dramatically cheaper than a proprietary cloud data warehouse, but only if it is actively managed. An unoptimized lakehouse accumulates small files, keeps historical data in expensive hot-tier storage, repeatedly runs full table scans to answer the same queries, and leaves compute clusters running during idle periods. Lakehouse cost optimization is the operational discipline of systematically identifying and eliminating these patterns.

Storage: Tiering Old Data

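Amazon S3 Standard costs roughly $0.023 per GB per month. S3 Glacier Instant Retrieval costs roughly $0.004 per GB per month. For a 100 TB lakehouse where 70% of the data is historical bronze-tier audit logs accessed less than once per quarter, moving that 70 TB to Glacier Instant Retrieval saves roughly $1,330 per month (70,000 GB x $0.019) with no impact on the frequently accessed gold-tier analytical tables. The saving holds only while the data stays cold: Glacier Instant Retrieval adds a per-GB retrieval charge and a 90-day minimum storage duration, so it suits data that is rarely read. S3 Lifecycle Rules automate these transitions based on object age, and because object keys do not change, the Iceberg metadata that points to those files needs no changes.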
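
The arithmetic above, and the shape of a matching lifecycle rule, can be sketched in Python. This is a minimal illustration, not a production script: the prices are the figures quoted in this section, the prefix "bronze/audit_logs/" and the 90-day threshold are hypothetical, and the dictionary is only built here, in the shape that boto3's put_bucket_lifecycle_configuration accepts. Nothing is sent to AWS.

```python
# Per-GB monthly prices quoted in this section; actual prices vary by region.
S3_STANDARD_PER_GB = 0.023
GLACIER_IR_PER_GB = 0.004


def monthly_tiering_savings(total_tb: float, cold_fraction: float) -> float:
    """Monthly savings in USD from moving the cold share of the data to Glacier IR."""
    cold_gb = total_tb * 1000 * cold_fraction  # decimal TB, as in the text
    return cold_gb * (S3_STANDARD_PER_GB - GLACIER_IR_PER_GB)


def glacier_lifecycle_rule(prefix: str, after_days: int) -> dict:
    """Build a lifecycle configuration dict that transitions objects under
    `prefix` to Glacier Instant Retrieval once they are `after_days` old."""
    rule_id = "tier-" + prefix.strip("/").replace("/", "-") + "-to-glacier-ir"
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": after_days, "StorageClass": "GLACIER_IR"}
                ],
            }
        ]
    }


savings = monthly_tiering_savings(total_tb=100, cold_fraction=0.70)
print(f"Estimated monthly savings: ${savings:,.0f}")

# Hypothetical prefix and age threshold for the bronze audit logs.
rule = glacier_lifecycle_rule(prefix="bronze/audit_logs/", after_days=90)
```

The estimate ignores retrieval and request charges, so it is an upper bound for data that is still read occasionally.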

Iceberg also supports snapshot expiration: removing old table snapshots, along with any data files that no remaining snapshot references, on a schedule. Running expire_snapshots after a defined retention window (typically 7-30 days, depending on time-travel requirements) reclaims storage consumed by versions of the data that are no longer needed for audit or rollback purposes.
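
As a sketch of how a scheduled job might apply a retention window, the helper below builds the Spark SQL CALL statement for Iceberg's expire_snapshots procedure. The catalog name "lake", the table name "bronze.audit_logs", and the helper itself are hypothetical; the statement would be passed to spark.sql(...) in a Spark session with the Iceberg extensions enabled, which is assumed rather than shown.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def expire_snapshots_sql(
    catalog: str,
    table: str,
    retention_days: int,
    retain_last: int = 1,
    now: Optional[datetime] = None,
) -> str:
    """Build a CALL statement that expires snapshots older than the window."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    return (
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', "
        f"older_than => TIMESTAMP '{cutoff:%Y-%m-%d %H:%M:%S}', "
        f"retain_last => {retain_last})"
    )


# Hypothetical catalog and table names; a fixed "now" keeps the example repeatable.
stmt = expire_snapshots_sql(
    catalog="lake",
    table="bronze.audit_logs",
    retention_days=7,
    now=datetime(2024, 6, 30, tzinfo=timezone.utc),
)
print(stmt)
# In a Spark job: spark.sql(stmt)
```

Keeping retain_last at 1 or higher means the table always has at least one snapshot to read, whatever the cutoff.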

Storage: File Compaction

Streaming ingestion pipelines frequently produce thousands of tiny Parquet files, sometimes just kilobytes each. Small files are expensive in two ways: each file requires at least one separate S3 GET request (S3 charges per request, not just per byte), and small files do not compress as efficiently as large ones. Iceberg's rewrite_data_files compaction procedure merges small files into larger ones, typically targeting 256 MB to 1 GB per file, which reduces both request counts and storage footprint. Scheduling compaction nightly on high-write tables is a standard production practice.
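
To give a sense of scale, the sketch below estimates how many files remain after compaction and builds the Spark SQL CALL for rewrite_data_files. The workload (50,000 files of 200 KB), the 512 MB target, and the catalog and table names are illustrative assumptions. The file count is a rough lower bound, since real compaction respects partition boundaries and file sizes shift with compression.

```python
import math


def files_after_compaction(total_bytes: int, target_file_bytes: int) -> int:
    """Rough lower bound on the file count once data is packed to the target size."""
    return max(1, math.ceil(total_bytes / target_file_bytes))


def rewrite_data_files_sql(
    catalog: str, table: str, target_file_bytes: int, min_input_files: int = 5
) -> str:
    """Build a CALL statement for Iceberg's rewrite_data_files procedure."""
    return (
        f"CALL {catalog}.system.rewrite_data_files("
        f"table => '{table}', "
        f"options => map("
        f"'target-file-size-bytes', '{target_file_bytes}', "
        f"'min-input-files', '{min_input_files}'))"
    )


# Hypothetical streaming table: 50,000 files of about 200 KB each.
small_files = 50_000
total_bytes = small_files * 200 * 1024
target = 512 * 1024 * 1024  # 512 MB target file size

compacted_files = files_after_compaction(total_bytes, target)
print(f"{small_files} files -> about {compacted_files} files")

compaction_stmt = rewrite_data_files_sql("lake", "bronze.events", target)
# In a Spark job: spark.sql(compaction_stmt)
```

A full scan of this table goes from tens of thousands of object reads to a few dozen, which is where the request-cost saving comes from.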

Compute: Reflections for Repeated Queries

Often the largest compute cost in a lakehouse comes from repeatedly scanning the same large datasets to answer the same questions. Dremio Reflections address this directly: they pre-compute the results of common query patterns and store them as optimized columnar materializations. When Dremio detects that a new query can be satisfied by an existing reflection, it serves the result from the reflection without touching the source Iceberg files. A daily revenue aggregation reflection that reduces a 500 GB scan to a 10 MB materialized result eliminates that 500 GB scan for every dashboard refresh and AI agent query that asks the same question.
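
The effect on scanned volume can be estimated with simple arithmetic. The sketch below uses the 500 GB and 10 MB figures from this section. The 200 refreshes per day and the single full rebuild of the reflection per day are assumptions made for illustration, and incremental refreshes would lower the rebuild cost further.

```python
def daily_scan_gb(
    queries_per_day: int,
    source_scan_gb: float,
    reflection_gb: float,
    rebuilds_per_day: int = 1,
) -> tuple:
    """Return (GB scanned per day without a reflection, GB scanned with one)."""
    without_reflection = queries_per_day * source_scan_gb
    # With a reflection, the source is scanned only to rebuild it; every
    # query then reads the small materialized result instead.
    with_reflection = (
        rebuilds_per_day * source_scan_gb + queries_per_day * reflection_gb
    )
    return without_reflection, with_reflection


# 500 GB source scan and 10 MB (0.01 GB) reflection, as in the text;
# 200 dashboard and agent queries per day is a hypothetical load.
before_gb, after_gb = daily_scan_gb(
    queries_per_day=200, source_scan_gb=500, reflection_gb=0.01
)
print(f"Without reflection: {before_gb:,.0f} GB/day")
print(f"With reflection:    {after_gb:,.0f} GB/day")
```

The more often the same question is asked, the more a single rebuild is amortized, so reflections pay off most on hot dashboards and repeated agent queries.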

Compute: Autoscaling and Idle Timeout

A compute cluster that is running but receiving no queries is pure waste. Dremio Cloud supports autoscaling engines that scale down to zero when no queries are active and scale back up automatically when demand returns. Setting idle timeout policies (15-30 minutes for non-critical workloads) and configuring autoscaling groups can reduce compute costs by 60-80% for workloads with uneven utilization patterns. That category includes most AI agent sessions, which run intensively for short periods and then go quiet.
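
How much an idle timeout saves depends on the shape of the workload. The toy model below is a sketch under simplifying assumptions, not a description of how Dremio bills: the engine starts instantly, every query takes the same number of minutes, and the engine stops once it has been idle for the full timeout. The query trace is hypothetical.

```python
def billed_minutes(query_starts, query_minutes, idle_timeout):
    """Minutes an engine stays up if it stops after `idle_timeout` idle minutes.

    `query_starts` are minutes since midnight at which queries arrive.
    """
    billed = 0
    up_until = None  # minute at which the running engine would shut down
    for start in sorted(query_starts):
        end = start + query_minutes
        if up_until is None or start > up_until:
            # Engine was stopped: a new uptime window begins with this query.
            billed += (end - start) + idle_timeout
            up_until = end + idle_timeout
        else:
            # Engine is still up: the query pushes the shutdown time later.
            new_until = max(up_until, end + idle_timeout)
            billed += new_until - up_until
            up_until = new_until
    return billed


# Hypothetical agent session: two short bursts, at 09:00 and at 15:00.
trace = [540, 545, 550, 900, 905]
with_timeout = billed_minutes(trace, query_minutes=2, idle_timeout=15)
always_on = 24 * 60

print(f"Always on:         {always_on} minutes")
print(f"15-minute timeout: {with_timeout} minutes")
```

Bursty traces like this one leave the engine stopped for most of the day. A steadier query stream narrows the gap, so the model is worth running against real query logs before choosing a timeout.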
