S3 Data Lake

Amazon S3 (Simple Storage Service) is one of the most widely deployed storage backends for data lakes. Its combination of effectively unlimited capacity, eleven nines (99.999999999%) of designed durability, and pricing on the order of cents per GB per month made it the de facto home for enterprise data that doesn't fit in a traditional database. Building a well-structured S3 Data Lake is the first step toward a production Data Lakehouse.

Bucket and Prefix Design

S3 organizes objects using buckets and key prefixes. A well-designed S3 Data Lake separates data by domain and processing tier. A common pattern uses a single bucket with prefixes that mirror the Medallion Architecture:

s3://company-lakehouse/bronze/ - raw ingested data, partitioned by source and date
s3://company-lakehouse/silver/ - cleaned and standardized Parquet, managed by Iceberg
s3://company-lakehouse/gold/ - business-domain aggregations and AI-ready feature tables
s3://company-lakehouse/metadata/ - Iceberg metadata files (table metadata, manifest lists, and manifests)
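
The layout above can be captured in a few helper functions so that every pipeline builds keys the same way. This is a minimal sketch, not a prescribed convention: the bucket name comes from the example above, while the `source=` and `ingest_date=` partition names and both function names are assumptions made for illustration.

```python
from datetime import date

BUCKET = "company-lakehouse"  # bucket name from the example layout
TIERS = ("bronze", "silver", "gold", "metadata")


def tier_prefix(tier: str) -> str:
    """Return the s3:// URI of a tier prefix, rejecting unknown tiers."""
    if tier not in TIERS:
        raise ValueError(f"unknown tier: {tier!r}")
    return f"s3://{BUCKET}/{tier}/"


def bronze_key(source: str, ingest_date: date, filename: str) -> str:
    """Build an object key (without the bucket) for a raw bronze file.

    The key is partitioned by source system and ingestion date; the
    partition names 'source' and 'ingest_date' are hypothetical.
    """
    return (
        f"bronze/source={source}/"
        f"ingest_date={ingest_date.isoformat()}/{filename}"
    )


key = bronze_key("crm", date(2024, 1, 15), "part-0000.json")
print(tier_prefix("gold"))
print(key)
```

Centralizing key construction like this keeps the tier boundaries predictable, which matters later because access policies are written against those same prefixes.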

IAM and Access Control

AWS Identity and Access Management (IAM) policies govern which principals (users, roles, AI agent task execution roles) can read or write specific S3 prefixes. In a lakehouse context, IAM policies should be organized by data tier: the AI agent's IAM role gets read access to gold-tier prefixes and no write permissions. ETL pipeline roles get write access to bronze and silver tiers. Data stewards get write access to the metadata prefix for catalog operations. These IAM boundaries complement (but do not replace) the row-level and column-level security enforced by the query engine and catalog.
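
The tier-based split described above can be sketched as a small policy generator. The code below only builds IAM policy documents as Python dictionaries; it does not call AWS. The bucket name is taken from the earlier example, and the helper name `prefix_policy`, the `Sid` values, and the choice of `s3:PutObject` as the only write action are assumptions. Real pipelines often need additional actions, such as `s3:DeleteObject` for table maintenance.

```python
import json

BUCKET = "company-lakehouse"  # bucket name from the example layout


def prefix_policy(prefixes, writable=False):
    """Build an IAM policy document scoped to the given tier prefixes.

    Read access is always granted; write access (PutObject only, as an
    illustrative assumption) is added when writable=True.
    """
    object_actions = ["s3:GetObject"]
    if writable:
        object_actions.append("s3:PutObject")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so it is restricted
                # to the allowed prefixes with a condition key.
                "Sid": "ListTierPrefixes",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{BUCKET}",
                "Condition": {
                    "StringLike": {"s3:prefix": [f"{p}/*" for p in prefixes]}
                },
            },
            {
                "Sid": "ObjectAccess",
                "Effect": "Allow",
                "Action": object_actions,
                "Resource": [f"arn:aws:s3:::{BUCKET}/{p}/*" for p in prefixes],
            },
        ],
    }


agent_policy = prefix_policy(["gold"])                        # read-only
etl_policy = prefix_policy(["bronze", "silver"], writable=True)
steward_policy = prefix_policy(["metadata"], writable=True)

print(json.dumps(agent_policy, indent=2))
```

Generating the three policies from one function makes it harder for the tiers to drift apart, and the resulting JSON can be attached to roles with whatever provisioning tool the team already uses.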

The Upgrade Path: From S3 Data Lake to Iceberg Lakehouse

Many organizations already have data in S3 in Parquet format. Migrating to an Apache Iceberg Lakehouse does not require rewriting those data files. Iceberg can register existing Parquet files into an Iceberg table via a metadata-only operation that creates the manifest and manifest list files pointing to the existing Parquet objects. After registration, the S3 files are queryable as a governed Iceberg table with schema evolution, access control, and time travel across the snapshots created from that point on, without a single byte of data being copied.
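
One way to perform this registration, assuming Spark is the engine and the Iceberg runtime with its SQL extensions is configured, is Iceberg's `add_files` Spark procedure. The sketch below only builds the SQL statement; the catalog name `lakehouse`, the table `silver.orders`, and the helper name are hypothetical, and the target Iceberg table is assumed to already exist with a schema compatible with the Parquet files. Depending on the file system configuration, the path scheme may need to be `s3a://` rather than `s3://`.

```python
def build_add_files_sql(catalog: str, table: str, parquet_path: str) -> str:
    """Build a CALL statement for Iceberg's add_files Spark procedure.

    The procedure registers the Parquet files under parquet_path in the
    existing Iceberg table without copying or rewriting them.
    """
    return (
        f"CALL {catalog}.system.add_files("
        f"table => '{table}', "
        f"source_table => '`parquet`.`{parquet_path}`')"
    )


sql = build_add_files_sql(
    catalog="lakehouse",        # hypothetical catalog name
    table="silver.orders",      # hypothetical target table
    parquet_path="s3://company-lakehouse/silver/orders/",
)
print(sql)

# With a SparkSession configured for Iceberg, the statement would be run as:
#     spark.sql(sql)
```

Because the procedure only writes metadata, it is worth validating the schema and partitioning of the existing files first; a mismatch will surface at query time rather than at registration.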
