Data modeling is the process of defining how data is organized, structured, and related within a database or lakehouse. The right modeling approach determines how easily different types of queries can be answered, how maintainable the schema is as business requirements evolve, and how efficiently storage is used. Modern lakehouses typically employ different modeling approaches at different layers of the data stack.
The Medallion Architecture Layers
Most Iceberg lakehouse teams organize their data in a medallion (multi-tier) architecture where modeling approach evolves from raw to refined:
- Bronze (Raw): Data is stored as-is from the source system with minimal transformation. Schema mirrors the source schema exactly. No modeling applied. Purpose is auditability and reprocessing capability.
- Silver (Cleansed): Data is cleaned, validated, deduplicated, and type-corrected. Schema may be partially normalized. CDC merges are applied here. Source-aligned but analysis-ready.
- Gold (Analytical): Data is modeled for specific analytical use cases using dimensional (Kimball), Data Vault, or wide-table patterns. This layer serves BI tools, dashboards, and AI agents directly.
Choosing the Right Model
- Dimensional Modeling (Kimball): Best for BI dashboards and OLAP queries. Star schemas with fact and dimension tables provide intuitive, join-efficient structures that BI tools auto-understand.
- Data Vault 2.0: Best for audit-heavy, enterprise data warehouses where every source system change must be traceable. Separates descriptive attributes (satellites) from relationships (links) and identifiers (hubs).
- Wide Tables / Feature Stores: Best for ML feature engineering and AI agents. Denormalized, prejoined tables with hundreds of feature columns eliminate join complexity in training pipelines and agentic SQL generation.

