Data quality rules are the specific, testable expressions that define what acceptable data looks like. Unlike the general concept of a data quality framework (the process and tooling), data quality rules are the concrete assertions: 'the revenue column must be greater than zero', 'order_id must be unique', 'event_time must be within the last 48 hours'. Rules translate business requirements into machine-enforceable checks.
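To make the three example assertions above concrete, here is a minimal sketch in plain Python. It assumes rows arrive as dictionaries with `order_id`, `revenue`, and `event_time` keys; the function names and sample data are hypothetical and not taken from any particular tool.

```python
from datetime import datetime, timedelta, timezone


def revenue_is_positive(rows):
    """Every revenue value must be greater than zero."""
    return all(row["revenue"] > 0 for row in rows)


def order_id_is_unique(rows):
    """No order_id may appear more than once."""
    ids = [row["order_id"] for row in rows]
    return len(ids) == len(set(ids))


def event_time_is_recent(rows, now, max_age=timedelta(hours=48)):
    """Every event_time must be no older than max_age relative to now.

    Future timestamps are not rejected in this sketch.
    """
    return all(now - row["event_time"] <= max_age for row in rows)


def evaluate_rules(rows, now):
    """Run each rule and report pass/fail by rule name."""
    return {
        "revenue_is_positive": revenue_is_positive(rows),
        "order_id_is_unique": order_id_is_unique(rows),
        "event_time_is_recent": event_time_is_recent(rows, now),
    }


# Hypothetical sample batch: the third row duplicates order_id 2 and is 72 hours old.
now = datetime(2024, 1, 3, 12, 0, tzinfo=timezone.utc)
sample_rows = [
    {"order_id": 1, "revenue": 19.99, "event_time": now - timedelta(hours=2)},
    {"order_id": 2, "revenue": 5.00, "event_time": now - timedelta(hours=30)},
    {"order_id": 2, "revenue": 12.50, "event_time": now - timedelta(hours=72)},
]
results = evaluate_rules(sample_rows, now)
```

Each rule is a pure function that returns a boolean, so the same definitions can be reused across batches and the results logged or used to block a downstream publish step.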

Common Rule Categories

Data quality rules in lakehouse environments fall into six standard categories. Completeness rules require that mandatory columns are non-null and that row counts fall within an expected range. Validity rules require that values match expected patterns, enums, or ranges. Uniqueness rules forbid duplicate primary keys within a partition or table. Referential integrity rules require that every foreign key value exists in the referenced dimension table. Timeliness rules check data freshness: the most recent event_time must fall within the expected ingestion window. Distribution rules apply statistical anomaly detection: a column's mean or standard deviation must not deviate by more than three standard deviations (3 sigma) from its historical baseline. These rules are typically implemented as dbt tests, Great Expectations expectations, or Soda checks, and are usually scheduled to run automatically after each Iceberg table write.
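As a rough illustration of several of these categories, the sketch below expresses completeness, validity, referential integrity, and distribution checks using only the Python standard library. It is not how dbt, Great Expectations, or Soda implement them; the function names, parameters, and sample values are all hypothetical. The distribution check assumes the baseline is a list of historical daily means for the column.

```python
import statistics


def check_completeness(rows, required_columns, min_rows, max_rows):
    """Row count must be in range and required columns must be non-null."""
    if not (min_rows <= len(rows) <= max_rows):
        return False
    return all(row.get(col) is not None for row in rows for col in required_columns)


def check_validity(rows, column, allowed_values):
    """Every value in the column must belong to the allowed enum."""
    return all(row[column] in allowed_values for row in rows)


def check_referential_integrity(rows, foreign_key, dimension_keys):
    """Every foreign key value must exist in the referenced dimension."""
    return all(row[foreign_key] in dimension_keys for row in rows)


def check_distribution(current_mean, historical_means, max_sigma=3.0):
    """Current mean must lie within max_sigma standard deviations of the baseline."""
    baseline = statistics.mean(historical_means)
    sigma = statistics.stdev(historical_means)  # sample standard deviation
    if sigma == 0:
        return current_mean == baseline
    return abs(current_mean - baseline) <= max_sigma * sigma


# Hypothetical batch and reference data.
orders = [
    {"order_id": 1, "customer_id": "c1", "status": "shipped"},
    {"order_id": 2, "customer_id": "c2", "status": "pending"},
    {"order_id": 3, "customer_id": "c9", "status": "pending"},
]
known_customers = {"c1", "c2", "c3"}
historical_daily_means = [100.0, 102.0, 98.0, 101.0, 99.0]

complete = check_completeness(orders, ["order_id", "customer_id"], 1, 1000)
valid = check_validity(orders, "status", {"pending", "shipped", "cancelled"})
referential = check_referential_integrity(orders, "customer_id", known_customers)
within_baseline = check_distribution(103.0, historical_daily_means)
anomalous = check_distribution(110.0, historical_daily_means)
```

In this sample the referential integrity check fails because customer `c9` is missing from the dimension set, and a daily mean of 110.0 falls outside the three-sigma band around the baseline, so the second distribution check fails as well.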
