Data quality rules are the specific, testable expressions that define what acceptable data looks like. Unlike the general concept of a data quality framework (the process and tooling), data quality rules are the concrete assertions: 'the revenue column must be greater than zero', 'order_id must be unique', 'event_time must be within the last 48 hours'. Rules translate business requirements into machine-enforceable checks.
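As a minimal sketch, the three example rules above can be written as plain Python predicates over a list of row dictionaries. The row layout, the function names, and the fixed NOW timestamp are assumptions made for illustration; they do not come from any particular framework.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical reference time and sample rows; column names follow the text.
NOW = datetime(2024, 1, 3, 12, 0, tzinfo=timezone.utc)
rows = [
    {"order_id": 1, "revenue": 19.99, "event_time": NOW - timedelta(hours=2)},
    {"order_id": 2, "revenue": 5.00, "event_time": NOW - timedelta(hours=30)},
    {"order_id": 2, "revenue": -1.00, "event_time": NOW - timedelta(hours=72)},
]


def revenue_positive(rows):
    """'the revenue column must be greater than zero'"""
    return all(r["revenue"] > 0 for r in rows)


def order_id_unique(rows):
    """'order_id must be unique'"""
    ids = [r["order_id"] for r in rows]
    return len(ids) == len(set(ids))


def event_time_recent(rows, now, max_age=timedelta(hours=48)):
    """'event_time must be within the last 48 hours'"""
    return all(now - r["event_time"] <= max_age for r in rows)


def evaluate(rows, now):
    """Run every rule and return a rule-name -> passed mapping."""
    return {
        "revenue_positive": revenue_positive(rows),
        "order_id_unique": order_id_unique(rows),
        "event_time_recent": event_time_recent(rows, now),
    }


results = evaluate(rows, NOW)
```

Each rule returns a plain boolean, so a pipeline can decide separately whether a failure blocks the load or only raises an alert.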
Common Rule Categories
Data quality rules in lakehouse environments fall into six common categories. Completeness rules require that mandatory columns are non-null and that row counts fall within an expected range. Validity rules require that values match expected patterns, enumerations, or ranges. Uniqueness rules prohibit duplicate primary keys within a partition or table. Referential integrity rules require that every foreign key value exists in the referenced dimension table. Timeliness rules check data freshness: the most recent event_time must fall within the expected ingestion window. Distribution rules apply statistical anomaly detection: a column's mean or standard deviation must not deviate from its historical baseline by more than three standard deviations (3 sigma). These rules are typically implemented as dbt tests, Great Expectations expectations, or Soda checks, and are usually triggered automatically after each Iceberg table write.
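The categories above can be sketched as one small function per category. This is an illustrative outline in plain Python, not the API of dbt, Great Expectations, or Soda, which express the same checks declaratively. The table contents, column names, thresholds, and the baseline series are all hypothetical.

```python
import statistics
from datetime import datetime, timedelta, timezone


def check_completeness(rows, required, min_rows, max_rows):
    """Required columns are non-null and the row count is within range."""
    in_range = min_rows <= len(rows) <= max_rows
    non_null = all(r.get(c) is not None for r in rows for c in required)
    return in_range and non_null


def check_validity(rows, column, allowed):
    """Every value in the column belongs to the allowed enumeration."""
    return all(r[column] in allowed for r in rows)


def check_uniqueness(rows, key):
    """No duplicate values in the key column."""
    keys = [r[key] for r in rows]
    return len(keys) == len(set(keys))


def check_referential_integrity(rows, fk, dim_keys):
    """Every foreign key value exists in the referenced dimension."""
    return all(r[fk] in dim_keys for r in rows)


def check_timeliness(rows, now, window):
    """The most recent event_time falls within the ingestion window."""
    latest = max(r["event_time"] for r in rows)
    return now - latest <= window


def check_distribution(values, baseline_means, max_sigma=3.0):
    """The current mean stays within max_sigma of the historical means."""
    mu = statistics.mean(baseline_means)
    sigma = statistics.stdev(baseline_means)
    current = statistics.mean(values)
    if sigma == 0:
        return current == mu
    return abs(current - mu) <= max_sigma * sigma


# Hypothetical sample data for illustration only.
NOW = datetime(2024, 1, 3, 12, 0, tzinfo=timezone.utc)
orders = [
    {"order_id": 1, "customer_id": "c1", "status": "paid",
     "revenue": 20.0, "event_time": NOW - timedelta(hours=1)},
    {"order_id": 2, "customer_id": "c2", "status": "shipped",
     "revenue": 22.0, "event_time": NOW - timedelta(hours=5)},
    {"order_id": 3, "customer_id": "c9", "status": "paid",
     "revenue": None, "event_time": NOW - timedelta(hours=7)},
]
customer_ids = {"c1", "c2", "c3"}          # keys of the dimension table
baseline = [20.0, 21.0, 19.0, 20.5, 19.5]  # past daily mean revenue
revenue = [r["revenue"] for r in orders if r["revenue"] is not None]

report = {
    "completeness": check_completeness(orders, ["order_id", "revenue"], 1, 1000),
    "validity": check_validity(orders, "status", {"paid", "shipped", "refunded"}),
    "uniqueness": check_uniqueness(orders, "order_id"),
    "referential_integrity": check_referential_integrity(
        orders, "customer_id", customer_ids),
    "timeliness": check_timeliness(orders, NOW, timedelta(hours=48)),
    "distribution": check_distribution(revenue, baseline),
}
```

In this sample the completeness rule fails because one revenue value is null, and the referential integrity rule fails because customer c9 is missing from the dimension keys; the other four rules pass.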

