Data masking is the practice of replacing, encrypting, or partially obscuring sensitive values so that unauthorized users see a substitute representation rather than the real value. Unlike column-level security (which hides a column entirely), data masking keeps the column in query results but returns non-identifying values. This lets analysts study data distributions, run statistical analyses, and build models on realistic data structures without being exposed to raw PII or confidential values.
Masking Techniques
- Static Masking: A permanent transformation applied to data at rest. The original PII is replaced with fictitious but realistic values in a development or testing copy of the dataset. It is typically used to give developers safe access to production-like data in test environments.
- Dynamic Masking: Applied at query time. The underlying Iceberg Parquet files contain the real values. When an unauthorized user queries the data, the query engine substitutes masked values in the result. Authorized users (HR, compliance officers) receive the real values. This is the preferred approach for production lakehouse environments because only one copy of data is maintained.
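- Format-Preserving Masking: The masked value keeps the length and delimiter layout of the original (e.g., a masked SSN still follows the SSN pattern: '***-**-4321'). This matters for BI tools that validate data formats. Preserving referential integrity in analytical queries additionally requires the masking to be deterministic, so that the same input always maps to the same masked value and join keys still match.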
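The techniques above can be illustrated with a minimal Python sketch. It is not taken from any particular masking product: the function names (`mask_ssn`, `pseudonymize_digits`, `static_mask_copy`), the `ssn` column name, and the use of a keyed HMAC to derive replacement digits are all assumptions made for illustration. The digit substitution is a simple deterministic pseudonymization, not standardized format-preserving encryption, and it is not reversible.

```python
import hashlib
import hmac
from typing import Dict, List


def mask_ssn(ssn: str) -> str:
    """Partial mask: keep the layout and the last four digits, hide the rest."""
    total_digits = sum(ch.isdigit() for ch in ssn)
    digits_seen = 0
    out = []
    for ch in ssn:
        if ch.isdigit():
            digits_seen += 1
            # Only the final four digits stay visible.
            out.append(ch if digits_seen > total_digits - 4 else "*")
        else:
            out.append(ch)  # delimiters such as '-' pass through unchanged
    return "".join(out)


def pseudonymize_digits(value: str, key: bytes) -> str:
    """Deterministically replace each digit, keeping length and delimiters.

    The same (value, key) pair always yields the same output, so masked
    values can still be used as join keys. Illustrative only.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    out = []
    i = 0
    for ch in value:
        if ch.isdigit():
            out.append(str(digest[i % len(digest)] % 10))
            i += 1
        else:
            out.append(ch)
    return "".join(out)


def static_mask_copy(rows: List[Dict[str, str]], key: bytes) -> List[Dict[str, str]]:
    """Static masking: build a masked copy; the input rows are left untouched."""
    return [{**row, "ssn": pseudonymize_digits(row["ssn"], key)} for row in rows]


print(mask_ssn("123-45-4321"))  # ***-**-4321
```

In this sketch, `mask_ssn` is suited to display, while `pseudonymize_digits` produces digit-only output that a format validator is more likely to accept. Which one fits depends on whether consumers need a readable hint of the original or a value that behaves like real data.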
Masking and AI Systems
Data masking is an important control for responsible AI development. When AI models are trained on lakehouse data, masking helps keep raw PII out of training datasets. With dynamic masking applied at the query engine level, an AI training pipeline that reads through the engine receives masked values regardless of what SQL it executes. This provides a strong enforcement boundary that does not depend on data scientists manually de-identifying data before training. Because the underlying files still contain the real values, the boundary holds only as long as the pipeline cannot read those files directly.
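The enforcement boundary described here can be sketched as a toy read path in Python. This is a simplified model, not how any specific query engine implements masking policies: the role names, the `MASKING_POLICY` mapping, and the `query` function are hypothetical. The point it illustrates is that the stored rows hold real values, and the masking decision is made once, at read time, based on the caller's role.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]

# Hypothetical roles that are allowed to see real values.
AUTHORIZED_ROLES = {"hr", "compliance"}


def keep_last_four(value: str) -> str:
    """Replace every digit except the last four with '*'."""
    total = sum(ch.isdigit() for ch in value)
    seen = 0
    out = []
    for ch in value:
        if ch.isdigit():
            seen += 1
            out.append(ch if seen > total - 4 else "*")
        else:
            out.append(ch)
    return "".join(out)


# Hypothetical policy: column name -> masking function.
MASKING_POLICY: Dict[str, Callable[[str], str]] = {
    "ssn": keep_last_four,
    "email": lambda v: "***@" + v.split("@")[-1],
}


def query(stored_rows: List[Row], role: str) -> List[Row]:
    """Return rows as the given role is allowed to see them."""
    if role in AUTHORIZED_ROLES:
        return [dict(row) for row in stored_rows]
    # Everyone else, including a training pipeline, gets masked values
    # for every column that has a policy; other columns pass through.
    return [
        {col: MASKING_POLICY.get(col, lambda v: v)(val) for col, val in row.items()}
        for row in stored_rows
    ]


stored = [{"id": "1", "ssn": "123-45-4321", "email": "ana@example.com"}]
print(query(stored, "ml_training"))
# [{'id': '1', 'ssn': '***-**-4321', 'email': '***@example.com'}]
```

Because the caller never chooses whether masking applies, the training job cannot opt out by changing its query. The stored rows themselves are unmodified, which is why direct file access has to be restricted separately.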

