Data masking is the practice of replacing, encrypting, or partially obscuring sensitive values so that unauthorized users see a substitute representation rather than the real value. Unlike column-level security, which hides a column entirely, data masking keeps the column in query results but returns values that do not identify anyone. Analysts can still study data distributions, run statistical analyses, and build models on realistically structured data without being exposed to raw PII or confidential values.
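As a rough illustration of the three approaches named above (replacing, encrypting, partially obscuring), here is a minimal Python sketch. The function names and the `SECRET_KEY` value are hypothetical and chosen only for this example. A keyed hash (HMAC-SHA256) stands in for encryption because the standard library has no symmetric cipher; unlike real encryption, it cannot be reversed.

```python
import hashlib
import hmac

# Hypothetical key for illustration only; a real system would load it from a secret store.
SECRET_KEY = b"demo-key-not-for-production"


def mask_replace(value: str, substitute: str = "REDACTED") -> str:
    """Replace the whole value with a fixed substitute."""
    return substitute


def mask_tokenize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a deterministic keyed hash of the value, truncated to 16 hex characters.

    The same input always maps to the same token, so joins and group-bys
    still work, but the original value cannot be read back from the token.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def mask_partial(value: str, visible: int = 4, fill: str = "*") -> str:
    """Keep only the last `visible` characters and obscure the rest."""
    if len(value) <= visible:
        return fill * len(value)
    return fill * (len(value) - visible) + value[-visible:]


if __name__ == "__main__":
    ssn = "123-45-6789"
    print(mask_replace(ssn))
    print(mask_tokenize(ssn))
    print(mask_partial(ssn))
```

Which variant fits depends on what the consumer needs: a fixed substitute reveals nothing, a deterministic token preserves equality between rows, and partial masking keeps enough of the value for a human to recognize a record.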

Masking Techniques

Masking and AI Systems

Data masking is an important control for responsible AI development. When AI models are trained on lakehouse data, masking keeps raw PII out of the training datasets. With dynamic masking applied at the query engine level, the AI training pipeline receives masked values regardless of what SQL it executes. The enforcement boundary sits in the engine itself, so it does not depend on data scientists manually de-identifying data before training.
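The idea of dynamic masking can be sketched in a few lines of Python. This is a toy model, not how any particular query engine implements it: `read_rows`, `POLICIES`, `UNMASKED_ROLES`, and the role names are all hypothetical. It shows only the key property, that masking is applied at read time based on who is asking, while the stored data stays unchanged.

```python
from typing import Callable, Dict, Iterable, Iterator, Set

Row = Dict[str, str]
MaskFn = Callable[[str], str]


def mask_email(value: str) -> str:
    """Keep the first character of the local part and the full domain."""
    local, sep, domain = value.partition("@")
    if not sep:
        return "*" * len(value)
    return local[:1] + "***@" + domain


def mask_ssn(value: str) -> str:
    """Keep only the last four characters."""
    return "***-**-" + value[-4:]


# Hypothetical policy table: column name -> masking function.
POLICIES: Dict[str, MaskFn] = {"email": mask_email, "ssn": mask_ssn}

# Hypothetical set of roles allowed to see raw values.
UNMASKED_ROLES: Set[str] = {"compliance_officer"}


def read_rows(rows: Iterable[Row], role: str) -> Iterator[Row]:
    """Yield copies of rows, masking policy columns unless the role is exempt."""
    for row in rows:
        if role in UNMASKED_ROLES:
            yield dict(row)
        else:
            yield {
                col: POLICIES[col](val) if col in POLICIES else val
                for col, val in row.items()
            }


if __name__ == "__main__":
    table = [{"id": "1", "email": "ana@example.com", "ssn": "123-45-6789"}]
    for masked in read_rows(table, role="ml_training"):
        print(masked)
```

Because the policy lives in the read path rather than in each consumer, a training job running under the `ml_training` role cannot bypass it by writing a different query.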
