Personally Identifiable Information (PII) protection is the set of practices, policies, and technical controls that prevent unauthorized access to, disclosure of, or improper use of information that can be used to identify a specific individual. In data lakehouses, PII protection is both a regulatory requirement (GDPR, CCPA, HIPAA) and a fundamental trust obligation to data subjects.

The GDPR Right to Erasure and Iceberg

One of the most technically challenging PII requirements is the GDPR "right to erasure" (also known as the "right to be forgotten"), which requires that personal data about an individual be deleted upon their request, subject to limited exceptions. Traditional data lakes built on immutable Parquet files in S3 made this extremely difficult: how do you delete a specific person's data from thousands of Parquet files containing billions of rows?

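Apache Iceberg's row-level delete support (available from format version 2, using position delete files and equality delete files) addresses this problem. A targeted DELETE FROM orders WHERE customer_id = '12345' executes as an ACID transaction, creating delete files that logically remove the affected rows. Subsequent compaction operations physically rewrite the Parquet files without those rows. The erasure is complete only once the older snapshots that still reference the original data files have been expired and those files removed from storage; until then, the rows remain reachable through time travel. Iceberg V3's row lineage feature, which assigns each row an ID and records the sequence number of its last change, can make it easier to track and audit the rows affected by such deletes.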
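The delete-then-compact sequence described above can be sketched as a small Python helper that builds the Spark SQL statements for one erasure request. This is a minimal sketch, not a definitive implementation: the helper name erasure_statements, the catalog name lake, and the table sales.orders are hypothetical, and it assumes a Spark session with the Iceberg SQL extensions enabled so that the rewrite_data_files and expire_snapshots procedures are available. The final step reflects the fact that rewritten data files stay referenced by older snapshots until those snapshots are expired.

```python
import re
from datetime import datetime

# Conservative allow-list for the key value, so it can be embedded in a SQL
# literal without escaping. This is an assumption of the sketch; widen it only
# together with proper parameter binding.
_SAFE_VALUE = re.compile(r"[A-Za-z0-9_-]+")


def erasure_statements(catalog, table, key_column, key_value, expire_before):
    """Return the Spark SQL statements for one erasure request, in run order.

    expire_before should be a point in time AFTER the compaction commit and
    inside your erasure deadline; snapshots older than it are expired.
    """
    value = str(key_value)
    if not _SAFE_VALUE.fullmatch(value):
        raise ValueError(f"unsafe key value: {value!r}")
    cutoff = expire_before.strftime("%Y-%m-%d %H:%M:%S")
    return [
        # 1. Logical removal: one ACID commit that hides the matching rows.
        f"DELETE FROM {catalog}.{table} WHERE {key_column} = '{value}'",
        # 2. Compaction: rewrite every data file that has a delete file attached.
        f"CALL {catalog}.system.rewrite_data_files("
        f"table => '{table}', options => map('delete-file-threshold', '1'))",
        # 3. Expire older snapshots so files still holding the rows can be removed.
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{cutoff}', retain_last => 1)",
    ]
```

Each returned string would be passed to spark.sql(...) in order. Building the statements separately from executing them keeps the sequence easy to log for audit purposes, which matters when an erasure request has to be evidenced later.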

PII Classification and Discovery

Effective PII protection begins with knowing where PII lives. Tools like Amazon Macie, Google Cloud DLP, and catalog-integrated discovery tools scan Iceberg table metadata and data samples to automatically classify columns as PII (name, email, phone, SSN, credit card), near-PII (date of birth, zip code, which can identify a person when combined with other fields), or non-sensitive. Once columns are classified, their tags in the catalog trigger automated masking policies, access controls, and audit logging for any access to PII-tagged columns.
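To make the classification step concrete, here is a minimal sketch of a sample-based column classifier in Python. It is not how Macie or Cloud DLP work internally; the patterns, the name hints, the 0.8 threshold, and the function names classify_column and classify_table are all assumptions chosen for illustration. Real discovery tools use far richer detectors.

```python
import re

# Hypothetical value detectors: a column is tagged PII when most of its
# sampled values fully match one of these patterns.
VALUE_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
    # US-style phone numbers with an optional country code.
    "phone": re.compile(r"(\+?\d{1,3}[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}"),
}

# Hypothetical column-name hints for near-PII fields.
NEAR_PII_NAME_HINTS = ("birth", "dob", "zip", "postal")


def classify_column(name, samples, threshold=0.8):
    """Return (sensitivity_class, matched_rule) for one column."""
    values = [str(v) for v in samples if v is not None]
    for tag, pattern in VALUE_PATTERNS.items():
        hits = sum(1 for v in values if pattern.fullmatch(v))
        if values and hits / len(values) >= threshold:
            return ("PII", tag)
    lowered = name.lower()
    for hint in NEAR_PII_NAME_HINTS:
        if hint in lowered:
            return ("near-PII", hint)
    return ("non-sensitive", None)


def classify_table(columns):
    """Classify every column in a {column_name: sample_values} mapping."""
    return {name: classify_column(name, vals) for name, vals in columns.items()}
```

The resulting mapping is the kind of output that would be written back to the catalog as column tags, which masking and access policies can then key on.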
