Spark SQL

Spark SQL is the Apache Spark module that provides a SQL interface for structured data processing. It lets developers and data engineers query distributed datasets in a dialect that largely follows ANSI SQL, and mix SQL with the DataFrame API in the same Spark application. For Apache Iceberg, Spark SQL is a primary interface for DDL operations, DML operations, and table maintenance procedures.

DDL Operations on Iceberg Tables

Spark SQL provides the standard CREATE, ALTER, and DROP syntax extended to support Iceberg-specific features. For example, you can create a partitioned Iceberg table with:

CREATE TABLE catalog.db.events (ts TIMESTAMP, user_id STRING) USING ICEBERG PARTITIONED BY (days(ts));

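Iceberg's partition evolution lets you change this partition strategy later without rewriting existing data: files already written keep their original layout, and only new writes use the updated partition spec. The change is made with ALTER TABLE ... ADD, DROP, or REPLACE PARTITION FIELD, which requires the Iceberg SQL extensions to be enabled in the Spark session.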
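As a minimal sketch of partition evolution on the table created above, the statements below add a bucket field and then switch from daily to hourly partitioning. They assume the Iceberg SQL extensions are enabled; the bucket count of 16 and the choice of hours(ts) are illustrative, not recommendations.

```sql
-- Assumes the Iceberg SQL extensions are enabled in the Spark session.
-- Each statement is a metadata-only change; existing data files are not rewritten.

-- Add a second partition field (the bucket count 16 is an arbitrary example).
ALTER TABLE catalog.db.events ADD PARTITION FIELD bucket(16, user_id);

-- Move from daily to hourly partitioning for data written from now on.
ALTER TABLE catalog.db.events REPLACE PARTITION FIELD days(ts) WITH hours(ts);
```

Data written before these changes stays in the old layout, so a single query may plan against files from more than one partition spec.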

DML and Time Travel

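Spark SQL supports INSERT INTO, INSERT OVERWRITE, UPDATE, DELETE, and MERGE INTO on Iceberg tables. Depending on the Spark and Iceberg versions in use, the row-level commands (UPDATE, DELETE, and MERGE INTO) may require the Iceberg SQL extensions. Spark SQL also exposes Iceberg's time travel capability in Spark 3.3 and later: SELECT * FROM catalog.db.events TIMESTAMP AS OF '2024-01-01' reads the table from the snapshot that was current at that time, so analysts can query historical states without duplicating any data.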
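The following sketch shows an upsert and two forms of time travel against the example table. The staging table catalog.db.events_updates, the join keys, and the snapshot ID are hypothetical placeholders; the staging table is assumed to have the same columns as the target.

```sql
-- Upsert rows from a hypothetical staging table with the same schema as the target.
MERGE INTO catalog.db.events AS t
USING catalog.db.events_updates AS s
ON t.user_id = s.user_id AND t.ts = s.ts
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- List the table's snapshots to find a snapshot ID or commit time to travel to.
SELECT snapshot_id, committed_at, operation
FROM catalog.db.events.snapshots;

-- Read the table as of a point in time.
SELECT * FROM catalog.db.events TIMESTAMP AS OF '2024-01-01 00:00:00';

-- Read the table as of a specific snapshot (placeholder ID; use one from the query above).
SELECT * FROM catalog.db.events VERSION AS OF 1234567890123456789;
```

Time travel only reaches snapshots that still exist, so expiring snapshots shortens the history that these queries can see.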

Iceberg Stored Procedures

One of the most useful Iceberg-specific features in Spark SQL is its set of stored procedures. They are invoked with the CALL statement, require the Iceberg SQL extensions in Spark 3, and cover the key maintenance operations:

CALL catalog.system.rewrite_data_files('db.table') for compaction
CALL catalog.system.expire_snapshots('db.table', TIMESTAMP '...') to clean old snapshots
CALL catalog.system.rollback_to_snapshot('db.table', snapshot_id) for instant data recovery

These procedures make Spark SQL a widely used administrative language for Apache Iceberg tables.
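The same procedures also accept named arguments, which tend to be easier to read in maintenance scripts. The sketch below assumes a catalog named catalog and the db.events table from the earlier example; the target file size, cutoff timestamp, retention count, and snapshot ID are illustrative values only.

```sql
-- Compact small files, aiming for roughly 512 MB data files (example value).
CALL catalog.system.rewrite_data_files(
  table => 'db.events',
  options => map('target-file-size-bytes', '536870912')
);

-- Expire snapshots older than the cutoff, but always keep the 5 most recent.
CALL catalog.system.expire_snapshots(
  table => 'db.events',
  older_than => TIMESTAMP '2024-01-01 00:00:00',
  retain_last => 5
);

-- Point the table back at an earlier snapshot (placeholder snapshot ID).
CALL catalog.system.rollback_to_snapshot(
  table => 'db.events',
  snapshot_id => 1234567890123456789
);
```

Positional and named arguments cannot be mixed within a single CALL, so a script should pick one style per statement.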
