PySpark is the official Python API for Apache Spark. Rather than writing Spark jobs in Scala (Spark's native language), data engineers and data scientists can author distributed data processing pipelines in familiar Python syntax. DataFrame and SQL operations written in Python are handed to Spark's JVM-based engine, which plans and executes them in parallel, so a single Python script can orchestrate computation across hundreds of nodes in a cluster.
PySpark and Apache Iceberg
Configuring PySpark to work with Apache Iceberg requires adding the Iceberg Spark runtime JAR to the Spark session and registering an Iceberg catalog in the session configuration. Once set up, engineers use the standard PySpark DataFrame API or Spark SQL to interact with Iceberg tables. Common patterns include:
- Reading historical data with spark.read.format("iceberg").option("snapshot-id", id).load("catalog.db.table")
- Performing schema evolution or partition evolution through Spark SQL DDL statements
- Running maintenance procedures via spark.sql("CALL catalog.system.rewrite_data_files(...)")
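The setup and the patterns listed above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a definitive configuration: the catalog name demo, the warehouse path, the Hadoop catalog type, and the runtime coordinates are placeholders that would need to match the reader's own Spark, Scala, and Iceberg versions, and all helper function names are hypothetical.

```python
# Sketch only: catalog name, warehouse path, catalog type, and runtime
# coordinates below are placeholder assumptions, not values from this article.

def iceberg_spark_conf(
    catalog="demo",
    warehouse="/tmp/iceberg-warehouse",
    runtime="org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2",
):
    """Return Spark settings that register an Iceberg catalog named `catalog`."""
    return {
        # Pulls the Iceberg Spark runtime JAR; must match your Spark/Scala versions.
        "spark.jars.packages": runtime,
        # Enables Iceberg SQL extensions (CALL procedures, partition DDL).
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        f"spark.sql.catalog.{catalog}": "org.apache.iceberg.spark.SparkCatalog",
        f"spark.sql.catalog.{catalog}.type": "hadoop",
        f"spark.sql.catalog.{catalog}.warehouse": warehouse,
    }


def build_session(conf):
    """Create a SparkSession from a dict of settings."""
    # Imported lazily so the pure helpers in this sketch work without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("iceberg-sketch")
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def read_snapshot(spark, table, snapshot_id):
    """Read the table as it existed at a given snapshot (time travel)."""
    return spark.read.format("iceberg").option("snapshot-id", snapshot_id).load(table)


def add_column_sql(table, column, sql_type):
    """Schema evolution: DDL that adds one column."""
    return f"ALTER TABLE {table} ADD COLUMNS ({column} {sql_type})"


def add_partition_field_sql(table, transform):
    """Partition evolution: DDL that adds a partition field (needs the SQL extensions)."""
    return f"ALTER TABLE {table} ADD PARTITION FIELD {transform}"


def compaction_sql(catalog, table):
    """Maintenance: statement that calls the rewrite_data_files procedure."""
    return f"CALL {catalog}.system.rewrite_data_files(table => '{table}')"


# Usage (requires PySpark and access to the runtime JAR; names are hypothetical):
#   spark = build_session(iceberg_spark_conf())
#   old = read_snapshot(spark, "demo.db.events", 1234567890)
#   spark.sql(add_column_sql("demo.db.events", "country", "string"))
#   spark.sql(add_partition_field_sql("demo.db.events", "days(event_ts)"))
#   spark.sql(compaction_sql("demo", "db.events"))
```

Keeping the statement builders as plain functions makes them easy to check without a running cluster. The interpolated identifiers should come from trusted input only, since they are pasted directly into SQL.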
PySpark in AI and ML Pipelines
PySpark's Python interface makes it a natural bridge between data engineering and machine learning. Data scientists can use PySpark to retrieve and transform large Iceberg datasets as training data, then pass those DataFrames to MLlib (Spark's built-in ML library) or convert them to pandas DataFrames for use with scikit-learn, XGBoost, or PyTorch. Because converting to pandas collects the data onto the driver, that path suits data that has already been filtered or aggregated down to a size that fits in one machine's memory. In agentic AI architectures, PySpark pipelines are often the mechanism that prepares large, structured training corpora stored in open Iceberg lakehouses.
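As a rough illustration of the hand-off from Spark to single-machine ML libraries, the sketch below projects and caps a Spark DataFrame before collecting it with toPandas(). The helper name sample_to_pandas, the column names, and the row cap are hypothetical choices for illustration; the right limit depends on the driver's memory.

```python
def sample_to_pandas(spark_df, columns, max_rows=100_000):
    """Select only the needed columns, cap the row count, then collect to pandas.

    toPandas() pulls every remaining row onto the driver, so the projection
    and limit are applied first, while the work is still distributed.
    """
    return spark_df.select(*columns).limit(max_rows).toPandas()


# Usage (requires PySpark, pandas, and scikit-learn; names are hypothetical):
#   features = spark.table("demo.db.events").where("event_date >= '2024-01-01'")
#   pdf = sample_to_pandas(features, ["amount", "duration", "label"])
#   from sklearn.linear_model import LogisticRegression
#   model = LogisticRegression().fit(pdf[["amount", "duration"]], pdf["label"])
```

A fixed limit() is the simplest cap; for training data, a random sample or an aggregation performed in Spark may be a better way to shrink the dataset before conversion.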
PyIceberg: The Alternative
For lightweight Iceberg operations that don't require full Spark cluster overhead, the PyIceberg library provides a pure-Python Iceberg client. PyIceberg allows Python developers to read, write, and manage Iceberg tables from a local machine or serverless function, making it a complementary tool in the Python data engineer's toolkit when Spark's full distributed power is not required.
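A Spark-free read with PyIceberg might look roughly like the sketch below. The catalog name, the REST catalog URI, the table identifier, and the filter are assumptions chosen for illustration, and the helper names are hypothetical; converting a scan to Arrow also requires pyarrow to be installed alongside PyIceberg.

```python
def load_table(catalog_name, identifier, **catalog_properties):
    """Load an Iceberg table through a PyIceberg catalog."""
    # Imported lazily so the scan helper below can be exercised without PyIceberg.
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog(catalog_name, **catalog_properties)
    return catalog.load_table(identifier)


def scan_to_arrow(table, row_filter, columns, limit=None):
    """Read a filtered, projected slice of the table into an Arrow table."""
    scan = table.scan(
        row_filter=row_filter,            # predicate, e.g. "amount > 100"
        selected_fields=tuple(columns),   # column projection
        limit=limit,                      # optional row cap
    )
    return scan.to_arrow()


# Usage (assumes a REST catalog at a placeholder URI; names are hypothetical):
#   table = load_table("default", "db.events", type="rest", uri="http://localhost:8181")
#   arrow_table = scan_to_arrow(table, "amount > 100", ["event_id", "amount"], limit=1000)
#   pdf = arrow_table.to_pandas()
```

Because the filter and projection are passed to the scan, only the matching files and columns need to be read, which is what keeps this approach practical on a laptop or in a serverless function.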

