Ray Data

Ray is an open-source distributed compute framework purpose-built for scaling Python machine learning and AI workloads. Ray Data is the data loading and preprocessing component of the Ray ecosystem, designed specifically to feed large-scale datasets into distributed training, hyperparameter tuning, and batch inference pipelines. In 2025, Ray Data introduced native, production-ready integration with Apache Iceberg.

Ray Data and Apache Iceberg

Ray Data's Iceberg integration is built on PyIceberg under the hood. Organizations can use:

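ray.data.read_iceberg(table_identifier="db.table", catalog_kwargs={...}) to lazily read an Iceberg table, with Ray splitting the scan into read tasks that run in parallel across the nodes of the cluster. The catalog itself is configured through catalog_kwargs, which are passed on to PyIceberg.
dataset.write_iceberg(table_identifier="db.table", catalog_kwargs={...}) to write distributed model output or processed features back into Iceberg as an append.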

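On the read side, read_iceberg supports predicate pushdown (through its row_filter argument) and column projection (through selected_fields). PyIceberg uses Iceberg's manifest-level statistics to skip data files that cannot match the filter, which minimizes the amount of data loaded across the network before the GPU training loop begins.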
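A minimal sketch of a filtered, projected read followed by an append is shown here. The catalog settings, table names, filter, and column names are all hypothetical placeholders, and the keyword names (table_identifier, row_filter, selected_fields, catalog_kwargs) should be checked against the Ray version you have installed. The write is assumed to target a table that already exists with a compatible schema.

```python
from typing import Any, Dict, Tuple

# Hypothetical catalog settings (assumption); replace with your catalog's values.
# They are forwarded to PyIceberg's load_catalog().
CATALOG_KWARGS: Dict[str, Any] = {
    "name": "default",
    "type": "rest",
    "uri": "http://localhost:8181",
}


def build_read_args(
    table: str, row_filter: str, columns: Tuple[str, ...]
) -> Dict[str, Any]:
    """Collect keyword arguments for ray.data.read_iceberg."""
    return {
        "table_identifier": table,
        "row_filter": row_filter,       # predicate handed to the Iceberg scan
        "selected_fields": columns,     # column projection
        # Pass a copy so the shared settings dict is never mutated.
        "catalog_kwargs": dict(CATALOG_KWARGS),
    }


def copy_filtered(
    source: str, target: str, row_filter: str, columns: Tuple[str, ...]
) -> None:
    """Read a filtered, projected slice of `source` and append it to `target`."""
    import ray  # imported lazily so build_read_args works without Ray installed

    ds = ray.data.read_iceberg(**build_read_args(source, row_filter, columns))
    ds.write_iceberg(table_identifier=target, catalog_kwargs=dict(CATALOG_KWARGS))
```

With a running catalog, a call such as copy_filtered("db.events", "db.events_recent", "event_date >= '2025-01-01'", ("user_id", "event_date")) would read only the matching files and columns before appending them to the target table.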

The AI Lakehouse Data Pipeline

In 2025 and 2026, a common Ray Data + Iceberg architecture is the end-to-end AI Lakehouse data pipeline:

Raw training data is curated and written to Iceberg tables using Apache Spark or Flink for ETL.
Ray Data reads those Iceberg tables and applies preprocessing steps (tokenization, normalization, feature engineering) on distributed CPU workers.
The preprocessed data is streamed directly into GPU workers running PyTorch or TensorFlow for distributed model training.
Inference results are written back to Iceberg tables, providing full data lineage from raw input through model output.
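The Ray side of the steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not a reference implementation: the tokenizer is a toy whitespace splitter, the "id" and "text" column names and the table names are hypothetical, the batch loop stands in for a real training or inference step, and the target table is assumed to exist already with a matching schema.

```python
from typing import Any, Dict


def tokenize_row(row: Dict[str, Any]) -> Dict[str, Any]:
    """Toy preprocessing step: lower-case the text and split on whitespace."""
    tokens = row["text"].lower().split()
    return {**row, "tokens": tokens, "num_tokens": len(tokens)}


def run_pipeline(
    source_table: str, target_table: str, catalog_kwargs: Dict[str, Any]
) -> None:
    """Read from Iceberg, preprocess on CPU workers, consume batches, write back."""
    import ray  # imported lazily so tokenize_row can be used without Ray installed

    # Lazily read the curated table (hypothetical "id" and "text" columns).
    ds = ray.data.read_iceberg(
        table_identifier=source_table,
        catalog_kwargs=dict(catalog_kwargs),
    )

    # Row-wise preprocessing, executed in parallel by Ray Data.
    ds = ds.map(tokenize_row)

    # Stand-in for feeding a training loop: batches are streamed, not fully
    # materialized up front.
    for batch in ds.iter_batches(batch_size=256):
        pass  # a real pipeline would hand `batch` to the model here

    # Append a small result set back to Iceberg.
    ds.select_columns(["id", "num_tokens"]).write_iceberg(
        table_identifier=target_table,
        catalog_kwargs=dict(catalog_kwargs),
    )
```

In a real training job, the batch loop would typically be replaced by passing the dataset to a distributed trainer so that each GPU worker receives its own shard of the stream.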

Heterogeneous Compute

Ray's core advantage over Spark for ML workloads is its native understanding of heterogeneous hardware. A single Ray cluster can simultaneously manage CPU tasks (data preprocessing), GPU tasks (training), and memory-intensive tasks (embedding), automatically scheduling work onto the appropriate hardware. This unified environment makes Ray the natural "glue layer" connecting the open Iceberg lakehouse to the AI infrastructure layer.
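One way to picture this scheduling model is to attach a resource request to each stage and let Ray place the work. The sketch below assumes illustrative stage names and resource amounts; num_cpus, num_gpus, and memory (in bytes) are standard options of ray.remote, while STAGE_RESOURCES, clean_text, and submit are hypothetical helpers introduced only for this example.

```python
from typing import Any, Callable, Dict

# Illustrative per-stage resource requests (the amounts are assumptions).
STAGE_RESOURCES: Dict[str, Dict[str, Any]] = {
    "preprocess": {"num_cpus": 1},
    "embed": {"num_cpus": 1, "memory": 4 * 1024**3},  # 4 GiB, in bytes
    "train": {"num_gpus": 1},
}


def clean_text(text: str) -> str:
    """Example CPU-bound preprocessing function."""
    return text.strip().lower()


def submit(stage: str, fn: Callable[..., Any], *args: Any) -> Any:
    """Run `fn` as a Ray task with the resources declared for `stage`."""
    import ray  # imported lazily so the table above is usable without Ray

    # ray.remote(**options) returns a decorator; applying it to `fn` produces
    # a remote function that Ray schedules onto a node with these resources.
    remote_fn = ray.remote(**STAGE_RESOURCES[stage])(fn)
    return remote_fn.remote(*args)  # returns an ObjectRef
```

On a running cluster, ray.get(submit("preprocess", clean_text, "  Hello ")) would run on any node with a free CPU, while a task submitted under "train" stays pending until a node with a free GPU is available.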
