Ray is an open-source distributed compute framework purpose-built for scaling Python machine learning and AI workloads. Ray Data is the data loading and preprocessing component of the Ray ecosystem, designed specifically to feed large-scale datasets into distributed training, hyperparameter tuning, and batch inference pipelines. In 2025, Ray Data introduced native, production-ready integration with Apache Iceberg.

Ray Data and Apache Iceberg

Ray Data's Iceberg integration is built on PyIceberg under the hood. Organizations can use:

Both functions support predicate pushdown and projection, using Iceberg's manifest-level statistics to minimize the amount of data loaded across the network before the GPU training loop begins.

The AI Lakehouse Data Pipeline

In 2025 and 2026, the most common Ray Data + Iceberg architecture is the full AI Lakehouse data pipeline:

Heterogeneous Compute

Ray's core advantage over Spark for ML workloads is its native understanding of heterogeneous hardware. A single Ray cluster can simultaneously manage CPU tasks (data preprocessing), GPU tasks (training), and memory-intensive tasks (embedding), automatically scheduling work onto the appropriate hardware. This unified environment makes Ray the natural "glue layer" connecting the open Iceberg lakehouse to the AI infrastructure layer.

Master the Agentic Lakehouse

Architecting an Apache Iceberg Lakehouse

Architecting an Apache Iceberg Lakehouse

Buy on Manning
The AI Lakehouse

The AI Lakehouse

Buy on Amazon