Ray Data Benchmarks

This page documents benchmark results and methodologies for evaluating Ray Data performance across a variety of data modalities and workloads.

Workload Summary

Image Classification: Processing 800k ImageNet images using ResNet18. The pipeline downloads images, deserializes them, applies transformations, runs ResNet18 inference on GPU, and outputs predicted labels.
Document Embedding: Processing 10k PDF documents from Digital Corpora. The pipeline reads PDF documents, extracts text page by page, splits the text into overlapping chunks, embeds the chunks using an all-MiniLM-L6-v2 model on GPU, and outputs embeddings with metadata.
Audio Transcription: Transcribing 113,800 audio files from the Mozilla Common Voice 17 dataset using a Whisper-tiny model. The pipeline loads FLAC audio files, resamples them to 16 kHz, extracts features using Whisper’s processor, runs GPU-accelerated batch inference with the model, and outputs transcriptions with metadata.
Video Object Detection: Processing 10k video frames from the Hollywood2 action videos dataset using YOLOv11n for object detection. The pipeline loads video frames, resizes them to 640x640, runs batch inference with YOLO to detect objects, extracts individual object crops, and outputs object metadata and cropped images in Parquet format.
Large-scale Image Embedding: Processing 4 TiB of base64-encoded images from a Parquet dataset using ViT for image embedding. The pipeline decodes base64 images, converts them to RGB, preprocesses them using ViTImageProcessor (resizing, normalization), runs GPU-accelerated batch inference with ViT to generate embeddings, and outputs results in Parquet format.
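
As an illustration of the "splits into chunks with overlap" step in the Document Embedding pipeline, the following is a minimal sketch of character-based chunking. The function name chunk_text and the chunk_size and overlap defaults are hypothetical; this page does not state the values or the splitting strategy the benchmark actually uses.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into character windows that share `overlap` characters.

    Hypothetical helper: the benchmark's real chunking logic and
    parameters are not given on this page.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    # Each window starts `step` characters after the previous one,
    # so consecutive windows overlap by `overlap` characters.
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
```

In a Ray Data pipeline, a function like this could plausibly be applied per page with Dataset.flat_map so that each page yields several chunk rows, though the benchmark code may structure this step differently.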

The benchmarks compare Ray Data 2.50 with Daft 0.6.2, an open-source multimodal data processing library built on Ray.

Results Summary

Multimodal Inference Benchmark Results

Workload	Daft (s)	Ray Data (s)
Image Classification	195.3 ± 2.5	111.2 ± 1.2
Document Embedding	51.3 ± 1.3	29.4 ± 0.8
Audio Transcription	510.5 ± 10.4	312.6 ± 3.1
Video Object Detection	735.3 ± 7.6	623 ± 1.4
Large-scale Image Embedding	752.75 ± 5.5	105.81 ± 0.79

Each result is the mean ± standard deviation across 4 runs. Before the timed runs, a warmup run downloads the model and absorbs startup overhead that would otherwise affect the results.
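
The measurement procedure described here (an untimed warmup followed by repeated timed runs, reported as mean ± standard deviation) can be sketched as below. This is not the benchmark's actual harness: the names benchmark and format_result are hypothetical, and the use of the sample standard deviation is an assumption, since this page does not say which estimator was used.

```python
import statistics
import time
from typing import Callable, Tuple


def benchmark(workload: Callable[[], None], runs: int = 4, warmup: int = 1) -> Tuple[float, float]:
    """Return (mean, standard deviation) of wall-clock seconds over `runs` timed runs."""
    # Warmup runs execute the workload without timing it, so one-time costs
    # (for example, downloading a model) stay out of the measurements.
    for _ in range(warmup):
        workload()

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - start)

    # statistics.stdev is the sample standard deviation (n - 1 denominator).
    # Assumption: the source does not specify sample vs. population.
    return statistics.mean(timings), statistics.stdev(timings)


def format_result(mean: float, std: float) -> str:
    """Format a result in the 'mean ± std' style used in the tables."""
    return f"{mean:.1f} ± {std:.1f}"


if __name__ == "__main__":
    # Stand-in workload; a real run would execute the full pipeline.
    mean, std = benchmark(lambda: time.sleep(0.01))
    print(format_result(mean, std))
```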

Workload Configuration

Workload	Dataset	Data Path	Cluster Configuration	Code
Image Classification	800k images from ImageNet	s3://ray-example-data/imagenet/metadata_file.parquet	1 head, 8 workers of varying instance types	Link
Document Embedding	10k PDFs from Digital Corpora	s3://ray-example-data/digitalcorpora/metadata	g6.xlarge head, 8 g6.xlarge workers	Link
Audio Transcription	113,800 audio files from Mozilla Common Voice 17 en dataset	s3://air-example-data/common_voice_17/parquet/	g6.xlarge head, 8 g6.xlarge workers	Link
Video Object Detection	1,000 videos from Hollywood-2 Human Actions dataset	s3://ray-example-data/videos/Hollywood2-actions-videos/Hollywood2/AVIClips/	1 head, 8 workers of varying instance types	Link
Large-scale Image Embedding	4 TiB of Parquet files containing base64-encoded images	s3://ray-example-data/image-datasets/10TiB-b64encoded-images-in-parquet-v3/	m5.24xlarge (head), 40 g6e.xlarge (GPU workers), 64 r6i.8xlarge (CPU workers)	Link

Image Classification across different instance types

This experiment compares the performance of Ray Data with Daft on the image classification workload across a variety of instance types. Each result is the mean ± standard deviation across 3 runs. Before the timed runs, a warmup run downloads the model and absorbs startup overhead that would otherwise affect the results.

	g6.xlarge (4 CPUs)	g6.2xlarge (8 CPUs)	g6.4xlarge (16 CPUs)	g6.8xlarge (32 CPUs)
Ray Data (s)	456.2 ± 39.9	195.5 ± 7.6	144.8 ± 1.9	111.2 ± 1.2
Daft (s)	315.0 ± 31.2	202.0 ± 2.2	195.0 ± 6.6	195.3 ± 2.5
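
To make the comparison in the table above easier to read, the following sketch computes the ratio of Daft runtime to Ray Data runtime for each instance type, using only the mean values from the table. The variable names are illustrative, and the ratio ignores the reported standard deviations.

```python
# Mean runtimes in seconds, copied from the image classification table.
instance_types = ["g6.xlarge", "g6.2xlarge", "g6.4xlarge", "g6.8xlarge"]
ray_data_s = [456.2, 195.5, 144.8, 111.2]
daft_s = [315.0, 202.0, 195.0, 195.3]

# Ratio > 1 means Ray Data finished faster; ratio < 1 means Daft did.
ratios = {
    name: daft / ray_data
    for name, daft, ray_data in zip(instance_types, daft_s, ray_data_s)
}

if __name__ == "__main__":
    for name, ratio in ratios.items():
        print(f"{name}: Daft / Ray Data = {ratio:.2f}")
```

On these numbers the ratio is about 0.69 on g6.xlarge, 1.03 on g6.2xlarge, 1.35 on g6.4xlarge, and 1.76 on g6.8xlarge: Daft is faster on the smallest instance type, and Ray Data is faster on the three larger ones.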

Video Object Detection across different instance types

This experiment compares the performance of Ray Data with Daft on the video object detection workload across a variety of instance types. Each result is the mean ± standard deviation across 4 runs. Before the timed runs, a warmup run downloads the model and absorbs startup overhead that would otherwise affect the results.

	g6.xlarge (4 CPUs)	g6.2xlarge (8 CPUs)	g6.4xlarge (16 CPUs)	g6.8xlarge (32 CPUs)
Ray Data (s)	922 ± 13.8	704.8 ± 25.0	629 ± 1.8	623 ± 1.4
Daft (s)	758.8 ± 10.4	735.3 ± 7.6	747.5 ± 13.4	771.3 ± 25.6