Audio data curation with Ray
This example demonstrates how to build a scalable, end-to-end audio curation pipeline with Ray. The pipeline performs the following steps:
Stream the English validation split of Common Voice 11.0 into a Ray Dataset.
Resample each clip to 16 kHz for compatibility with Whisper.
Transcribe the audio with the openai/whisper-large-v3-turbo model.
Judge the educational quality of each transcription with a small Llama-3 model.
Persist only clips that score ≥ 3 to a Parquet dataset.
Because this example expresses every step as a Ray transformation, the same script scales from a laptop to a multi-node GPU cluster.
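The steps above can be sketched as a chain of Ray Data transformations. This is a minimal illustration, not the code in the example script: the `score` column name, the `resample`, `transcriber`, and `judge` arguments, the batch size, and the one-GPU-per-stage settings are all assumptions made for the sketch. It uses only the `map`, `map_batches`, and `filter` methods of a Ray Dataset.

```python
from typing import Any, Callable, Dict

MIN_SCORE = 3  # threshold from the steps above: keep clips that score >= 3


def keep_high_quality(row: Dict[str, Any]) -> bool:
    """Row-level predicate for the final filter step.

    Assumes the judge stage wrote a numeric "score" column (hypothetical name).
    """
    return row["score"] >= MIN_SCORE


def build_pipeline(ds, resample: Callable, transcriber: type, judge: type,
                   batch_size: int = 8):
    """Chain the curation stages lazily on a Ray Dataset.

    `resample` is a per-row function. `transcriber` and `judge` are callable
    classes that load their model once in __init__ and process one batch per
    __call__. All three are placeholders supplied by the caller.
    """
    return (
        ds.map(resample)  # resample each clip to 16 kHz
        .map_batches(transcriber, batch_size=batch_size, concurrency=1, num_gpus=1)
        .map_batches(judge, batch_size=batch_size, concurrency=1, num_gpus=1)
        .filter(keep_high_quality)  # drop clips that score below the threshold
    )
```

Nothing runs until the result is consumed. Calling `write_parquet` on the returned dataset with an output directory of your choice would trigger execution and persist the surviving rows. Raising `concurrency` is one way to spread each model stage across more GPUs.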
Quickstart
# Install dependencies.
pip install -q "ray[data]==2.23.0" "torch==2.5.1" "torchaudio==2.2.3" \
"transformers==4.47.1" "datasets==2.18.0"
# Run the pipeline locally.
python e2e_audio/curation.py