Audio data curation with Ray


This example demonstrates how to build a scalable, end-to-end audio curation pipeline with Ray. The pipeline performs the following steps:

  1. Stream the English validation split of Common Voice 11.0 into a Ray Dataset.

  2. Resample each clip to 16 kHz for compatibility with Whisper.

  3. Transcribe the audio with the openai/whisper-large-v3-turbo model.

  4. Judge the educational quality of each transcription with a small Llama-3 model.

  5. Persist only clips that score ≥ 3 to a Parquet dataset.

Because this example expresses every step as a Ray Data transformation, the same script scales seamlessly from a laptop to a multi-node GPU cluster. A condensed sketch of the pipeline follows.
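
The sketch below shows how the five steps map onto Ray Data operations (map, map_batches with GPU actors, filter, write_parquet). It is illustrative rather than the actual contents of e2e_audio/curation.py: the judge checkpoint (meta-llama/Llama-3.2-1B-Instruct), the scoring prompt, the output path, and the assumption that the streamed audio column arrives as a decoded {array, sampling_rate} struct are all placeholders. Common Voice 11.0 and Llama-3 checkpoints are gated on the Hugging Face Hub, so accept their terms and authenticate before running.

# curation_sketch.py -- illustrative, condensed version of the pipeline.
# The judge checkpoint, prompt, and output path are assumptions, not the
# exact contents of e2e_audio/curation.py.
import numpy as np
import ray
import torch
import torchaudio.functional as F
from datasets import load_dataset
from transformers import pipeline

TARGET_SR = 16_000                      # Whisper expects 16 kHz input.
NUM_GPUS = 1 if torch.cuda.is_available() else 0

# 1. Stream the English validation split of Common Voice 11.0 into a Ray Dataset.
hf_ds = load_dataset(
    "mozilla-foundation/common_voice_11_0", "en", split="validation", streaming=True
)
ds = ray.data.from_huggingface(hf_ds)

# 2. Resample each clip to 16 kHz.
def resample(row):
    audio = row["audio"]                # Decoded {path, array, sampling_rate} struct.
    waveform = torch.tensor(audio["array"], dtype=torch.float32)
    resampled = F.resample(waveform, audio["sampling_rate"], TARGET_SR)
    return {"audio": resampled.numpy(), "sampling_rate": TARGET_SR}

ds = ds.map(resample)

# 3. Transcribe with Whisper; one model replica runs inside each actor.
class WhisperTranscriber:
    def __init__(self):
        self.asr = pipeline(
            "automatic-speech-recognition",
            model="openai/whisper-large-v3-turbo",
            device="cuda" if torch.cuda.is_available() else "cpu",
        )

    def __call__(self, batch):
        inputs = [{"array": a, "sampling_rate": TARGET_SR} for a in batch["audio"]]
        batch["transcription"] = [out["text"] for out in self.asr(inputs)]
        return batch

ds = ds.map_batches(WhisperTranscriber, batch_size=8, num_gpus=NUM_GPUS, concurrency=1)

# 4. Judge the educational quality of each transcription with a small Llama-3
#    model (the exact checkpoint used here is an assumption).
class QualityJudge:
    PROMPT = (
        "Rate the educational value of this transcript on a scale of 1 to 5. "
        "Answer with a single digit.\n\nTranscript: {text}\nScore:"
    )

    def __init__(self):
        self.llm = pipeline(
            "text-generation",
            model="meta-llama/Llama-3.2-1B-Instruct",
            device="cuda" if torch.cuda.is_available() else "cpu",
        )

    def __call__(self, batch):
        scores = []
        for text in batch["transcription"]:
            reply = self.llm(
                self.PROMPT.format(text=text),
                max_new_tokens=4,
                return_full_text=False,
            )[0]["generated_text"]
            digits = [c for c in reply if c.isdigit()]
            scores.append(int(digits[0]) if digits else 1)  # Default to 1 if unparseable.
        batch["score"] = np.array(scores)
        return batch

ds = ds.map_batches(QualityJudge, batch_size=8, num_gpus=NUM_GPUS, concurrency=1)

# 5. Keep only clips that score >= 3 and persist them to Parquet.
ds = ds.filter(lambda row: row["score"] >= 3)
ds.write_parquet("/tmp/curated_common_voice")   # Placeholder output path.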

Quickstart

# Install dependencies.
pip install -q "ray[data]==2.23.0" "torch==2.5.1" "torchaudio==2.5.1" \
              "transformers==4.47.1" "datasets==2.18.0"

# Run the pipeline locally.
python e2e_audio/curation.py
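
Once the run finishes, you can read the curated Parquet output back into a Ray Dataset to verify the filtering. The path below is the placeholder used in the sketch above; substitute whatever output path your script writes to.

# Inspect the curated output (path is a placeholder).
import ray

curated = ray.data.read_parquet("/tmp/curated_common_voice")
print(f"{curated.count()} clips scored >= 3")
curated.show(3)  # Print a few rows with their transcriptions and scores.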