ray.data.from_huggingface

ray.data.from_huggingface(dataset: datasets.Dataset | datasets.IterableDataset) → MaterializedDataset | Dataset

Create a MaterializedDataset from a Hugging Face Datasets Dataset, or a Dataset from a Hugging Face Datasets IterableDataset. An IterableDataset is read with a streaming implementation.

Example

import ray
import datasets

# load_dataset returns a dict of splits, so pass a single split to Ray.
hf_dataset = datasets.load_dataset("tweet_eval", "emotion")
ray_ds = ray.data.from_huggingface(hf_dataset["train"])
print(ray_ds)

# With streaming=True each split is an IterableDataset, read via streaming.
hf_dataset_stream = datasets.load_dataset("tweet_eval", "emotion", streaming=True)
ray_ds_stream = ray.data.from_huggingface(hf_dataset_stream["train"])
print(ray_ds_stream)
MaterializedDataset(
    num_blocks=...,
    num_rows=3257,
    schema={text: string, label: int64}
)
Dataset(
    num_blocks=...,
    num_rows=3257,
    schema={text: string, label: int64}
)
Parameters:

dataset – A Hugging Face Datasets Dataset or Hugging Face Datasets IterableDataset. DatasetDict and IterableDatasetDict are not supported.

Returns:

A MaterializedDataset holding rows from the Hugging Face Datasets Dataset, or a Dataset that streams rows from the Hugging Face Datasets IterableDataset.
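
Because DatasetDict and IterableDatasetDict are not supported, callers that receive the result of datasets.load_dataset without a split argument need to select one split before converting. The following is a minimal sketch of one way to do that, not part of the Ray API: pick_split and to_ray are hypothetical helper names, and the check relies on the assumption that both dict types are dict subclasses while a single Dataset or IterableDataset is not.

```python
from collections.abc import Mapping


def pick_split(hf_obj, split="train"):
    """Return one split from a dict of splits, or hf_obj unchanged.

    Hypothetical helper: assumes DatasetDict / IterableDatasetDict are
    Mapping (dict) subclasses and that a single dataset is not.
    """
    if isinstance(hf_obj, Mapping):
        if split not in hf_obj:
            available = ", ".join(sorted(hf_obj))
            raise KeyError(f"split {split!r} not found; available: {available}")
        return hf_obj[split]
    return hf_obj


def to_ray(hf_obj, split="train"):
    """Convert a Hugging Face dataset (or one split of a dict) to Ray Data."""
    import ray  # imported lazily so pick_split can be used without Ray

    return ray.data.from_huggingface(pick_split(hf_obj, split))
```

With these helpers, to_ray(datasets.load_dataset("tweet_eval", "emotion"), split="train") would pass only the "train" split to ray.data.from_huggingface, and a missing split name fails early with the list of available splits.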