ray.data.from_huggingface

ray.data.from_huggingface(dataset: Union[datasets.Dataset, datasets.IterableDataset]) → Union[ray.data.dataset.MaterializedDataset, ray.data.dataset.Dataset]

Create a MaterializedDataset from a Hugging Face Datasets Dataset, or a Dataset from a Hugging Face Datasets IterableDataset. An IterableDataset is read with a streaming implementation.

Example

import ray
import datasets

hf_dataset = datasets.load_dataset("tweet_eval", "emotion")
ray_ds = ray.data.from_huggingface(hf_dataset["train"])
print(ray_ds)

hf_dataset_stream = datasets.load_dataset("tweet_eval", "emotion", streaming=True)
ray_ds_stream = ray.data.from_huggingface(hf_dataset_stream["train"])
print(ray_ds_stream)

MaterializedDataset(
    num_blocks=...,
    num_rows=3257,
    schema={text: string, label: int64}
)
Dataset(
    num_blocks=...,
    num_rows=3257,
    schema={text: string, label: int64}
)

Parameters

dataset – A Hugging Face Datasets Dataset or Hugging Face Datasets IterableDataset. DatasetDict and IterableDatasetDict are not supported.
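
Because DatasetDict and IterableDatasetDict are not supported, one workaround is to convert each split on its own. The following is a minimal sketch, not part of the Ray API: convert_splits is a hypothetical helper that takes the converter as an argument, and it assumes the input is dict-like (both Hugging Face dict classes map split names to datasets).

```python
from typing import Any, Callable, Dict, Mapping


def convert_splits(
    splits: Mapping[str, Any],
    convert: Callable[[Any], Any],
) -> Dict[str, Any]:
    """Apply `convert` to each split and return a plain dict keyed by split name.

    Hypothetical helper for illustration; it is not provided by Ray or
    Hugging Face Datasets.
    """
    return {name: convert(split) for name, split in splits.items()}


# Intended usage (assumes `ray` and `datasets` are installed):
#   import datasets
#   import ray
#
#   hf_splits = datasets.load_dataset("tweet_eval", "emotion")  # a DatasetDict
#   ray_splits = convert_splits(hf_splits, ray.data.from_huggingface)
#   train_ds = ray_splits["train"]
```

Passing the converter in keeps the helper independent of Ray, so the same function should work for an IterableDatasetDict loaded with streaming=True.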

Returns

A MaterializedDataset holding rows from the Hugging Face Datasets Dataset, or a Dataset that streams rows from the Hugging Face Datasets IterableDataset.