ray.data.from_huggingface

ray.data.from_huggingface(dataset: Union[datasets.Dataset, datasets.DatasetDict]) → Union[ray.data.dataset.MaterializedDataset, Dict[str, ray.data.dataset.MaterializedDataset]]

Create a Ray Dataset from a Hugging Face Dataset or DatasetDict.

This function is not parallelized. It is intended for Hugging Face Datasets that are loaded into memory (as opposed to memory-mapped).

Example:

>>> import ray
>>> import datasets
>>> hf_dataset = datasets.load_dataset("tweet_eval", "emotion")
Downloading ...
>>> ray_ds = ray.data.from_huggingface(hf_dataset)
>>> ray_ds
{'train': MaterializedDataset(
   num_blocks=1,
   num_rows=3257,
   schema={text: string, label: int64}
), 'test': MaterializedDataset(
   num_blocks=1,
   num_rows=1421,
   schema={text: string, label: int64}
), 'validation': MaterializedDataset(
   num_blocks=1,
   num_rows=374,
   schema={text: string, label: int64}
)}
>>> ray_ds = ray.data.from_huggingface(hf_dataset["train"])
>>> ray_ds
MaterializedDataset(
   num_blocks=1,
   num_rows=3257,
   schema={text: string, label: int64}
)
Parameters

dataset – A Hugging Face Dataset or DatasetDict. IterableDataset is not supported.

Returns

A Dataset holding Arrow records from the Hugging Face Dataset, or a dict of Datasets if dataset is a DatasetDict.
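
Because the return type depends on whether the input was a single Dataset or a DatasetDict, calling code that accepts both may want to normalize the result first. The following is a minimal sketch of one way to do that. The helper name `as_split_dict` and its `default_split` parameter are hypothetical and not part of Ray; the helper relies only on the return value being either a dict or a single dataset object, as the signature above states.

```python
from typing import Any, Dict


def as_split_dict(result: Any, default_split: str = "train") -> Dict[str, Any]:
    """Return a {split_name: dataset} mapping for either return shape.

    `as_split_dict` is a hypothetical helper, not a Ray API. It assumes
    `result` is what ray.data.from_huggingface returned: a dict of
    datasets (DatasetDict input) or a single dataset (Dataset input).
    """
    if isinstance(result, dict):
        # DatasetDict input: already keyed by split name; copy it.
        return dict(result)
    # Single Dataset input: wrap it under an assumed split name.
    return {default_split: result}


# Usage sketch (requires ray and datasets to be installed):
#
#   ray_ds = ray.data.from_huggingface(hf_dataset)
#   for split, ds in as_split_dict(ray_ds).items():
#       print(split, ds.count())
```

With this wrapper, downstream code can iterate over splits uniformly without checking the input type at each call site.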

PublicAPI: This API is stable across Ray releases.