ray.data.from_huggingface
ray.data.from_huggingface(dataset: Union[datasets.Dataset, datasets.DatasetDict]) -> Union[ray.data.dataset.MaterializedDataset, Dict[str, ray.data.dataset.MaterializedDataset]]
Create a Ray dataset from a Hugging Face Datasets Dataset or DatasetDict.
This function is not parallelized, and is intended to be used with Hugging Face Datasets that are loaded into memory (as opposed to memory-mapped).
Example:
>>> import ray
>>> import datasets
>>> hf_dataset = datasets.load_dataset("tweet_eval", "emotion")
Downloading ...
>>> ray_ds = ray.data.from_huggingface(hf_dataset)
>>> ray_ds
{'train': MaterializedDataset(
   num_blocks=1,
   num_rows=3257,
   schema={text: string, label: int64}
), 'test': MaterializedDataset(
   num_blocks=1,
   num_rows=1421,
   schema={text: string, label: int64}
), 'validation': MaterializedDataset(
   num_blocks=1,
   num_rows=374,
   schema={text: string, label: int64}
)}
>>> ray_ds = ray.data.from_huggingface(hf_dataset["train"])
>>> ray_ds
MaterializedDataset(
   num_blocks=1,
   num_rows=3257,
   schema={text: string, label: int64}
)
- Parameters
dataset – A Hugging Face Dataset, or DatasetDict. IterableDataset is not supported.
- Returns
Dataset holding Arrow records from the Hugging Face Dataset, or a dict of datasets in case dataset is a DatasetDict.
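Because the return value is either a single dataset or a dict of datasets, calling code often has to branch on the shape. The following is a minimal sketch of one way to normalize both cases into a dict keyed by split name, based only on the return type documented here. The names `as_split_dict`, `default_split`, and `from_huggingface_splits` are hypothetical helpers for illustration and are not part of Ray. The up-front check for iterable datasets is an assumption about how a caller might surface the documented "not supported" restriction early.

```python
from typing import Any, Dict


def as_split_dict(result: Any, default_split: str = "train") -> Dict[str, Any]:
    """Return a dict of split name -> dataset for either documented return shape.

    Hypothetical helper, not part of Ray.
    """
    if isinstance(result, dict):
        # A DatasetDict input yields a dict keyed by split name; copy it.
        return dict(result)
    # A single Dataset input yields a single dataset; wrap it under one key.
    return {default_split: result}


def from_huggingface_splits(hf_obj: Any) -> Dict[str, Any]:
    """Convert a Hugging Face Dataset or DatasetDict and normalize the result.

    Hypothetical wrapper, not part of Ray.
    """
    # Imported lazily so as_split_dict can be used without Ray installed.
    import datasets
    import ray

    # The documentation states that IterableDataset is not supported,
    # so this sketch rejects streaming datasets before calling Ray.
    if isinstance(hf_obj, (datasets.IterableDataset, datasets.IterableDatasetDict)):
        raise TypeError("Streaming (iterable) Hugging Face datasets are not supported.")

    return as_split_dict(ray.data.from_huggingface(hf_obj))
```

With this wrapper, `from_huggingface_splits(hf_dataset)["train"]` would work whether `hf_dataset` is a DatasetDict or a single Dataset, at the cost of labelling a lone Dataset with an assumed split name.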
PublicAPI: This API is stable across Ray releases.