ray.data.from_pandas

ray.data.from_pandas(dfs: pandas.DataFrame | List[pandas.DataFrame], override_num_blocks: int | None = None) → MaterializedDataset

Create a Dataset from a pandas DataFrame or a list of pandas DataFrames.

Examples

>>> import pandas as pd
>>> import ray
>>> df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> ray.data.from_pandas(df)
MaterializedDataset(num_blocks=1, num_rows=3, schema={a: int64, b: int64})

Create a Ray Dataset from a list of pandas DataFrames.

>>> ray.data.from_pandas([df, df])
MaterializedDataset(num_blocks=2, num_rows=6, schema={a: int64, b: int64})

Parameters:
  • dfs – A pandas dataframe or a list of pandas dataframes.

  • override_num_blocks – Override the number of output blocks from all read tasks. By default, the number of output blocks is dynamically decided based on input data size and available resources. You shouldn’t manually set this value in most cases.

Returns:

Dataset holding data read from the dataframes.