ray.data.Dataset.to_dask#

Dataset.to_dask(meta: pandas.DataFrame | pandas.Series | Dict[str, Any] | Iterable[Any] | Tuple[Any] | None = None, verify_meta: bool = True) dask.dataframe.DataFrame[source]#

Convert this Dataset into a Dask DataFrame.

This is only supported for datasets convertible to Arrow records.

Note that this function will set the Dask scheduler to Dask-on-Ray globally, via the config.

Note

This operation will trigger execution of the lazy transformations performed on this dataset.

Time complexity: O(dataset size / parallelism)

Parameters:
  • meta – An empty pandas DataFrame or Series that matches the dtypes and column names of the stream. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. By default, this is inferred from the underlying Dataset schema, with this argument supplying an optional override.

  • verify_meta – If True, Dask will check that the partitions have consistent metadata. Defaults to True.

Returns:

A Dask DataFrame created from this dataset.