ray.data.Datasource#

class ray.data.Datasource(*args, **kwds)[source]#

Bases: Generic[ray.data.block.T]

Interface for defining a custom ray.data.Dataset datasource.

To read a datasource into a dataset, use ray.data.read_datasource(). To write to a writable datasource, use Dataset.write_datasource().

See RangeDatasource and DummyOutputDatasource for examples of how to implement readable and writable datasources.

Datasource instances must be serializable, since create_reader() and do_write() are called in remote tasks.

PublicAPI: This API is stable across Ray releases.

create_reader(**read_args) ray.data.datasource.datasource.Reader[ray.data.block.T][source]#

Return a Reader for the given read arguments.

The reader object will be responsible for querying the read metadata, and generating the actual read tasks to retrieve the data blocks upon request.

Parameters

read_args – Additional kwargs to pass to the datasource impl.

prepare_read(parallelism: int, **read_args) List[ray.data.datasource.datasource.ReadTask[ray.data.block.T]][source]#

Deprecated: Please implement create_reader() instead.

Warning

DEPRECATED: This API is deprecated and may be removed in future Ray releases.

do_write(blocks: List[ray.types.ObjectRef[Union[List[ray.data.block.T], pyarrow.Table, pandas.DataFrame, bytes]]], metadata: List[ray.data.block.BlockMetadata], ray_remote_args: Dict[str, Any], **write_args) List[ray.types.ObjectRef[Any]][source]#

Launch Ray tasks for writing blocks out to the datasource.

Parameters
  • blocks – List of data block references. It is recommended that one write task be generated per block.

  • metadata – List of block metadata.

  • ray_remote_args – Kwargs passed to ray.remote in the write tasks.

  • write_args – Additional kwargs to pass to the datasource impl.

Returns

A list of the output of the write tasks.

on_write_complete(write_results: List[Any], **kwargs) None[source]#

Callback for when a write job completes.

This can be used to β€œcommit” a write output. This method must succeed prior to write_datasource() returning to the user. If this method fails, then on_write_failed() will be called.

Parameters
  • write_results – The list of the write task results.

  • kwargs – Forward-compatibility placeholder.

on_write_failed(write_results: List[ray.types.ObjectRef[Any]], error: Exception, **kwargs) None[source]#

Callback for when a write job fails.

This is called on a best-effort basis on write failures.

Parameters
  • write_results – The list of the write task result futures.

  • error – The first error encountered.

  • kwargs – Forward-compatibility placeholder.