ray.data.read_datasource

ray.data.read_datasource(datasource: Datasource, *, parallelism: int = -1, ray_remote_args: Dict[str, Any] = None, **read_args) -> Dataset

Read a Dataset from a custom Datasource.

Parameters:
  • datasource – The Datasource to read data from.

  • parallelism – The requested parallelism of the read. Parallelism might be limited by the available partitioning of the datasource. If set to -1, parallelism is automatically chosen based on the available cluster resources and estimated in-memory data size.

  • read_args – Additional kwargs to pass to the Datasource implementation.

  • ray_remote_args – Keyword arguments passed to ray.remote() for the read tasks.

Returns:

Dataset that reads data from the Datasource.