ray.data.Datasource#

class ray.data.Datasource[source]#

Bases: _DatasourceProjectionPushdownMixin, _DatasourcePredicatePushdownMixin

Interface for defining a custom Dataset datasource.

User may subclass this class to implement a custom datasource. The subclass should implement get_read_tasks() and estimate_inmemory_data_size() to read the data and estimate the in-memory data size, respectively.

To read a datasource into a dataset, use read_datasource().

Example

>>> from ray.data.context import DataContext
>>> class MyDatasource(Datasource):
...     def __init__(self, num_rows: int = 100):
...         super().__init__()
...         self.num_rows = num_rows
...     def get_read_tasks(
...         self,
...         parallelism: int,
...         per_task_row_limit: int | None = None,
...         data_context: DataContext | None = None,
...     ) -> List["ReadTask"]:
...         # Split num_rows across parallelism tasks
...         rows_per_task = self.num_rows // parallelism
...         return [
...             ReadTask(
...                 lambda: [pa.Table.from_pydict({"data": range(rows_per_task)})],
...                 BlockMetadata(rows_per_task, rows_per_task * 8, None, None),
...             ) for _ in range(parallelism)
...         ]
...     def estimate_inmemory_data_size(self) -> Optional[int]:
...         # Return total size for all data (independent of parallelism)
...         return self.num_rows * 8
>>> ds = MyDatasource(num_rows=100)
>>> tasks = ds.get_read_tasks(parallelism=5)
>>> len(tasks) == 5
True
>>> tasks[0].metadata.num_rows == 20
True
>>> ds.estimate_inmemory_data_size() == sum(t.metadata.size_bytes for t in tasks)
True

Methods

__init__

Initialize the datasource and its mixins.

apply_predicate

Apply a predicate to this datasource.

apply_projection

Apply a projection to this datasource.

create_reader

Deprecated: Implement get_read_tasks() and estimate_inmemory_data_size() instead.

estimate_inmemory_data_size

Return an estimate of the in-memory data size, or None if unknown.

get_column_renames

Return the column renames from the projection map.

get_name

Return a human-readable name for this datasource.

get_projection_map

Return the projection map (original column names -> final column names).

get_read_tasks

Execute the read and return read tasks.

prepare_read

Deprecated: Implement get_read_tasks() and estimate_inmemory_data_size() instead.

supports_projection_pushdown

Returns True in case Datasource supports projection operation being pushed down into the reading layer

Attributes

should_create_reader

Return True if the datasource should create a legacy reader

supports_distributed_reads

If False, only launch read tasks on the driver's node.