ray.data.Datasource#
- class ray.data.Datasource[source]#
Bases:
_DatasourceProjectionPushdownMixin,_DatasourcePredicatePushdownMixinInterface for defining a custom
Datasetdatasource.User may subclass this class to implement a custom datasource. The subclass should implement
get_read_tasks()andestimate_inmemory_data_size()to read the data and estimate the in-memory data size, respectively.To read a datasource into a dataset, use
read_datasource().Example
>>> from ray.data.context import DataContext >>> class MyDatasource(Datasource): ... def __init__(self, num_rows: int = 100): ... super().__init__() ... self.num_rows = num_rows ... def get_read_tasks( ... self, ... parallelism: int, ... per_task_row_limit: int | None = None, ... data_context: DataContext | None = None, ... ) -> List["ReadTask"]: ... # Split num_rows across parallelism tasks ... rows_per_task = self.num_rows // parallelism ... return [ ... ReadTask( ... lambda: [pa.Table.from_pydict({"data": range(rows_per_task)})], ... BlockMetadata(rows_per_task, rows_per_task * 8, None, None), ... ) for _ in range(parallelism) ... ] ... def estimate_inmemory_data_size(self) -> Optional[int]: ... # Return total size for all data (independent of parallelism) ... return self.num_rows * 8 >>> ds = MyDatasource(num_rows=100) >>> tasks = ds.get_read_tasks(parallelism=5) >>> len(tasks) == 5 True >>> tasks[0].metadata.num_rows == 20 True >>> ds.estimate_inmemory_data_size() == sum(t.metadata.size_bytes for t in tasks) True
Methods
Initialize the datasource and its mixins.
Apply a predicate to this datasource.
Apply a projection to this datasource.
Deprecated: Implement
get_read_tasks()andestimate_inmemory_data_size()instead.Return an estimate of the in-memory data size, or None if unknown.
Return the column renames from the projection map.
Return a human-readable name for this datasource.
Return the projection map (original column names -> final column names).
Execute the read and return read tasks.
Deprecated: Implement
get_read_tasks()andestimate_inmemory_data_size()instead.Returns
Truein caseDatasourcesupports projection operation being pushed down into the reading layerAttributes
Return True if the datasource should create a legacy reader
If
False, only launch read tasks on the driver's node.