ray.data.read_delta
- ray.data.read_delta(path: str | List[str], version: int | None = None, *, storage_options: Dict[str, Any] | None = None, filesystem: pyarrow.fs.FileSystem | None = None, columns: List[str] | None = None, parallelism: int = -1, num_cpus: float | None = None, num_gpus: float | None = None, memory: float | None = None, ray_remote_args: Dict[str, Any] | None = None, shuffle: Literal['files'] | None = None, include_paths: bool = False, concurrency: int | None = None, override_num_blocks: int | None = None, **arrow_parquet_args)
Creates a Dataset from a Delta Lake table.

This reader uses the deltalake library to read the Delta transaction log and constructs a PyArrow dataset that preserves the table's unified schema, partition information, and column statistics. This enables:

- Schema evolution support (older files with missing columns are null-filled)
- Correct handling of cloud storage URIs (Azure, S3, GCS)
- Column statistics from the Delta log for row-group pruning
- Authentication via storage_options
Examples
Read a local Delta table:
>>> import ray
>>> ds = ray.data.read_delta("/path/to/delta-table/")
Read from S3 with credentials:
>>> ds = ray.data.read_delta(
...     "s3://bucket/delta-table/",
...     storage_options={
...         "AWS_ACCESS_KEY_ID": "...",
...         "AWS_SECRET_ACCESS_KEY": "...",
...     },
... )
Read from Azure with default credentials:
>>> ds = ray.data.read_delta(
...     "az://container/delta-table/",
...     storage_options={"use_azure_cli": "true"},
... )
- Parameters:
  - path – A single path to a Delta Lake table. Multiple tables are not supported.
  - version – The version of the Delta Lake table to read. If not specified, the latest version is read.
  - storage_options – A dictionary of storage options passed to the deltalake library for authentication and configuration. Supported keys depend on the storage backend (S3, Azure, or GCS options).
  - filesystem – The PyArrow filesystem implementation to read from. These filesystems are specified in the PyArrow docs. Specify this parameter if you need to provide specific configurations to the filesystem. By default, the filesystem is automatically selected based on the scheme of the paths. For example, if the path begins with s3://, the S3FileSystem is used. If None, this function uses a system-chosen implementation.
  - columns – A list of column names to read. Only the specified columns are read during the file scan.
  - parallelism – This argument is deprecated. Use the override_num_blocks argument instead.
  - num_cpus – The number of CPUs to reserve for each parallel read worker.
  - num_gpus – The number of GPUs to reserve for each parallel read worker. For example, specify num_gpus=1 to request 1 GPU for each parallel read worker.
  - memory – The heap memory in bytes to reserve for each parallel read worker.
  - ray_remote_args – kwargs passed to remote() in the read tasks.
  - shuffle – If set to "files", randomly shuffles the input file order before the read. Defaults to None (no shuffle).
  - include_paths – If True, include the path to each file. File paths are stored in the 'path' column.
  - concurrency – The maximum number of Ray tasks to run concurrently. This doesn't change the total number of tasks run or the total number of output blocks. By default, concurrency is decided dynamically based on the available resources.
  - override_num_blocks – Override the number of output blocks from all read tasks. By default, the number of output blocks is decided dynamically based on input data size and available resources. You shouldn't manually set this value in most cases.
  - **arrow_parquet_args – Other Parquet read options to pass to PyArrow. For the full set of arguments, see the PyArrow API.
- Returns:
A Dataset producing records read from the specified Delta Lake table.
PublicAPI (alpha): This API is in alpha and may change before becoming stable.