ray.data.read_delta_sharing_tables

ray.data.read_delta_sharing_tables(url: str, *, limit: int | None = None, version: int | None = None, timestamp: str | None = None, json_predicate_hints: str | None = None, ray_remote_args: Dict[str, Any] | None = None, concurrency: int | None = None, override_num_blocks: int | None = None) -> Dataset

Read data from a Delta Sharing table. For details on the Delta Sharing project, see delta-io/delta-sharing.

This function reads data from a Delta Sharing table specified by the URL. It supports various options such as limiting the number of rows, specifying a version or timestamp, and configuring concurrency.

Before calling this function, ensure that the URL is correctly formatted and points to the Delta Sharing table you want to access. Also make sure that a valid Delta Sharing profile file is available in the working directory.

Examples

import ray

ds = ray.data.read_delta_sharing_tables(
    url="your-profile.json#your-share-name.your-schema-name.your-table-name",
    limit=100000,
    version=1,
)
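Because the URL must follow a fixed format, it can help to assemble it from its parts rather than by hand. The following is a minimal sketch, not part of Ray: build_delta_sharing_url and read_table are hypothetical helper names, and the rule that rejects names containing "." or "#" is an assumption made here to keep the URL unambiguous.

```python
def build_delta_sharing_url(profile_path: str, share: str, schema: str, table: str) -> str:
    """Compose "<profile-file-path>#<share-name>.<schema-name>.<table-name>"."""
    if not profile_path:
        raise ValueError("profile_path must not be empty")
    for label, part in (("share", share), ("schema", schema), ("table", table)):
        # Assumption of this sketch: reject empty names and names containing
        # "." or "#", since they would make the URL ambiguous.
        if not part or "." in part or "#" in part:
            raise ValueError(f"invalid {label} name: {part!r}")
    return f"{profile_path}#{share}.{schema}.{table}"


def read_table(url: str, limit: int = 1000):
    """Read with a row limit; needs Ray and a reachable Delta Sharing server."""
    import ray  # imported lazily so the URL helper works without Ray installed

    return ray.data.read_delta_sharing_tables(url=url, limit=limit)


url = build_delta_sharing_url(
    "your-profile.json", "your-share-name", "your-schema-name", "your-table-name"
)
print(url)
```

Calling read_table(url) then performs the actual read; it is kept in a separate function so that the URL can be checked without contacting a server.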
Parameters:
  • url – A URL in the format “<profile-file-path>#<share-name>.<schema-name>.<table-name>”. For an example, see delta-io/delta-sharing.

  • limit – A non-negative integer. If specified, load only limit rows. Use this optional parameter to explore the shared table without loading the entire table into memory.

  • version – A non-negative integer. Load the snapshot of the table at the specified version.

  • timestamp – A timestamp to specify the version of the table to read.

  • json_predicate_hints – Predicate hints to be applied to the table. For more details, see: delta-io/delta-sharing.

  • ray_remote_args – Keyword arguments passed to ray.remote() in the read tasks.

  • concurrency – The maximum number of Ray tasks to run concurrently. This doesn’t change the total number of tasks run or the total number of output blocks. By default, concurrency is dynamically decided based on the available resources.

  • override_num_blocks – Override the number of output blocks from all read tasks. By default, the number of output blocks is dynamically decided based on input data size and available resources. You shouldn’t manually set this value in most cases.
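Since json_predicate_hints is passed as a string, one way to produce it is to build a dictionary and serialize it with json.dumps. The sketch below is illustrative only: equal_hint and read_with_hint are hypothetical helpers, the column name hireDate is made up, and the op/children layout reflects one reading of the Delta Sharing protocol's predicate hint format, so check it against delta-io/delta-sharing before relying on it.

```python
import json


def equal_hint(column: str, value: str, value_type: str) -> str:
    """Serialize a "column == literal" predicate hint as a JSON string."""
    # Assumed layout: an "equal" op whose children are a column reference
    # and a literal, both carrying the same valueType.
    hint = {
        "op": "equal",
        "children": [
            {"op": "column", "name": column, "valueType": value_type},
            {"op": "literal", "value": value, "valueType": value_type},
        ],
    }
    return json.dumps(hint)


def read_with_hint(url: str, hint: str, concurrency: int = 4):
    """Read with a predicate hint and a cap on concurrently running read tasks."""
    import ray  # imported lazily; needs Ray and a reachable Delta Sharing server

    return ray.data.read_delta_sharing_tables(
        url=url,
        json_predicate_hints=hint,
        concurrency=concurrency,
    )


hint = equal_hint("hireDate", "2021-04-29", "date")
print(hint)
```

Building the hint as a dictionary first avoids hand-written quoting mistakes in the JSON string.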

Returns:

A Dataset containing the queried data.

Raises:

ValueError – If the URL is not properly formatted or if there is an issue with the Delta Sharing table connection.
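Callers that want to continue after a failed read can catch the ValueError. The following is a minimal sketch under stated assumptions: read_or_none is a hypothetical wrapper, and the reader is passed in as a callable so that the pattern can be tried without a Delta Sharing server.

```python
from typing import Any, Callable, Optional


def read_or_none(url: str, reader: Callable[[str], Any]) -> Optional[Any]:
    """Call reader(url); return None instead of propagating ValueError."""
    try:
        return reader(url)
    except ValueError as exc:
        # Raised for a malformed URL or a problem with the table connection.
        print(f"Could not read {url!r}: {exc}")
        return None


def fake_reader(url: str) -> Any:
    """Stand-in reader used only to demonstrate the failure path."""
    raise ValueError("URL is not properly formatted")


result = read_or_none("not-a-valid-url", fake_reader)
print(result)
```

With Ray installed, the reader could be lambda u: ray.data.read_delta_sharing_tables(url=u).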

PublicAPI (alpha): This API is in alpha and may change before becoming stable.