ray.data.read_delta_sharing_tables
- ray.data.read_delta_sharing_tables(url: str, *, limit: int | None = None, version: int | None = None, timestamp: str | None = None, json_predicate_hints: str | None = None, ray_remote_args: Dict[str, Any] | None = None, concurrency: int | None = None, override_num_blocks: int | None = None) -> Dataset
Read data from a Delta Sharing table. Delta Sharing project: delta-io/delta-sharing
This function reads data from a Delta Sharing table specified by the URL. It supports various options such as limiting the number of rows, specifying a version or timestamp, and configuring concurrency.
Before calling this function, ensure that the URL is correctly formatted to point to the Delta Sharing table you want to access. Make sure you have a valid delta_share profile in the working directory.
Examples
import ray

ds = ray.data.read_delta_sharing_tables(
    url="your-profile.json#your-share-name.your-schema-name.your-table-name",
    limit=100000,
    version=1,
)
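Because the function expects the URL in the form <profile-file-path>#<share-name>.<schema-name>.<table-name>, it can help to assemble that string from its parts instead of typing it by hand. The following is a minimal sketch, not part of Ray: the helper name build_delta_sharing_url is hypothetical, and rejecting "." and "#" inside a name is an assumption made here only to keep the resulting URL unambiguous.

```python
def build_delta_sharing_url(profile_path: str, share: str, schema: str, table: str) -> str:
    """Return '<profile-file-path>#<share-name>.<schema-name>.<table-name>'.

    Hypothetical helper for illustration; it is not part of Ray or delta-sharing.
    """
    for label, value in (("share", share), ("schema", schema), ("table", table)):
        # Assumption of this sketch: names containing '.' or '#' would make
        # the URL ambiguous, so they are rejected up front.
        if not value or "." in value or "#" in value:
            raise ValueError(f"invalid {label} name: {value!r}")
    return f"{profile_path}#{share}.{schema}.{table}"


url = build_delta_sharing_url(
    "your-profile.json", "your-share-name", "your-schema-name", "your-table-name"
)
# The result can then be passed as ray.data.read_delta_sharing_tables(url=url, ...).
```

Validating the pieces before the read starts surfaces a malformed name as a local ValueError rather than as a failure inside the read itself.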
- Parameters:
url – A URL in the format “<profile-file-path>#<share-name>.<schema-name>.<table-name>”. An example can be found at delta-io/delta-sharing.
limit – A non-negative integer. Load only the limit rows if the parameter is specified. Use this optional parameter to explore the shared table without loading the entire table into memory.
version – A non-negative integer. Load the snapshot of the table at the specified version.
timestamp – A timestamp to specify the version of the table to read.
json_predicate_hints – Predicate hints to be applied to the table. For more details, see: delta-io/delta-sharing.
ray_remote_args – kwargs passed to ray.remote() in the read tasks.
concurrency – The maximum number of Ray tasks to run concurrently. This doesn’t change the total number of tasks run or the total number of output blocks. By default, concurrency is dynamically decided based on the available resources.
override_num_blocks – Override the number of output blocks from all read tasks. By default, the number of output blocks is dynamically decided based on input data size and available resources. You shouldn’t manually set this value in most cases.
- Returns:
A Dataset containing the queried data.
- Raises:
ValueError – If the URL is not properly formatted or if there is an issue with the Delta Sharing table connection.
PublicAPI (alpha): This API is in alpha and may change before becoming stable.
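The json_predicate_hints parameter takes a string, so a hint is usually built as a Python dict and serialized with json.dumps. The sketch below assumes the operator-tree layout described in the Delta Sharing protocol for jsonPredicateHints (an "op" node with "column" and "literal" children); check the protocol documentation at delta-io/delta-sharing before relying on it. The helper name equal_hint and the column hireDate are hypothetical.

```python
import json


def equal_hint(column: str, value: str, value_type: str) -> str:
    """Serialize a 'column == literal' predicate hint to a JSON string.

    Hypothetical helper; the layout follows the jsonPredicateHints format
    as understood from the Delta Sharing protocol and is an assumption here.
    """
    hint = {
        "op": "equal",
        "children": [
            {"op": "column", "name": column, "valueType": value_type},
            {"op": "literal", "value": value, "valueType": value_type},
        ],
    }
    return json.dumps(hint)


hints = equal_hint("hireDate", "2021-04-29", "date")
# The string can then be passed as
# ray.data.read_delta_sharing_tables(url=..., json_predicate_hints=hints).
```

As the name suggests, these are hints: a sharing server may use them to skip files but may also return rows that do not match, so any filter that must hold should be applied again on the resulting Dataset.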