ray.data.read_unity_catalog#
- ray.data.read_unity_catalog(table: str, url: str, token: str, *, data_format: str | None = None, region: str | None = None, reader_kwargs: dict | None = None) → Dataset[source]#
Loads a Unity Catalog table or files into a Ray Dataset using Databricks Unity Catalog credential vending. Short-lived cloud credentials are handed off automatically, enabling secure, parallel, distributed access from external engines.
This function works by leveraging Unity Catalog’s credential vending feature, which grants temporary, least-privilege credentials for the cloud storage location backing the requested table or data files. It authenticates via the Unity Catalog REST API (Unity Catalog credential vending for external system access, Databricks Docs), ensuring that permissions are enforced at the Databricks principal (user, group, or service principal) making the request. The function supports reading data directly from AWS S3, Azure Data Lake, or GCP GCS in standard formats including Delta and Parquet.
Note
This function is experimental and under active development.
Examples
Read a Unity Catalog Delta table:
>>> import ray
>>> ds = ray.data.read_unity_catalog(
...     table="main.sales.transactions",
...     url="https://dbc-XXXXXXX-XXXX.cloud.databricks.com",
...     token="dapi...",
...     region="us-west-2",
... )
>>> ds.show(3)
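Arguments for the underlying Ray Data reader can be forwarded through reader_kwargs. A minimal sketch, assuming a Parquet-backed table whose workspace URL, token, and table name are placeholders, and assuming the underlying reader accepts a "columns" projection (as ray.data.read_parquet does):

```python
import ray

# Sketch only: requires a live Databricks workspace and a valid token.
# The table name below is hypothetical; "columns" is forwarded via
# reader_kwargs to the underlying Parquet reader to project two columns.
ds = ray.data.read_unity_catalog(
    table="main.analytics.events",  # hypothetical table
    url="https://dbc-XXXXXXX-XXXX.cloud.databricks.com",
    token="dapi...",  # Databricks personal access token
    data_format="parquet",  # skip inference from table metadata
    region="us-west-2",  # required for S3-backed tables
    reader_kwargs={"columns": ["event_id", "ts"]},
)
```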
- Parameters:
  - table – Unity Catalog table path in the format "catalog.schema.table".
  - url – Databricks workspace URL (e.g., "https://dbc-XXXXXXX-XXXX.cloud.databricks.com").
  - token – Databricks Personal Access Token with the EXTERNAL USE SCHEMA permission.
  - data_format – Data format ("delta" or "parquet"). If not specified, inferred from table metadata.
  - region – AWS region for S3 access (e.g., "us-west-2"). Required for AWS; not needed for Azure/GCP.
  - reader_kwargs – Additional arguments passed to the underlying Ray Data reader.
- Returns:
  - A Dataset containing the data from Unity Catalog.
PublicAPI (alpha): This API is in alpha and may change before becoming stable.