ray.data.read_unity_catalog#

ray.data.read_unity_catalog(table: str, url: str | None = None, token: str | None = None, *, credential_provider: DatabricksCredentialProvider | None = None, data_format: str | None = None, region: str | None = None, reader_kwargs: dict | None = None) → Dataset[source]#

Loads a Unity Catalog table or set of files into a Ray Dataset using Databricks Unity Catalog credential vending, which hands short-lived cloud credentials to external engines for secure, parallel, distributed access.

This function uses Unity Catalog’s credential vending feature, which grants temporary, least-privilege credentials for the cloud storage location backing the requested table or data files. It authenticates through the Unity Catalog REST API (see “Unity Catalog credential vending for external system access” in the Databricks docs), so permissions are enforced for the Databricks principal (user, group, or service principal) making the request. The function can read data directly from AWS S3, Azure Data Lake Storage (ADLS), or Google Cloud Storage (GCS) in standard formats including Delta and Parquet.

Note

This function is experimental and under active development.

Examples

Read a Unity Catalog Delta table:

>>> import ray
>>> ds = ray.data.read_unity_catalog(  
...     table="main.sales.transactions",
...     url="https://dbc-XXXXXXX-XXXX.cloud.databricks.com",
...     token="dapi...",
...     region="us-west-2"
... )
>>> ds.show(3)  

Read using a custom credential provider:

>>> from ray.data._internal.datasource.databricks_credentials import (  
...     StaticCredentialProvider,
... )
>>> provider = StaticCredentialProvider(  
...     token="dapi...",
...     host="https://dbc-XXXXXXX-XXXX.cloud.databricks.com",
... )
>>> ds = ray.data.read_unity_catalog(  
...     table="main.sales.transactions",
...     credential_provider=provider,
...     region="us-west-2"
... )
Parameters:
  • table – Unity Catalog table path in the format catalog.schema.table.

  • url – Databricks workspace URL (e.g., "https://dbc-XXXXXXX-XXXX.cloud.databricks.com"). Required if credential_provider is not specified. Prefer credential_provider over the url and token parameters; this parameter will be deprecated in a future release.

  • token – Databricks personal access token with the EXTERNAL USE SCHEMA permission. Required if credential_provider is not specified. Prefer credential_provider over the url and token parameters; this parameter will be deprecated in a future release.

  • credential_provider – (Optional) A custom credential provider for authentication. Must be a subclass of DatabricksCredentialProvider that implements get_token(), get_host(), and invalidate(); see the sketch after this parameter list. The provider must be picklable (serializable) because it is sent to Ray workers for distributed execution. If provided, the provider is used exclusively and the url and token parameters are ignored.

  • data_format – Data format ("delta" or "parquet"). If not specified, inferred from table metadata.

  • region – AWS region for S3 access (e.g., "us-west-2"). Required for AWS; not needed for Azure or GCP.

  • reader_kwargs – Additional arguments passed to the underlying Ray Data reader.
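
A minimal sketch of a custom credential provider follows. It assumes DatabricksCredentialProvider can be imported from the same internal module as StaticCredentialProvider in the example above; the environment variable names are illustrative.

>>> import os
>>> from ray.data._internal.datasource.databricks_credentials import (
...     DatabricksCredentialProvider,
... )
>>> class EnvCredentialProvider(DatabricksCredentialProvider):
...     # Reads credentials from environment variables on each worker.
...     # The instance holds no state, so it pickles cleanly.
...     def get_token(self) -> str:
...         return os.environ["DATABRICKS_TOKEN"]
...     def get_host(self) -> str:
...         return os.environ["DATABRICKS_HOST"]
...     def invalidate(self) -> None:
...         pass  # no cached credentials to discard
>>> ds = ray.data.read_unity_catalog(
...     table="main.sales.transactions",
...     credential_provider=EnvCredentialProvider(),
...     region="us-west-2",
... )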

Returns:

A Dataset containing the data from Unity Catalog.

PublicAPI (alpha): This API is in alpha and may change before becoming stable.