ray.data.read_databricks_tables

ray.data.read_databricks_tables(*, warehouse_id: str, table: str | None = None, query: str | None = None, catalog: str | None = None, schema: str | None = None, parallelism: int = -1, ray_remote_args: Dict[str, Any] | None = None) -> Dataset

Read a Databricks Unity Catalog table or the result of a Databricks SQL query.

Before calling this API, set the DATABRICKS_TOKEN environment variable to your Databricks warehouse access token.

export DATABRICKS_TOKEN=...

If you’re not running your program on the Databricks runtime, also set the DATABRICKS_HOST environment variable.

export DATABRICKS_HOST=adb-<workspace-id>.<random-number>.azuredatabricks.net
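If it's more convenient, you can also set these variables from Python before calling the reader. This is a minimal sketch, assuming the reader picks the variables up from os.environ at call time; the values shown are placeholders you must replace.

import os

# Placeholder values; substitute your real access token and workspace host.
os.environ['DATABRICKS_TOKEN'] = '...'
os.environ['DATABRICKS_HOST'] = 'adb-<workspace-id>.<random-number>.azuredatabricks.net'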

Note

This function is built on the Databricks statement execution API.

Examples

import ray

ds = ray.data.read_databricks_tables(
    warehouse_id='...',
    catalog='catalog_1',
    schema='db_1',
    query='select id from table_1 limit 750000',
)
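
The example above executes an explicit SQL query. As a second sketch, you can instead read an entire table by name (the table and query arguments are mutually exclusive); the warehouse ID, catalog, schema, and table names below are placeholders, and the ray_remote_args value is only an illustration.

import ray

ds = ray.data.read_databricks_tables(
    warehouse_id='...',
    catalog='catalog_1',
    schema='db_1',
    # Reads the whole table; the reader generates
    # 'select * from table_1' under the hood.
    table='table_1',
    # Optional: kwargs forwarded to the read tasks,
    # for example to reserve one CPU per task.
    ray_remote_args={'num_cpus': 1},
)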
Parameters:
  • warehouse_id – The ID of the Databricks warehouse. The query statement is executed on this warehouse.

  • table – The name of the Unity Catalog table you want to read. If this argument is set, you can’t set the query argument, and the reader generates a query of select * from {table_name} under the hood.

  • query – The query you want to execute. If this argument is set, you can’t set the table argument.

  • catalog – (Optional) The default catalog name used by the query.

  • schema – (Optional) The default schema used by the query.

  • parallelism – The requested parallelism of the read. Defaults to -1, which automatically determines the optimal parallelism for your configuration. You should not need to manually set this value in most cases. For details on how the parallelism is automatically determined and guidance on how to tune it, see Tuning read parallelism.

  • ray_remote_args – kwargs passed to remote() in the read tasks.

Returns:

A Dataset containing the queried data.

PublicAPI (alpha): This API is in alpha and may change before becoming stable.