ray.data.read_databricks_tables#

ray.data.read_databricks_tables(*, warehouse_id: str, table: str | None = None, query: str | None = None, catalog: str | None = None, schema: str | None = None, parallelism: int = -1, ray_remote_args: Dict[str, Any] | None = None, concurrency: int | None = None, override_num_blocks: int | None = None) Dataset[source]#

Read a Databricks unity catalog table or Databricks SQL execution result.

Before calling this API, set the DATABRICKS_TOKEN environment variable to your Databricks warehouse access token.

export DATABRICKS_TOKEN=...

If you’re not running your program on the Databricks runtime, also set the DATABRICKS_HOST environment variable.

export DATABRICKS_HOST=adb-<workspace-id>.<random-number>.azuredatabricks.net

Note

This function is built on the Databricks statement execution API.

Examples

import ray

ds = ray.data.read_databricks_tables(
    warehouse_id='...',
    catalog='catalog_1',
    schema='db_1',
    query='select id from table_1 limit 750000',
)
Parameters:
  • warehouse_id – The ID of the Databricks warehouse. The query statement is executed on this warehouse.

  • table – The name of UC table you want to read. If this argument is set, you can’t set query argument, and the reader generates query of select * from {table_name} under the hood.

  • query – The query you want to execute. If this argument is set, you can’t set table_name argument.

  • catalog – (Optional) The default catalog name used by the query.

  • schema – (Optional) The default schema used by the query.

  • parallelism – This argument is deprecated. Use override_num_blocks argument.

  • ray_remote_args – kwargs passed to ray.remote() in the read tasks.

  • concurrency – The maximum number of Ray tasks to run concurrently. Set this to control number of tasks to run concurrently. This doesn’t change the total number of tasks run or the total number of output blocks. By default, concurrency is dynamically decided based on the available resources.

  • override_num_blocks – Override the number of output blocks from all read tasks. By default, the number of output blocks is dynamically decided based on input data size and available resources. You shouldn’t manually set this value in most cases.

Returns:

A Dataset containing the queried data.

PublicAPI (alpha): This API is in alpha and may change before becoming stable.