ray.data.read_bigquery

ray.data.read_bigquery(project_id: str, dataset: str | None = None, query: str | None = None, *, parallelism: int = -1, ray_remote_args: Dict[str, Any] = None) → Dataset

Create a dataset from BigQuery.

The data to read is specified via the project_id, dataset, and/or query parameters. If a query is provided, the dataset is created from the results of executing that query. Otherwise, the entire dataset is read.

For more information about BigQuery, see the Google Cloud documentation on projects, datasets, and query syntax.

This method uses the BigQuery Storage Read API, which reads in parallel, with one Ray read task handling each stream. The number of streams is determined by parallelism, which can be requested through this interface or chosen automatically if unspecified (see the parallelism argument below).
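As a rough sketch of requesting a specific parallelism, the snippet below prepares a read of a whole table with eight requested read tasks. The project and table names are placeholders, and `read_kwargs` and `read_table` are hypothetical helpers, not part of Ray; the range check on parallelism is the helper's own assumption. The actual call needs Ray installed and Google Cloud credentials, so it is wrapped in a function rather than run at import time.

```python
from typing import Any, Dict


def read_kwargs(project_id: str, dataset: str, parallelism: int = -1) -> Dict[str, Any]:
    """Hypothetical helper: collect arguments for ray.data.read_bigquery."""
    # -1 asks Ray to choose; otherwise request a positive number of read tasks.
    # This check belongs to the helper and is an assumption, not Ray behavior.
    if parallelism != -1 and parallelism < 1:
        raise ValueError("parallelism must be -1 or a positive integer")
    return {"project_id": project_id, "dataset": dataset, "parallelism": parallelism}


def read_table(project_id: str, dataset: str, parallelism: int = -1):
    """Read an entire BigQuery table into a Ray Dataset."""
    import ray  # lazy import: needs ray installed and Google Cloud credentials

    return ray.data.read_bigquery(**read_kwargs(project_id, dataset, parallelism))


# Placeholder names; replace with a real project and dataset_id.table_id.
kwargs = read_kwargs("my_project", "my_dataset.my_table", parallelism=8)
```

Calling `read_table("my_project", "my_dataset.my_table", parallelism=8)` would then issue the read. Leaving parallelism at -1 lets Ray pick a value.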

Warning

The maximum query response size is 10 GB. For more information, see BigQuery response too large to return.

Examples

import ray
# Authenticate with Google Cloud beforehand (https://cloud.google.com/sdk/gcloud/reference/auth/login)
ds = ray.data.read_bigquery(
    project_id="my_project",
    query="SELECT * FROM `bigquery-public-data.samples.gsod` LIMIT 1000",
)
Parameters:
  • project_id – The name of the associated Google Cloud Project that hosts the dataset to read. For more information, see Creating and Managing Projects.

  • dataset – The name of the dataset hosted in BigQuery, in the format dataset_id.table_id. Both the dataset_id and the table_id must exist; otherwise, an exception is raised.

  • query – The query to execute. If a query is provided, the dataset is created from the results of executing it; otherwise, the entire dataset is read.

  • parallelism – The requested parallelism of the read. If -1, the parallelism is chosen automatically based on the available cluster resources and the estimated in-memory data size.

  • ray_remote_args – kwargs passed to ray.remote in the read tasks.
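To illustrate the dataset format and ray_remote_args described above, here is a minimal sketch. `split_table_ref` and `read_with_remote_args` are hypothetical helpers, not part of Ray, and the names are placeholders. The local check only catches a malformed string; whether the dataset and table actually exist is still decided by BigQuery at read time.

```python
from typing import Tuple


def split_table_ref(dataset: str) -> Tuple[str, str]:
    """Hypothetical helper: split 'dataset_id.table_id' into its two parts."""
    parts = dataset.split(".")
    # Require exactly two non-empty components.
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected 'dataset_id.table_id', got {dataset!r}")
    return parts[0], parts[1]


def read_with_remote_args(project_id: str, dataset: str):
    """Check the dataset string locally, then read the table."""
    import ray  # lazy import: needs ray installed and Google Cloud credentials

    split_table_ref(dataset)  # fail fast on a malformed name
    return ray.data.read_bigquery(
        project_id=project_id,
        dataset=dataset,
        # Passed through to ray.remote for the read tasks.
        ray_remote_args={"num_cpus": 1},
    )


# Placeholder table reference.
dataset_id, table_id = split_table_ref("my_dataset.my_table")
```

Here `{"num_cpus": 1}` is only an example value; any option accepted by ray.remote could be supplied instead.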

Returns:

Dataset producing rows from the results of executing the query (or reading the entire dataset) on the specified BigQuery dataset.

PublicAPI (alpha): This API is in alpha and may change before becoming stable.