ray.data.read_bigquery#
- ray.data.read_bigquery(project_id: str, dataset: str | None = None, query: str | None = None, *, parallelism: int = -1, ray_remote_args: Dict[str, Any] = None, concurrency: int | None = None, override_num_blocks: int | None = None) Dataset [source]#
Create a dataset from BigQuery.
The data to read from is specified via the project_id, dataset, and/or query parameters. The dataset is created from the results of executing query if a query is provided. Otherwise, the entire dataset is read.
For more information about BigQuery, see the following concepts:
Project id: Creating and Managing Projects
Dataset: Datasets Intro
Query: Query Syntax
This method uses the BigQuery Storage Read API, which reads in parallel, with a Ray read task to handle each stream. The number of streams is determined by parallelism, which can be requested from this interface or automatically chosen if unspecified (see the parallelism arg below).
Warning
The maximum query response size is 10GB. For more information, see BigQuery response too large to return.
Examples
import ray

# Users will need to authenticate beforehand
# (https://cloud.google.com/sdk/gcloud/reference/auth/login)
ds = ray.data.read_bigquery(
    project_id="my_project",
    query="SELECT * FROM `bigquery-public-data.samples.gsod` LIMIT 1000",
)
- Parameters:
project_id – The name of the associated Google Cloud Project that hosts the dataset to read. For more information, see Creating and Managing Projects.
dataset – The name of the dataset hosted in BigQuery in the format of dataset_id.table_id. Both the dataset_id and table_id must exist; otherwise, an exception is raised.
parallelism – This argument is deprecated. Use the override_num_blocks argument instead.
ray_remote_args – kwargs passed to ray.remote in the read tasks.
concurrency – The maximum number of Ray tasks to run concurrently. This doesn’t change the total number of tasks run or the total number of output blocks. By default, concurrency is dynamically decided based on the available resources.
override_num_blocks – Override the number of output blocks from all read tasks. By default, the number of output blocks is dynamically decided based on input data size and available resources. You shouldn’t manually set this value in most cases.
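Because dataset must follow the dataset_id.table_id format described above, it can help to validate the string before calling read_bigquery. The helper below is a hypothetical sketch (split_bigquery_dataset is not part of the Ray API); it only mirrors the format check implied by the parameter description.

```python
def split_bigquery_dataset(dataset: str) -> tuple:
    """Split a `dataset_id.table_id` string into its two components.

    Hypothetical helper, not part of the Ray API: raises ValueError when
    the string is not exactly `dataset_id.table_id`, mirroring the format
    expected by the `dataset` parameter of ray.data.read_bigquery.
    """
    parts = dataset.split(".")
    if len(parts) != 2 or not all(parts):
        raise ValueError(
            f"dataset must be in 'dataset_id.table_id' format, got {dataset!r}"
        )
    dataset_id, table_id = parts
    return dataset_id, table_id


# Example: "samples.gsod" splits into ("samples", "gsod"),
# while "samples" or "a.b.c" raise ValueError.
```

Note that this check only catches malformed strings; whether the dataset_id and table_id actually exist is verified by BigQuery itself at read time.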
- Returns:
Dataset producing rows from the results of executing the query (or reading the entire dataset) on the specified BigQuery dataset.
PublicAPI (alpha): This API is in alpha and may change before becoming stable.