ray.data.read_bigquery#
- ray.data.read_bigquery(project_id: str, dataset: str | None = None, query: str | None = None, *, parallelism: int = -1, ray_remote_args: Dict[str, Any] = None, concurrency: int | None = None, override_num_blocks: int | None = None) Dataset [source]#
Create a dataset from BigQuery.
The data to read from is specified via the project_id, dataset, and/or query parameters. The dataset is created from the results of executing query if a query is provided. Otherwise, the entire dataset is read.
For more information about BigQuery, see the following concepts:
Project id: Creating and Managing Projects
Dataset: Datasets Intro
Query: Query Syntax
This method uses the BigQuery Storage Read API, which reads in parallel, with a Ray read task to handle each stream. The number of streams is determined by parallelism, which can be requested from this interface or automatically chosen if unspecified (see the parallelism arg below).
Warning
The maximum query response size is 10GB. For more information, see BigQuery response too large to return.
Examples
import ray

# Users will need to authenticate beforehand
# (https://cloud.google.com/sdk/gcloud/reference/auth/login)
ds = ray.data.read_bigquery(
    project_id="my_project",
    query="SELECT * FROM `bigquery-public-data.samples.gsod` LIMIT 1000",
)
- Parameters:
project_id – The name of the associated Google Cloud Project that hosts the dataset to read. For more information, see Creating and Managing Projects.
dataset – The name of the dataset hosted in BigQuery in the format of dataset_id.table_id. Both the dataset_id and table_id must exist; otherwise, an exception is raised.
parallelism – This argument is deprecated. Use the override_num_blocks argument instead.
ray_remote_args – kwargs passed to ray.remote in the read tasks.
concurrency – The maximum number of Ray tasks to run concurrently. This doesn’t change the total number of tasks run or the total number of output blocks. By default, concurrency is dynamically decided based on the available resources.
override_num_blocks – Override the number of output blocks from all read tasks. By default, the number of output blocks is dynamically decided based on input data size and available resources. You shouldn’t manually set this value in most cases.
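Because dataset must follow the dataset_id.table_id format described above, it can help to validate the string before calling read_bigquery. The helper below is a hypothetical sketch (split_bigquery_dataset is not part of the Ray API); it only mirrors the format check implied by the parameter description.

```python
def split_bigquery_dataset(dataset: str) -> tuple:
    """Split a `dataset_id.table_id` string into its two components.

    Hypothetical helper, not part of the Ray API: raises ValueError when
    the string is not exactly `dataset_id.table_id`, mirroring the format
    expected by the `dataset` parameter of ray.data.read_bigquery.
    """
    parts = dataset.split(".")
    if len(parts) != 2 or not all(parts):
        raise ValueError(
            f"dataset must be in 'dataset_id.table_id' format, got {dataset!r}"
        )
    dataset_id, table_id = parts
    return dataset_id, table_id


# Example: "samples.gsod" splits into ("samples", "gsod"),
# while "samples" or "a.b.c" raise ValueError.
```

Note that this check only catches malformed strings; whether the dataset_id and table_id actually exist is verified by BigQuery itself at read time.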
- Returns:
Dataset producing rows from the results of executing the query (or reading the entire dataset) on the specified BigQuery dataset.
PublicAPI (alpha): This API is in alpha and may change before becoming stable.