ray.data.Dataset.write_bigquery
- Dataset.write_bigquery(project_id: str, dataset: str, max_retry_cnt: int = 10, overwrite_table: bool | None = True, ray_remote_args: Dict[str, Any] = None, concurrency: int | None = None) -> None
Write the dataset to a BigQuery dataset table.
To control the number of parallel write tasks, use .repartition() before calling this method.
Note
This operation will trigger execution of the lazy transformations performed on this dataset.
Examples
    import ray
    import pandas as pd

    docs = [{"title": "BigQuery Datasource test"} for key in range(4)]
    ds = ray.data.from_pandas(pd.DataFrame(docs))
    ds.write_bigquery(
        project_id="my_project_id",
        dataset="my_dataset_table",
        overwrite_table=True
    )
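The advice above about calling .repartition() first can be sketched as a small helper. This is a minimal illustration, not part of the Ray API: the helper name `write_with_parallelism` and its `num_write_tasks` argument are hypothetical, and `ds` is assumed to be a ray.data.Dataset.

```python
def write_with_parallelism(ds, num_write_tasks, project_id, dataset):
    """Repartition `ds`, then write it to BigQuery.

    Hypothetical helper: `ds` is assumed to be a ray.data.Dataset.
    """
    if num_write_tasks < 1:
        raise ValueError("num_write_tasks must be at least 1")
    # Repartitioning first is how the documentation suggests controlling
    # the number of parallel write tasks.
    repartitioned = ds.repartition(num_write_tasks)
    # write_bigquery triggers execution of the lazy transformations.
    repartitioned.write_bigquery(project_id=project_id, dataset=dataset)
```

A call such as `write_with_parallelism(ds, 8, "my_project_id", "my_dataset.my_table")` would then require valid Google Cloud credentials in the environment.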
- Parameters:
project_id – The name of the associated Google Cloud Project that hosts the dataset to write to. For more information, see Creating and managing projects.
dataset – The name of the dataset in the format dataset_id.table_id. The dataset is created if it doesn’t already exist.
max_retry_cnt – The maximum number of times an individual block write is retried due to BigQuery rate limiting errors. This isn’t related to Ray fault tolerance retries. The default is 10.
overwrite_table – Whether the write overwrites the table if it already exists. The default behavior is to overwrite the table. overwrite_table=False appends to the table if it exists.
ray_remote_args – Keyword arguments passed to ray.remote() in the write tasks.
concurrency – The maximum number of Ray tasks to run concurrently. This doesn’t change the total number of tasks run. By default, concurrency is decided dynamically based on the available resources.
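The parameters above can be combined to append to an existing table with bounded concurrency. The following is a sketch under stated assumptions: `split_table_ref` and `append_to_table` are hypothetical helpers, the format check simply assumes that neither part of dataset_id.table_id contains a dot, and the specific values (5 retries, 2 concurrent tasks, 1 CPU per task) are illustrative only.

```python
def split_table_ref(dataset):
    """Split 'dataset_id.table_id' into its two parts (hypothetical helper)."""
    parts = dataset.split(".")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected 'dataset_id.table_id', got {dataset!r}")
    return parts[0], parts[1]


def append_to_table(ds, project_id, dataset, max_concurrent_tasks=2):
    """Append `ds` (assumed to be a ray.data.Dataset) to a BigQuery table."""
    # Fail early on a malformed reference, before any work is scheduled.
    split_table_ref(dataset)
    ds.write_bigquery(
        project_id=project_id,
        dataset=dataset,
        overwrite_table=False,  # append instead of overwriting
        max_retry_cnt=5,  # retries for BigQuery rate limiting errors
        ray_remote_args={"num_cpus": 1},  # forwarded to ray.remote()
        concurrency=max_concurrent_tasks,  # cap on concurrent tasks
    )
```

Validating the reference up front keeps a typo in `dataset` from surfacing only after the write tasks have started.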