ray.data.Dataset.write_bigquery

Dataset.write_bigquery(project_id: str, dataset: str, max_retry_cnt: int = 10, overwrite_table: bool | None = True, ray_remote_args: Dict[str, Any] = None, concurrency: int | None = None) -> None

Write the dataset to a BigQuery dataset table.

To control the number of parallel write tasks, use .repartition() before calling this method.

Note

This operation will trigger execution of the lazy transformations performed on this dataset.

Examples

import ray
import pandas as pd

# Build a small four-row dataset from a pandas DataFrame.
docs = [{"title": "BigQuery Datasource test"} for _ in range(4)]
ds = ray.data.from_pandas(pd.DataFrame(docs))

# `dataset` uses the dataset_id.table_id format.
ds.write_bigquery(
    project_id="my_project_id",
    dataset="my_dataset.my_table",
    overwrite_table=True,
)
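
The combination of .repartition() and append mode can be sketched as a small wrapper. This is a minimal sketch, not part of Ray: the helper name `append_to_bigquery` and its `num_write_tasks` parameter are hypothetical, and `ds` is assumed to be a ray.data.Dataset (or any object exposing the same `repartition` and `write_bigquery` methods).

```python
from typing import Any


def append_to_bigquery(
    ds: Any,
    project_id: str,
    dataset: str,
    num_write_tasks: int,
) -> None:
    """Append `ds` to a BigQuery table using a fixed number of blocks.

    Hypothetical helper for illustration only; `ds` is assumed to be a
    ray.data.Dataset.
    """
    # Repartition first: as stated above, this is how the number of
    # parallel write tasks is controlled.
    repartitioned = ds.repartition(num_write_tasks)

    # overwrite_table=False appends to the table if it already exists.
    repartitioned.write_bigquery(
        project_id=project_id,
        dataset=dataset,
        overwrite_table=False,
    )
```

With a real dataset, a call might look like `append_to_bigquery(ds, "my_project_id", "my_dataset.my_table", num_write_tasks=8)`. Like any write, it triggers execution of the dataset's lazy transformations.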
Parameters:
  • project_id – The name of the associated Google Cloud Project that hosts the dataset to write to. For more information, see Creating and managing projects.

  • dataset – The name of the dataset in the format of dataset_id.table_id. The dataset is created if it doesn’t already exist.

  • max_retry_cnt – The maximum number of times an individual block write is retried after a BigQuery rate limiting error. This isn’t related to Ray fault tolerance retries. The default is 10.

  • overwrite_table – Whether to overwrite the table if it already exists. The default is True (overwrite). Set overwrite_table=False to append to the table if it exists.

  • ray_remote_args – Kwargs passed to ray.remote() in the write tasks.

  • concurrency – The maximum number of Ray tasks to run concurrently. This limits how many tasks run at the same time; it doesn’t change the total number of tasks run. By default, concurrency is dynamically decided based on the available resources.
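
Because `dataset` must follow the dataset_id.table_id format described above, it can be worth checking the string before launching a write. The following is a minimal sketch: the helper `split_dataset_table` is hypothetical and not part of Ray, and it assumes a valid value contains exactly one dot separating two non-empty parts.

```python
from typing import Tuple


def split_dataset_table(dataset: str) -> Tuple[str, str]:
    """Split 'dataset_id.table_id' into its two parts (hypothetical helper)."""
    dataset_id, sep, table_id = dataset.partition(".")
    # Assumption: exactly one dot, with a non-empty name on each side.
    if not sep or not dataset_id or not table_id or "." in table_id:
        raise ValueError(f"expected 'dataset_id.table_id', got {dataset!r}")
    return dataset_id, table_id
```

Calling `split_dataset_table("my_dataset.my_table")` returns the two identifiers, while a value without a dot raises `ValueError` before any write task is started.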