ray.data.Dataset.write_iceberg
Dataset.write_iceberg(table_identifier: str, catalog_kwargs: Dict[str, Any] | None = None, snapshot_properties: Dict[str, str] | None = None, ray_remote_args: Dict[str, Any] | None = None, concurrency: int | None = None) -> None
Writes the Dataset to an Iceberg table.

Tip
For more details on PyIceberg, see https://py.iceberg.apache.org/
Note
This operation will trigger execution of the lazy transformations performed on this dataset.
Examples

```python
import ray
import pandas as pd

docs = [{"title": "Iceberg data sink test"} for key in range(4)]
ds = ray.data.from_pandas(pd.DataFrame(docs))
ds.write_iceberg(
    table_identifier="db_name.table_name",
    catalog_kwargs={"name": "default", "type": "sql"},
)
```
- Parameters:
  - table_identifier – Fully qualified table identifier (db_name.table_name).
  - catalog_kwargs – Optional arguments to pass to PyIceberg's catalog.load_catalog() function (e.g., name, type, etc.). For the function definition, see the PyIceberg catalog documentation.
  - snapshot_properties – Custom properties to write to the snapshot when committing to an Iceberg table.
  - ray_remote_args – kwargs passed to ray.remote() in the write tasks.
  - concurrency – The maximum number of Ray tasks to run concurrently. This doesn't change the total number of tasks run. By default, concurrency is decided dynamically based on the available resources.
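As a sketch of how the optional parameters might be assembled before the write call (the catalog uri, the snapshot property values, and the table identifier below are illustrative assumptions, not Ray or PyIceberg defaults):

```python
# Illustrative configuration for write_iceberg's optional parameters.
# The "uri" value is an assumed local SQL catalog backing store.
catalog_kwargs = {
    "name": "default",
    "type": "sql",
    "uri": "sqlite:///iceberg_catalog.db",
}

# Arbitrary metadata recorded on the Iceberg snapshot at commit time.
snapshot_properties = {"written_by": "ray-data-example"}

# These dicts would then be passed through to the write, e.g.:
# ds.write_iceberg(
#     table_identifier="db_name.table_name",
#     catalog_kwargs=catalog_kwargs,
#     snapshot_properties=snapshot_properties,
#     concurrency=2,  # cap the number of concurrent write tasks
# )
```

Keeping the catalog and snapshot settings in named dicts makes it easier to reuse them across multiple writes against the same catalog.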
PublicAPI (alpha): This API is in alpha and may change before becoming stable.