ray.data.read_iceberg
- ray.data.read_iceberg(*, table_identifier: str, row_filter: str | BooleanExpression = None, parallelism: int = -1, selected_fields: Tuple[str, ...] = ('*',), snapshot_id: int | None = None, scan_kwargs: Dict[str, str] | None = None, catalog_kwargs: Dict[str, str] | None = None, ray_remote_args: Dict[str, Any] | None = None, override_num_blocks: int | None = None) -> Dataset
Create a Dataset from an Iceberg table.

The table to read from is specified using a fully qualified table_identifier. Using PyIceberg, any requested row filters, field selections, and snapshot ID are applied, and the files that satisfy the query are distributed across Ray read tasks. The number of output blocks is determined by override_num_blocks, which can be requested from this interface or chosen automatically if unspecified.

Tip

For more details on PyIceberg, see https://py.iceberg.apache.org/
Examples

>>> import ray
>>> from pyiceberg.expressions import EqualTo
>>> ds = ray.data.read_iceberg(
...     table_identifier="db_name.table_name",
...     row_filter=EqualTo("column_name", "literal_value"),
...     catalog_kwargs={"name": "default", "type": "glue"}
... )
- Parameters:
  - table_identifier – Fully qualified table identifier (db_name.table_name).
  - row_filter – A PyIceberg BooleanExpression used to filter the data prior to reading.
  - parallelism – This argument is deprecated. Use the override_num_blocks argument instead.
  - selected_fields – Which columns from the data to read, passed directly to PyIceberg's load functions. Should be a tuple of string column names.
  - snapshot_id – Optional snapshot ID for the Iceberg table; by default the latest snapshot is used.
  - scan_kwargs – Optional arguments to pass to PyIceberg's Table.scan() function (e.g., case_sensitive, limit, etc.).
  - catalog_kwargs – Optional arguments to pass to PyIceberg's catalog.load_catalog() function (e.g., name, type, etc.). For the function definition, see the PyIceberg catalog documentation.
  - ray_remote_args – Optional arguments to pass to ray.remote in the read tasks.
  - override_num_blocks – Override the number of output blocks from all read tasks. By default, the number of output blocks is dynamically decided based on input data size and available resources, and capped at the number of physical files to be read. You shouldn't manually set this value in most cases.
- Returns:
  Dataset with rows from the Iceberg table.
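Note that row_filter also accepts a plain string predicate, and that scan_kwargs and catalog_kwargs are ordinary dictionaries forwarded verbatim to PyIceberg. A minimal sketch of assembling these arguments is shown below; the table identifier, column name, and catalog settings are illustrative placeholders, and the read call itself is commented out because it requires a reachable Iceberg catalog.

```python
# Sketch: assembling read_iceberg keyword arguments. All names here
# (table, column, catalog) are hypothetical placeholders.
row_filter = "column_name >= 'literal_value'"         # string predicates are accepted
scan_kwargs = {"case_sensitive": False}               # forwarded to Table.scan()
catalog_kwargs = {"name": "default", "type": "glue"}  # forwarded to load_catalog()

kwargs = dict(
    table_identifier="db_name.table_name",
    row_filter=row_filter,
    snapshot_id=None,          # None means the latest snapshot is used
    scan_kwargs=scan_kwargs,
    catalog_kwargs=catalog_kwargs,
)

# ds = ray.data.read_iceberg(**kwargs)  # requires ray and pyiceberg installed
```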