ray.data.read_iceberg
- ray.data.read_iceberg(*, table_identifier: str, row_filter: str | BooleanExpression = None, parallelism: int = -1, selected_fields: Tuple[str, ...] = ('*',), snapshot_id: int | None = None, scan_kwargs: Dict[str, str] | None = None, catalog_kwargs: Dict[str, str] | None = None, ray_remote_args: Dict[str, Any] | None = None, override_num_blocks: int | None = None) -> Dataset
Create a Dataset from an Iceberg table.

The table to read from is specified using a fully qualified table_identifier. Using PyIceberg, any requested row filters, field selections, and snapshot ID are applied, and the files that satisfy the query are distributed across Ray read tasks. The number of output blocks is determined by override_num_blocks, which can be requested from this interface or chosen automatically if unspecified.

Tip

For more details on PyIceberg, see https://py.iceberg.apache.org/
Examples

>>> import ray
>>> from pyiceberg.expressions import EqualTo
>>> ds = ray.data.read_iceberg(
...     table_identifier="db_name.table_name",
...     row_filter=EqualTo("column_name", "literal_value"),
...     catalog_kwargs={"name": "default", "type": "glue"}
... )
- Parameters:
  - table_identifier – Fully qualified table identifier (db_name.table_name).
  - row_filter – A PyIceberg BooleanExpression used to filter the data prior to reading.
  - parallelism – This argument is deprecated. Use the override_num_blocks argument instead.
  - selected_fields – Which columns from the data to read, passed directly to PyIceberg's load functions. Should be a tuple of string column names.
  - snapshot_id – Optional snapshot ID for the Iceberg table; by default the latest snapshot is used.
  - scan_kwargs – Optional arguments to pass to PyIceberg's Table.scan() function (e.g., case_sensitive, limit, etc.).
  - catalog_kwargs – Optional arguments to pass to PyIceberg's catalog.load_catalog() function (e.g., name, type, etc.). For the function definition, see the PyIceberg catalog documentation.
  - ray_remote_args – Optional arguments to pass to ray.remote in the read tasks.
  - override_num_blocks – Override the number of output blocks from all read tasks. By default, the number of output blocks is dynamically decided based on input data size and available resources, and capped at the number of physical files to be read. You shouldn't manually set this value in most cases.
- Returns:
  Dataset with rows from the Iceberg table.
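Note that row_filter also accepts a plain string predicate, and that scan_kwargs and catalog_kwargs are ordinary dictionaries forwarded verbatim to PyIceberg. A minimal sketch of assembling these arguments is shown below; the table identifier, column name, and catalog settings are illustrative placeholders, and the read call itself is commented out because it requires a reachable Iceberg catalog.

```python
# Sketch: assembling read_iceberg keyword arguments. All names here
# (table, column, catalog) are hypothetical placeholders.
row_filter = "column_name >= 'literal_value'"         # string predicates are accepted
scan_kwargs = {"case_sensitive": False}               # forwarded to Table.scan()
catalog_kwargs = {"name": "default", "type": "glue"}  # forwarded to load_catalog()

kwargs = dict(
    table_identifier="db_name.table_name",
    row_filter=row_filter,
    snapshot_id=None,          # None means the latest snapshot is used
    scan_kwargs=scan_kwargs,
    catalog_kwargs=catalog_kwargs,
)

# ds = ray.data.read_iceberg(**kwargs)  # requires ray and pyiceberg installed
```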