ray.data.read_zarr#

ray.data.read_zarr(path: str, *, filesystem: pyarrow.fs.FileSystem | fsspec.spec.AbstractFileSystem | None = None, chunk_shapes: dict[str, list] | list | None = None, array_paths: list[str] | None = None, allow_full_metadata_scan: bool = False, align_axis_0: bool = False, overlap: int = 0, concurrency: int | None = None, override_num_blocks: int | None = None, num_cpus: float | None = None, num_gpus: float | None = None, memory: float | None = None, ray_remote_args: Dict[str, Any] | None = None)[source]#

Creates a Dataset from a Zarr v2 store.

By default each row is one chunk of one array (long-form), with columns array, chunk_index, chunk_slices, and chunk. With align_axis_0=True, each row is one axis-0 chunk with t_start, t_stop, and one column per selected array (wide-form), for arrays that share shape[0].

For the output schemas, chunk re-tiling, aligned and sliding-window reads, metadata discovery, custom codecs, and cloud-storage setup, see Working with Zarr.

Note

In long-form the chunk column is a tensor, and tensors of different rank or dtype can’t be combined into one batch. Consume long-form per array (filter on the array column first), or, when arrays are row-aligned (share shape[0]), use align_axis_0=True so each array is its own column – which is batch-safe.

Examples

Read every array at its native chunking (long-form, one row per chunk):

>>> import ray
>>> ds = ray.data.read_zarr(  
...     "s3://anonymous@ray-example-data/mnist-tiny.zarr",
... )

Aligned read – paired (images, labels) per row; align_axis_0 requires all selected arrays to share shape[0]:

>>> ds = ray.data.read_zarr(  
...     "s3://anonymous@ray-example-data/mnist-tiny.zarr",
...     align_axis_0=True,
...     chunk_shapes=[50],
... )

Per-array chunk overrides – re-tile only the selected arrays:

>>> ds = ray.data.read_zarr(  
...     "s3://anonymous@ray-example-data/mnist-tiny.zarr",
...     chunk_shapes={"images": [50], "labels": [50]},
... )
Parameters:
  • path – Path to the Zarr v2 store.

  • filesystem – The filesystem implementation to read from. PyArrow filesystems are specified in the pyarrow docs. Specify this parameter if you need to provide specific configurations to the filesystem. By default, the filesystem is automatically selected based on the scheme of the paths. For example, if the path begins with s3://, the S3FileSystem is used. Also acceptsan fsspec.spec.AbstractFileSystem. pyarrow filesystems are wrapped internally with fsspec.implementations.arrow.ArrowFSWrapper

  • chunk_shapes – Optional re-tiling of the leading chunk axes at read time (see Working with Zarr). Either a sequence applied as a shared prefix across all selected arrays (trailing axes keep native chunks), or a dict of per-array prefixes (arrays absent from it keep native chunks). An override may not exceed its target array’s rank. Defaults to native chunks.

  • array_paths – Optional list of array paths within the Zarr store to read. If unspecified, all arrays discovered in the store are included.

  • allow_full_metadata_scan – If True, recursively scan the store for .zarray files when array_paths is unspecified and .zmetadata is missing. This may be slow or expensive for large remote stores, so it is disabled by default.

  • align_axis_0 – If True, emit the wide-form schema: one row per axis-0 chunk with one column per selected array, plus t_start and t_stop columns naming the global axis-0 range. All selected arrays must share shape[0] and resolve to the same effective axis-0 chunk size after chunk_shapes resolution. Defaults to False (long-form, one chunk per row).

  • overlap – The number of additional axis-0 timesteps to extend each row’s per-array data forward by, clipped at the store end, for sliding-window pipelines. Only valid with align_axis_0=True. Defaults to 0. See Working with Zarr.

  • concurrency – The maximum number of Ray tasks to run concurrently. Set this to control number of tasks to run concurrently. This doesn’t change the total number of tasks run or the total number of output blocks. By default, concurrency is dynamically decided based on the available resources.

  • override_num_blocks – Override the number of output blocks from all read tasks. By default, the number of output blocks is dynamically decided based on input data size and available resources. You shouldn’t manually set this value in most cases.

  • num_cpus – The number of CPUs to reserve for each parallel read worker.

  • num_gpus – The number of GPUs to reserve for each parallel read worker. For example, specify num_gpus=1 to request 1 GPU for each parallel read worker.

  • memory – The heap memory in bytes to reserve for each parallel read worker.

  • ray_remote_args – kwargs passed to remote() in the read tasks.

Returns:

A Dataset of long-form chunk rows by default (array, chunk_index, chunk_slices, chunk), or wide-form aligned rows (t_start, t_stop, plus one column per aligned array) when align_axis_0 is set.

PublicAPI (alpha): This API is in alpha and may change before becoming stable.