ray.data.read_zarr#
- ray.data.read_zarr(path: str, *, filesystem: pyarrow.fs.FileSystem | fsspec.spec.AbstractFileSystem | None = None, chunk_shapes: dict[str, list] | list | None = None, array_paths: list[str] | None = None, allow_full_metadata_scan: bool = False, align_axis_0: bool = False, overlap: int = 0, concurrency: int | None = None, override_num_blocks: int | None = None, num_cpus: float | None = None, num_gpus: float | None = None, memory: float | None = None, ray_remote_args: Dict[str, Any] | None = None)[source]#
Creates a
Datasetfrom a Zarr v2 store.By default each row is one chunk of one array (long-form), with columns
array,chunk_index,chunk_slices, andchunk. Withalign_axis_0=True, each row is one axis-0 chunk witht_start,t_stop, and one column per selected array (wide-form), for arrays that shareshape[0].For the output schemas, chunk re-tiling, aligned and sliding-window reads, metadata discovery, custom codecs, and cloud-storage setup, see Working with Zarr.
Note
In long-form the
chunkcolumn is a tensor, and tensors of different rank or dtype can’t be combined into one batch. Consume long-form per array (filter on thearraycolumn first), or, when arrays are row-aligned (shareshape[0]), usealign_axis_0=Trueso each array is its own column – which is batch-safe.Examples
Read every array at its native chunking (long-form, one row per chunk):
>>> import ray >>> ds = ray.data.read_zarr( ... "s3://anonymous@ray-example-data/mnist-tiny.zarr", ... )
Aligned read – paired
(images, labels)per row;align_axis_0requires all selected arrays to shareshape[0]:>>> ds = ray.data.read_zarr( ... "s3://anonymous@ray-example-data/mnist-tiny.zarr", ... align_axis_0=True, ... chunk_shapes=[50], ... )
Per-array chunk overrides – re-tile only the selected arrays:
>>> ds = ray.data.read_zarr( ... "s3://anonymous@ray-example-data/mnist-tiny.zarr", ... chunk_shapes={"images": [50], "labels": [50]}, ... )
- Parameters:
path – Path to the Zarr v2 store.
filesystem – The filesystem implementation to read from. PyArrow filesystems are specified in the pyarrow docs. Specify this parameter if you need to provide specific configurations to the filesystem. By default, the filesystem is automatically selected based on the scheme of the paths. For example, if the path begins with
s3://, theS3FileSystemis used. Also acceptsanfsspec.spec.AbstractFileSystem. pyarrow filesystems are wrapped internally withfsspec.implementations.arrow.ArrowFSWrapperchunk_shapes – Optional re-tiling of the leading chunk axes at read time (see Working with Zarr). Either a sequence applied as a shared prefix across all selected arrays (trailing axes keep native chunks), or a dict of per-array prefixes (arrays absent from it keep native chunks). An override may not exceed its target array’s rank. Defaults to native chunks.
array_paths – Optional list of array paths within the Zarr store to read. If unspecified, all arrays discovered in the store are included.
allow_full_metadata_scan – If
True, recursively scan the store for.zarrayfiles whenarray_pathsis unspecified and.zmetadatais missing. This may be slow or expensive for large remote stores, so it is disabled by default.align_axis_0 – If
True, emit the wide-form schema: one row per axis-0 chunk with one column per selected array, plust_startandt_stopcolumns naming the global axis-0 range. All selected arrays must shareshape[0]and resolve to the same effective axis-0 chunk size afterchunk_shapesresolution. Defaults toFalse(long-form, one chunk per row).overlap – The number of additional axis-0 timesteps to extend each row’s per-array data forward by, clipped at the store end, for sliding-window pipelines. Only valid with
align_axis_0=True. Defaults to0. See Working with Zarr.concurrency – The maximum number of Ray tasks to run concurrently. Set this to control number of tasks to run concurrently. This doesn’t change the total number of tasks run or the total number of output blocks. By default, concurrency is dynamically decided based on the available resources.
override_num_blocks – Override the number of output blocks from all read tasks. By default, the number of output blocks is dynamically decided based on input data size and available resources. You shouldn’t manually set this value in most cases.
num_cpus – The number of CPUs to reserve for each parallel read worker.
num_gpus – The number of GPUs to reserve for each parallel read worker. For example, specify
num_gpus=1to request 1 GPU for each parallel read worker.memory – The heap memory in bytes to reserve for each parallel read worker.
ray_remote_args – kwargs passed to
remote()in the read tasks.
- Returns:
A
Datasetof long-form chunk rows by default (array,chunk_index,chunk_slices,chunk), or wide-form aligned rows (t_start,t_stop, plus one column per aligned array) whenalign_axis_0is set.
PublicAPI (alpha): This API is in alpha and may change before becoming stable.