Working with Tensors / NumPy#

N-dimensional arrays (in other words, tensors) are ubiquitous in ML workloads. This guide describes the limitations and best practices of working with such data.

Tensor data representation#

Ray Data represents tensors as NumPy ndarrays.

import ray

ds = ray.data.read_images("s3://anonymous@air-example-data/digits")
print(ds)

Dataset(num_rows=100, schema=...)

Batches of fixed-shape tensors#

If your tensors have a fixed shape, Ray Data represents batches as regular ndarrays.

>>> import ray
>>> ds = ray.data.read_images("s3://anonymous@air-example-data/digits")
>>> batch = ds.take_batch(batch_size=32)
>>> batch["image"].shape
(32, 28, 28)
>>> batch["image"].dtype
dtype('uint8')

Batches of variable-shape tensors#

If your tensors vary in shape, Ray Data represents batches as arrays of object dtype.

>>> import ray
>>> ds = ray.data.read_images("s3://anonymous@air-example-data/AnimalDetection")
>>> batch = ds.take_batch(batch_size=32)
>>> batch["image"].shape
(32,)
>>> batch["image"].dtype
dtype('O')

The individual elements of these object arrays are regular ndarrays.

>>> batch["image"][0].dtype
dtype('uint8')
>>> batch["image"][0].shape  
(375, 500, 3)
>>> batch["image"][3].shape  
(333, 465, 3)

Transforming tensor data#

Call map() or map_batches() to transform tensor data.

from typing import Any, Dict

import ray
import numpy as np

ds = ray.data.read_images("s3://anonymous@air-example-data/AnimalDetection")

def increase_brightness(row: Dict[str, Any]) -> Dict[str, Any]:
    row["image"] = np.clip(row["image"] + 4, 0, 255)
    return row

# Increase the brightness, record at a time.
ds.map(increase_brightness)

def batch_increase_brightness(batch: Dict[str, np.ndarray]) -> Dict:
    batch["image"] = np.clip(batch["image"] + 4, 0, 255)
    return batch

# Increase the brightness, batch at a time.
ds.map_batches(batch_increase_brightness)

In addition to NumPy ndarrays, Ray Data also treats returned lists of NumPy ndarrays and objects implementing __array__ (for example, torch.Tensor) as tensor data.

For more information on transforming data, read Transforming data.

Saving tensor data#

Save tensor data with formats like Parquet, NumPy, and JSON. To view all supported formats, see the Input/Output reference.

Parquet

Call write_parquet() to save data in Parquet files.

import ray

ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple")
ds.write_parquet("/tmp/simple")

NumPy

Call write_numpy() to save an ndarray column in NumPy files.

import ray

ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple")
ds.write_numpy("/tmp/simple", column="image")

JSON

To save images in a JSON file, call write_json().

import ray

ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple")
ds.write_json("/tmp/simple")

For more information on saving data, read Saving data.