Inspecting Data#

Inspect Datasets to better understand your data.

This guide shows you how to:

- Describe datasets
- Inspect rows
- Inspect batches
- Inspect execution statistics

Describing datasets#

Datasets are tabular. To view a dataset’s column names and types, call Dataset.schema().

import ray

ds = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv")

print(ds.schema())
Column             Type
------             ----
sepal length (cm)  double
sepal width (cm)   double
petal length (cm)  double
petal width (cm)   double
target             int64
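
If you only need the column names, recent Ray releases also provide Dataset.columns(). A minimal sketch, assuming the same Iris dataset as above:

import ray

ds = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv")

# Returns the column names as a list of strings.
print(ds.columns())
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)', 'target']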

For more information, such as the number of rows, print the Dataset.

import ray

ds = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv")

print(ds)
Dataset(
   num_rows=150,
   schema={
      sepal length (cm): double,
      sepal width (cm): double,
      petal length (cm): double,
      petal width (cm): double,
      target: int64
   }
)
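
If you only need the number of rows, Dataset.count() returns it directly. A minimal sketch using the same dataset:

import ray

ds = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv")

# Returns the row count; this may trigger execution if the count isn't already known.
print(ds.count())
150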

Inspecting rows#

To get a list of rows, call Dataset.take() for a fixed number of rows or Dataset.take_all() for every row. Ray Data represents each row as a dictionary.

import ray

ds = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv")

rows = ds.take(1)
print(rows)
[{'sepal length (cm)': 5.1, 'sepal width (cm)': 3.5, 'petal length (cm)': 1.4, 'petal width (cm)': 0.2, 'target': 0}]
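
To print rows instead of returning them, you can also call Dataset.show(). A minimal sketch:

import ray

ds = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv")

# Prints up to the given number of rows, one dictionary per line.
ds.show(1)
{'sepal length (cm)': 5.1, 'sepal width (cm)': 3.5, 'petal length (cm)': 1.4, 'petal width (cm)': 0.2, 'target': 0}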

For more information on working with rows, see Transforming rows and Iterating over rows.

Inspecting batches#

A batch contains data from multiple rows. To inspect batches, call Dataset.take_batch().

By default, Ray Data represents batches as dicts of NumPy ndarrays. To change the type of the returned batch, set batch_format.

import ray

ds = ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple")

batch = ds.take_batch(batch_size=2, batch_format="numpy")
print("Batch:", batch)
print("Image shape", batch["image"].shape)
Batch: {'image': array([[[[...]]]], dtype=uint8)}
Image shape: (2, 32, 32, 3)

To represent batches as pandas DataFrames instead, set batch_format="pandas":

import ray

ds = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv")

batch = ds.take_batch(batch_size=2, batch_format="pandas")
print(batch)
   sepal length (cm)  sepal width (cm)  ...  petal width (cm)  target
0                5.1               3.5  ...               0.2       0
1                4.9               3.0  ...               0.2       0
[2 rows x 5 columns]

For more information on working with batches, see Transforming batches and Iterating over batches.

Inspecting execution statistics#

Ray Data calculates statistics during execution for each operator, such as wall clock time and memory usage.

To view stats about your Datasets, call Dataset.stats() on an executed dataset. The stats are also persisted under /tmp/ray/session_*/logs/ray-data/ray-data.log. For more on how to read this output, see Monitoring Your Workload with the Ray Data Dashboard.

import ray
import datasets

def f(batch):
    # Pass-through transform: returns each batch unchanged.
    return batch

def g(row):
    # Row predicate that keeps every row.
    return True

hf_ds = datasets.load_dataset("mnist", "mnist")
ds = (
    ray.data.from_huggingface(hf_ds["train"])
    .map_batches(f)
    .filter(g)
    .materialize()
)

print(ds.stats())
Operator 1 ReadParquet->SplitBlocks(32): 1 tasks executed, 32 blocks produced in 2.92s
* Remote wall time: 103.38us min, 1.34s max, 42.14ms mean, 1.35s total
* Remote cpu time: 102.0us min, 164.66ms max, 5.37ms mean, 171.72ms total
* UDF time: 0us min, 0us max, 0.0us mean, 0us total
* Peak heap memory usage (MiB): 266375.0 min, 281875.0 max, 274491 mean
* Output num rows per block: 1875 min, 1875 max, 1875 mean, 60000 total
* Output size bytes per block: 537986 min, 555360 max, 545963 mean, 17470820 total
* Output rows per task: 60000 min, 60000 max, 60000 mean, 1 tasks used
* Tasks per node: 1 min, 1 max, 1 mean; 1 nodes used
* Operator throughput:
    * Ray Data throughput: 20579.80984833993 rows/s
    * Estimated single node throughput: 44492.67361278733 rows/s

Operator 2 MapBatches(f)->Filter(g): 32 tasks executed, 32 blocks produced in 3.63s
* Remote wall time: 675.48ms min, 1.0s max, 797.07ms mean, 25.51s total
* Remote cpu time: 673.41ms min, 897.32ms max, 768.09ms mean, 24.58s total
* UDF time: 661.65ms min, 978.04ms max, 778.13ms mean, 24.9s total
* Peak heap memory usage (MiB): 152281.25 min, 286796.88 max, 164231 mean
* Output num rows per block: 1875 min, 1875 max, 1875 mean, 60000 total
* Output size bytes per block: 530251 min, 547625 max, 538228 mean, 17223300 total
* Output rows per task: 1875 min, 1875 max, 1875 mean, 32 tasks used
* Tasks per node: 32 min, 32 max, 32 mean; 1 nodes used
* Operator throughput:
    * Ray Data throughput: 16512.364546087643 rows/s
    * Estimated single node throughput: 2352.3683708977856 rows/s

Dataset throughput:
    * Ray Data throughput: 11463.372316361854 rows/s
    * Estimated single node throughput: 25580.963670075285 rows/s