API Guide for Users from Other Data Libraries

Ray Datasets is a data loading and preprocessing library for ML. It shares certain similarities with other ETL data processing libraries, but also has its own focus. In this API guide, we will provide API mappings for users who come from those data libraries, so you can quickly map what you may already know to Ray Datasets APIs.

Note

  • This is meant to map APIs that perform comparable but not necessarily identical operations. Please check the API reference for exact semantics and usage.

  • This list may not be exhaustive: Ray Datasets is not a traditional ETL data processing library, so not all data processing APIs can map to Datasets. In addition, we try to focus on common APIs or APIs that are less obvious to see a connection.

For Pandas Users

Pandas DataFrame vs. Ray Datasets APIs

Pandas DataFrame API

Ray Datasets API

df.head()

ds.show() or ds.take()

df.dtypes

ds.schema()

len(df) or df.shape[0]

ds.count()

df.truncate()

ds.limit()

df.iterrows()

ds.iter_rows()

df.drop()

ds.drop_columns()

df.transform()

ds.map_batches() or ds.map()

df.groupby()

ds.groupby()

df.groupby().apply()

ds.groupby().map_groups()

df.sample()

ds.random_sample()

df.sort_values()

ds.sort()

df.append()

ds.union()

df.aggregate()

ds.aggregate()

df.min()

ds.min()

df.max()

ds.max()

df.sum()

ds.sum()

df.mean()

ds.mean()

df.std()

ds.std()

For PyArrow Users

PyArrow Table vs. Ray Datasets APIs

PyArrow Table API

Ray Datasets API

pa.Table.schema

ds.schema()

pa.Table.num_rows

ds.count()

pa.Table.filter()

ds.filter()

pa.Table.drop()

ds.drop_columns()

pa.Table.add_column()

ds.add_column()

pa.Table.groupby()

ds.groupby()

pa.Table.sort_by()

ds.sort()