ray.data.Dataset.add_column

Dataset.add_column(col: str, fn: Callable[[pyarrow.Table | pandas.DataFrame | Dict[str, numpy.ndarray]], pyarrow.ChunkedArray | pyarrow.Array | pandas.Series | numpy.ndarray], *, batch_format: str | None = 'pandas', compute: str | None = None, concurrency: int | Tuple[int, int] | None = None, **ray_remote_args) → Dataset[source]

Add the given column to the dataset.

A function that generates the new column's values from each batch must be specified. The function receives batches in the format given by batch_format and must operate on that format.

Examples

>>> import ray
>>> ds = ray.data.range(100)
>>> ds.schema()
Column  Type
------  ----
id      int64

Add a new column equal to id * 2.

>>> ds.add_column("new_id", lambda df: df["id"] * 2).schema()
Column  Type
------  ----
id      int64
new_id  int64
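
The new column can also be computed from batches in another format. A minimal sketch, assuming the NumPy batch format described under batch_format below:

>>> ds.add_column(
...     "new_id", lambda batch: batch["id"] * 2, batch_format="numpy"
... ).schema()
Column  Type
------  ----
id      int64
new_id  int64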

Time complexity: O(dataset size / parallelism)

Parameters:
  • col – Name of the column to add. If the name already exists, the column is overwritten.

  • fn – Map function generating the column values given a batch of records. The batch is provided in the format specified by batch_format.

  • batch_format – The format in which batches are provided to fn. If "default" or "numpy", batches are Dict[str, numpy.ndarray]. If "pandas", batches are pandas.DataFrame. If "pyarrow", batches are pyarrow.Table. Defaults to "pandas".

  • compute – This argument is deprecated. Use the concurrency argument instead.

  • concurrency – The number of Ray workers to use concurrently. For a fixed-size worker pool of size n, specify concurrency=n. For an autoscaling worker pool from m to n workers, specify concurrency=(m, n) (see the example after this list).

  • ray_remote_args – Additional resource requirements to request from Ray (e.g., num_gpus=1 to request GPUs for the map tasks). See ray.remote() for details.
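
For example, a minimal sketch that caps the pool at four concurrent workers and requests one CPU per task (the values are illustrative assumptions, not tuning advice):

>>> ds.add_column(
...     "new_id", lambda df: df["id"] * 2, concurrency=4, num_cpus=1
... ).schema()
Column  Type
------  ----
id      int64
new_id  int64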