Dataset.add_column(col: str, fn: Callable[[pandas.DataFrame], pandas.Series], *, compute: str | None = None, concurrency: int | Tuple[int, int] | None = None, **ray_remote_args) -> Dataset

Add the given column to the dataset.

You must specify a function that generates the new column's values given a batch of records in pandas format.


>>> import ray
>>> ds = ray.data.range(100)
>>> ds.schema()
Column  Type
------  ----
id      int64

Add a new column equal to id * 2.

>>> ds.add_column("new_id", lambda df: df["id"] * 2).schema()
Column  Type
------  ----
id      int64
new_id  int64

Overwrite the existing values with zeros.

>>> ds.add_column("id", lambda df: 0).take(3)
[{'id': 0}, {'id': 0}, {'id': 0}]
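The per-batch contract behind these examples can be sketched without Ray: fn receives one batch as a pandas DataFrame and returns the new column's values. This is a minimal local illustration, with a hypothetical batch standing in for one block of the dataset.

```python
import pandas as pd

# Sketch of the per-batch contract: fn receives a batch as a pandas
# DataFrame and returns the new column's values as a Series.
def double_id(df: pd.DataFrame) -> pd.Series:
    return df["id"] * 2

# Hypothetical batch standing in for one block of the dataset.
batch = pd.DataFrame({"id": [0, 1, 2]})

# add_column effectively assigns fn's result to the named column.
batch["new_id"] = double_id(batch)
print(batch["new_id"].tolist())  # [0, 2, 4]
```

Because column assignment in pandas broadcasts scalars, a function that returns a plain value (as in the overwrite example above) also works: every row in the batch receives that value.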

Time complexity: O(dataset size / parallelism)

Parameters:

  • col – Name of the column to add. If the name already exists, the column is overwritten.

  • fn – Map function generating the column values given a batch of records in pandas format.

  • compute – Deprecated. Use the concurrency argument instead.

  • concurrency – The number of Ray workers to use concurrently. For a fixed-sized worker pool of size n, specify concurrency=n. For an autoscaling worker pool from m to n workers, specify concurrency=(m, n).

  • ray_remote_args – Additional resource requirements to request from Ray (e.g., num_gpus=1 to request GPUs for the map tasks).