ray.data.Dataset.add_column

Dataset.add_column(col: str, fn: Callable[[pandas.DataFrame], pandas.Series], *, compute: Optional[str] = None, **ray_remote_args) -> Dataset[T]

Add the given column to the dataset.

This is only supported for datasets convertible to pandas format. You must specify a function that receives each batch in pandas format and generates the values for the new column.

Examples

>>> import ray
>>> ds = ray.data.range_table(100)
>>> # Add a new column equal to value * 2.
>>> ds = ds.add_column(
...     "new_col", lambda df: df["value"] * 2)
>>> # Overwrite the existing "value" with zeros.
>>> ds = ds.add_column("value", lambda df: 0)

Time complexity: O(dataset size / parallelism)

Parameters
  • col – Name of the column to add. If the name already exists, the column will be overwritten.

  • fn – Map function generating the column values given a batch of records in pandas format.

  • compute – The compute strategy, either "tasks" (default) to use Ray tasks, or ActorPoolStrategy(min, max) to use an autoscaling actor pool.

  • ray_remote_args – Additional resource requirements to request from Ray (e.g., num_gpus=1 to request GPUs for the map tasks).
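
To make the col and fn parameters concrete, the following is a minimal sketch in plain pandas of the per-batch effect described above: fn receives one batch as a pandas.DataFrame, and its result is stored under the name col, replacing any existing column of that name. The helper apply_add_column is hypothetical and exists only for illustration. It is not part of Ray and is not presented as Ray's implementation.

```python
import pandas as pd


def apply_add_column(batch: pd.DataFrame, col: str, fn) -> pd.DataFrame:
    """Hypothetical helper (not a Ray API): apply `fn` to one pandas batch
    and store the result under `col`, overwriting `col` if it exists."""
    out = batch.copy()  # leave the input batch untouched
    out[col] = fn(out)  # a Series is aligned by index; a scalar is broadcast
    return out


# One batch with a single "value" column, standing in for a dataset block.
batch = pd.DataFrame({"value": range(5)})

# Add a new column equal to value * 2.
with_new = apply_add_column(batch, "new_col", lambda df: df["value"] * 2)

# Overwrite the existing "value" column with zeros.
overwritten = apply_add_column(batch, "value", lambda df: 0)
```

Because fn only sees one batch at a time, this sketch assumes the new values depend only on rows within that batch.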