ray.data.Dataset.add_column
Dataset.add_column(col: str, fn: Callable[[pyarrow.Table | pandas.DataFrame | Dict[str, numpy.ndarray]], pyarrow.ChunkedArray | pyarrow.Array | pandas.Series | numpy.ndarray], *, batch_format: str | None = 'pandas', compute: str | None = None, concurrency: int | Tuple[int, int] | None = None, **ray_remote_args) -> Dataset [source]
Add the given column to the dataset.

A function generating the new column values must be specified. It receives each batch in the format given by batch_format (pandas, pyarrow, or a dict of NumPy arrays) and returns the values for the new column.

Examples
>>> import ray
>>> ds = ray.data.range(100)
>>> ds.schema()
Column  Type
------  ----
id      int64
Add a new column equal to id * 2.

>>> ds.add_column("new_id", lambda df: df["id"] * 2).schema()
Column  Type
------  ----
id      int64
new_id  int64
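The UDF can also consume other batch formats. A minimal sketch (not part of the original example) assuming batch_format="pyarrow", where the function receives a pyarrow.Table and returns a pyarrow ChunkedArray:

>>> import pyarrow.compute as pc
>>> ds_pa = ds.add_column(
...     "new_id",
...     lambda table: pc.multiply(table["id"], 2),  # table is a pyarrow.Table
...     batch_format="pyarrow",
... )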
Time complexity: O(dataset size / parallelism)
Parameters:
- col – Name of the column to add. If the name already exists, the column is overwritten.
- fn – Map function that generates the column values given a batch of records in batch_format.
- batch_format – If "default" or "numpy", batches are Dict[str, numpy.ndarray]. If "pandas", batches are pandas.DataFrame. If "pyarrow", batches are pyarrow.Table.
- compute – This argument is deprecated. Use the concurrency argument instead.
- concurrency – The number of Ray workers to use concurrently. For a fixed-size worker pool of size n, specify concurrency=n. For an autoscaling worker pool from m to n workers, specify concurrency=(m, n).
- ray_remote_args – Additional resource requirements to request from Ray, e.g., num_gpus=1 to request GPUs for the map tasks (see the sketch after this list).
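A minimal sketch illustrating concurrency and ray_remote_args together (the worker count and CPU fraction here are arbitrary assumptions, not recommendations):

>>> ds_scaled = ds.add_column(
...     "new_id",
...     lambda df: df["id"] * 2,
...     concurrency=4,  # fixed pool of 4 concurrent workers
...     num_cpus=0.5,   # ray_remote_args: request half a CPU per map task
... )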