ray.data.Dataset.with_column

Dataset.with_column(column_name: str, expr: Expr, *, compute: ComputeStrategy | None = None, **ray_remote_args) → Dataset

Add a new column to the dataset via an expression.

The expression can combine existing columns, literals, and user-defined functions (UDFs). Ray Data evaluates it to produce the values of the new column.

When the expression contains a callable class UDF, Ray Data automatically uses actor semantics so that the UDF instance can maintain state across batches. Pass compute to control parallelism, and **ray_remote_args to control resource allocation.

Examples

>>> import ray
>>> from ray.data.expressions import col
>>> ds = ray.data.range(100)
>>> # Add a new column 'id_2' by multiplying 'id' by 2.
>>> ds.with_column("id_2", col("id") * 2).show(2)
{'id': 0, 'id_2': 0}
{'id': 1, 'id_2': 2}
>>> # Using a UDF with with_column
>>> from ray.data.datatype import DataType
>>> from ray.data.expressions import udf
>>> import pyarrow.compute as pc
>>>
>>> @udf(return_dtype=DataType.int32())
... def add_one(column):
...     return pc.add(column, 1)
>>>
>>> ds.with_column("id_plus_one", add_one(col("id"))).show(2)
{'id': 0, 'id_plus_one': 1}
{'id': 1, 'id_plus_one': 2}
>>> # Using a callable class UDF (automatically uses actors)
>>> @udf(return_dtype=DataType.int32())
... class AddOffset:
...     def __init__(self, offset):
...         self.offset = offset
...     def __call__(self, x):
...         return pc.add(x, self.offset)
>>>
>>> add_five = AddOffset(5)
>>> ds.with_column("id_plus_five", add_five(col("id"))).show(2)
{'id': 0, 'id_plus_five': 5}
{'id': 1, 'id_plus_five': 6}
Parameters:
  • column_name – The name of the new column.

  • expr – An expression that defines the new column values.

  • compute – The compute strategy to use for the projection operation. If not specified and the expression contains callable class UDFs, Ray Data automatically uses ActorPoolStrategy to provide actor semantics. Otherwise, it uses TaskPoolStrategy.

    • Use ray.data.ActorPoolStrategy(size=n) to use a fixed-size actor pool of n workers.

    • Use ray.data.ActorPoolStrategy(min_size=m, max_size=n) to use an autoscaling actor pool of m to n workers.

  • **ray_remote_args – Additional resource requirements to request from Ray for the map tasks (e.g., num_gpus=1).

Returns:

A new dataset with the added column evaluated via the expression.

PublicAPI (alpha): This API is in alpha and may change before becoming stable.