ray.data.expressions.udf

ray.data.expressions.udf() → Callable[[...], UDFExpr]

Decorator to convert a UDF into an expression-compatible function.

This decorator allows UDFs to be used seamlessly within the expression system, enabling schema inference and integration with other expressions.

IMPORTANT: UDFs operate on batches of data, not individual rows. When your UDF is called, each column argument is passed as a PyArrow Array holding that column's values for the whole batch. A UDF that takes several columns receives one PyArrow Array per column.

Returns:

A callable that creates UDFExpr instances when called with expressions. Calling the decorated function does not run the UDF immediately.
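To make the "callable that creates expressions" idea concrete, here is a toy model of the pattern in plain Python. It is not Ray's implementation: `ToyExpr`, `ToyUDFExpr`, `toy_udf`, and `evaluate` are all hypothetical names invented for this sketch, and Python lists stand in for PyArrow Arrays. It only shows how a decorator factory can turn a function call into a deferred expression object.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass(frozen=True)
class ToyExpr:
    """Hypothetical stand-in for a column reference such as col("value")."""
    name: str


@dataclass(frozen=True)
class ToyUDFExpr:
    """Hypothetical stand-in for UDFExpr: a function plus its argument expressions."""
    fn: Callable[..., Any]
    args: Tuple[ToyExpr, ...]

    def evaluate(self, batch: Dict[str, List[Any]]) -> Any:
        # Resolve each column reference against the batch,
        # then call the wrapped function once for the whole batch.
        return self.fn(*(batch[arg.name] for arg in self.args))


def toy_udf() -> Callable[[Callable[..., Any]], Callable[..., ToyUDFExpr]]:
    """Decorator factory: the decorated name builds expressions instead of computing."""
    def decorator(fn: Callable[..., Any]) -> Callable[..., ToyUDFExpr]:
        def build_expr(*args: ToyExpr) -> ToyUDFExpr:
            return ToyUDFExpr(fn, args)
        return build_expr
    return decorator


@toy_udf()
def add_one(xs: List[int]) -> List[int]:
    return [x + 1 for x in xs]


# Calling the decorated function only records what to do.
expr = add_one(ToyExpr("value"))

# The work happens later, when a batch is supplied.
result = expr.evaluate({"value": [5, 10]})
```

In this toy, `add_one(ToyExpr("value"))` returns a `ToyUDFExpr` and nothing is computed until `evaluate` receives a batch, which mirrors the deferred style described above.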

Example

>>> from ray.data.expressions import col, udf
>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> import ray
>>>
>>> # UDF that operates on a batch of values (PyArrow Array)
>>> @udf()
... def add_one(x: pa.Array) -> pa.Array:
...     return pc.add(x, 1)  # Vectorized operation on the entire Array
>>>
>>> # UDF that combines multiple columns (each as a PyArrow Array)
>>> @udf()
... def format_name(first: pa.Array, last: pa.Array) -> pa.Array:
...     return pc.binary_join_element_wise(first, last, " ")  # Vectorized string concatenation
>>>
>>> # Use in dataset operations
>>> ds = ray.data.from_items([
...     {"value": 5, "first": "John", "last": "Doe"},
...     {"value": 10, "first": "Jane", "last": "Smith"}
... ])
>>>
>>> # Single column transformation (operates on batches)
>>> ds_incremented = ds.with_column("value_plus_one", add_one(col("value")))
>>>
>>> # Multi-column transformation (each column becomes a PyArrow Array)
>>> ds_formatted = ds.with_column("full_name", format_name(col("first"), col("last")))
>>>
>>> # Can also be combined with other expressions: computes (value + 1) * 2
>>> ds_complex = ds.with_column("plus_one_doubled", add_one(col("value")) * 2)

PublicAPI (alpha): This API is in alpha and may change before becoming stable.