Expressions API
Expressions provide a way to specify column-based operations on datasets.
Use col() to reference columns and lit() to create literal values.
You can combine these with operators to create complex expressions for filtering,
transformations, and computations.
Public API
- References all input columns from the input.
- Reference an existing column by name.
- Create a literal expression from a constant value.
- Decorator to convert a UDF into an expression-compatible function.
- Decorator for PyArrow compute functions with automatic format conversion.
- Create a download expression that downloads content from URIs.
- Create an expression that generates monotonically increasing IDs.
- Create an expression that generates random numbers.
- Create a UUID expression that generates unique identifiers.
Expression Classes
These classes represent the structure of expressions. You typically don’t need to instantiate them directly, but you may encounter them when working with expressions.
- Base class for all expression nodes.
- Expression that references a column by name.
- Expression that represents a constant scalar value.
- Expression that represents a binary operation between two expressions.
- Expression that represents a unary operation on a single expression.
- Expression that represents a user-defined function call.
- Expression that represents all columns from the input.
- Expression that represents a download operation.
- Expression that represents a monotonically increasing ID column.
- Expression that represents a random number generation operation.
- Expression that represents a UUID generation operation.
Expression namespaces
These namespace classes provide specialized operations for list, string, struct, array, and
datetime columns. You access them through properties on expressions: .list, .str,
.struct, .arr, and .dt.
The following example shows how to use the string namespace to transform text columns:
import ray
from ray.data.expressions import col
# Create a dataset with a text column
ds = ray.data.from_items([
    {"name": "alice"},
    {"name": "bob"},
    {"name": "charlie"}
])
# Use the string namespace to uppercase the names
ds = ds.with_column("upper_name", col("name").str.upper())
ds.show()
{'name': 'alice', 'upper_name': 'ALICE'}
{'name': 'bob', 'upper_name': 'BOB'}
{'name': 'charlie', 'upper_name': 'CHARLIE'}
The following example shows how to use the list namespace to work with list columns:
import ray
from ray.data.expressions import col
# Create a dataset with list columns
ds = ray.data.from_items([
    {"scores": [85, 90, 78]},
    {"scores": [92, 88]},
    {"scores": [76, 82, 88, 91]}
])
# Use the list namespace to get the length of each list
ds = ds.with_column("num_scores", col("scores").list.len())
ds.show()
{'scores': [85, 90, 78], 'num_scores': 3}
{'scores': [92, 88], 'num_scores': 2}
{'scores': [76, 82, 88, 91], 'num_scores': 4}
You can also perform list-specific transformations like sorting and flattening:
import ray
from ray.data.expressions import col
ds = ray.data.from_items([
    {"values": [3, 1, 2], "nested": [[1, 2], [3]]},
    {"values": [2, None, 5], "nested": [[4], []]}
])
ds = ds.with_column(
    "sorted_values", col("values").list.sort(order="descending")
)
ds = ds.with_column(
    "flattened_nested", col("nested").list.flatten()
)
ds.show()
{'values': [3, 1, 2], 'nested': [[1, 2], [3]], 'sorted_values': [3, 2, 1], 'flattened_nested': [1, 2, 3]}
{'values': [2, None, 5], 'nested': [[4], []], 'sorted_values': [5, 2, None], 'flattened_nested': [4]}
The following example shows how to use the struct namespace to access nested fields:
import ray
from ray.data.expressions import col
# Create a dataset with struct columns
ds = ray.data.from_items([
    {"user": {"name": "alice", "age": 25}},
    {"user": {"name": "bob", "age": 30}},
    {"user": {"name": "charlie", "age": 35}}
])
# Use the struct namespace to extract a specific field
ds = ds.with_column("user_name", col("user").struct.field("name"))
ds.show()
{'user': {'name': 'alice', 'age': 25}, 'user_name': 'alice'}
{'user': {'name': 'bob', 'age': 30}, 'user_name': 'bob'}
{'user': {'name': 'charlie', 'age': 35}, 'user_name': 'charlie'}
The following example shows how to use the array namespace to convert fixed-size list columns to variable-length lists:
import pyarrow as pa
import ray
from ray.data.expressions import col
values = pa.array([1, 2, 3, 4])
fixed = pa.FixedSizeListArray.from_arrays(values, 2)
table = pa.table({"features": fixed})
ds = ray.data.from_arrow(table)
ds = ds.with_column("features_list", col("features").arr.to_list())
ds.show()
{'features': [1, 2], 'features_list': [1, 2]}
{'features': [3, 4], 'features_list': [3, 4]}
The following example shows how to use the datetime namespace to extract components:
import datetime
import pandas as pd
import ray
from ray.data.expressions import col
ds = ray.data.from_items([
    {"ts": pd.Timestamp("2024-01-02 03:04:05")},
    {"ts": pd.Timestamp("2024-02-03 04:05:06")}
])
ds = ds.with_column("year", col("ts").dt.year())
ds.show()
{'ts': datetime.datetime(2024, 1, 2, 3, 4, 5), 'year': 2024}
{'ts': datetime.datetime(2024, 2, 3, 4, 5, 6), 'year': 2024}
- class ray.data.expressions._ListNamespace(_expr: Expr)
Namespace for list operations on expression columns.
This namespace provides methods for operating on list-typed columns using PyArrow compute functions.
Example
>>> from ray.data.expressions import col
>>> # Get length of list column
>>> expr = col("items").list.len()
>>> # Get first item using method
>>> expr = col("items").list.get(0)
>>> # Get first item using indexing
>>> expr = col("items").list[0]
>>> # Slice list
>>> expr = col("items").list[1:3]
- get(index: int) -> PyArrowComputeUDFExpr
Get element at the specified index from each list.
- Parameters:
index – The index of the element to retrieve. Negative indices are supported.
- Returns:
Expression that extracts the element at the given index.
- slice(start: int | None = None, stop: int | None = None, step: int | None = None) -> PyArrowComputeUDFExpr
Slice each list.
- Parameters:
start – Start index (inclusive). Defaults to 0.
stop – Stop index (exclusive). Defaults to list length.
step – Step size. Defaults to 1.
- Returns:
Expression that extracts a slice from each list.
- sort(order: Literal['ascending', 'descending'] = 'ascending', null_placement: Literal['at_start', 'at_end'] = 'at_end') -> UDFExpr
Sort the elements within each (nested) list.
- Parameters:
order – Sorting order, must be "ascending" or "descending".
null_placement – Placement for null values, "at_start" or "at_end".
- Returns:
UDFExpr providing the sorted lists.
Example
>>> from ray.data.expressions import col
>>> # [[3,1],[2,None]] -> [[1,3],[2,None]]
>>> expr = col("items").list.sort()
- class ray.data.expressions._StringNamespace(_expr: Expr)
Namespace for string operations on expression columns.
This namespace provides methods for operating on string-typed columns using PyArrow compute functions.
Example
>>> from ray.data.expressions import col
>>> # Convert to uppercase
>>> expr = col("name").str.upper()
>>> # Get string length
>>> expr = col("name").str.len()
>>> # Check if string starts with a prefix
>>> expr = col("name").str.starts_with("A")
- starts_with(pattern: str, *args: Any, **kwargs: Any) -> PyArrowComputeUDFExpr
Check if strings start with a pattern.
- ends_with(pattern: str, *args: Any, **kwargs: Any) -> PyArrowComputeUDFExpr
Check if strings end with a pattern.
- contains(pattern: str, *args: Any, **kwargs: Any) -> PyArrowComputeUDFExpr
Check if strings contain a substring.
- match(pattern: str, *args: Any, **kwargs: Any) -> PyArrowComputeUDFExpr
Match strings against a SQL LIKE pattern.
- find(pattern: str, *args: Any, **kwargs: Any) -> PyArrowComputeUDFExpr
Find the first occurrence of a substring.
- count(pattern: str, *args: Any, **kwargs: Any) -> PyArrowComputeUDFExpr
Count occurrences of a substring.
- find_regex(pattern: str, *args: Any, **kwargs: Any) -> PyArrowComputeUDFExpr
Find the first occurrence matching a regex pattern.
- count_regex(pattern: str, *args: Any, **kwargs: Any) -> PyArrowComputeUDFExpr
Count occurrences matching a regex pattern.
- match_regex(pattern: str, *args: Any, **kwargs: Any) -> PyArrowComputeUDFExpr
Check if strings match a regex pattern.
- replace(pattern: str, replacement: str, *args: Any, **kwargs: Any) -> PyArrowComputeUDFExpr
Replace occurrences of a substring.
- replace_regex(pattern: str, replacement: str, *args: Any, **kwargs: Any) -> PyArrowComputeUDFExpr
Replace occurrences matching a regex pattern.
- replace_slice(start: int, stop: int, replacement: str, *args: Any, **kwargs: Any) -> PyArrowComputeUDFExpr
Replace a slice with a string.
- split(pattern: str, *args: Any, **kwargs: Any) -> PyArrowComputeUDFExpr
Split strings by a pattern.
- split_regex(pattern: str, *args: Any, **kwargs: Any) -> PyArrowComputeUDFExpr
Split strings by a regex pattern.
- split_whitespace(*args: Any, **kwargs: Any) -> PyArrowComputeUDFExpr
Split strings on whitespace.
- extract(pattern: str, *args: Any, **kwargs: Any) -> PyArrowComputeUDFExpr
Extract a substring matching a regex pattern.
- repeat(n: int, *args: Any, **kwargs: Any) -> PyArrowComputeUDFExpr
Repeat each string n times.
- center(width: int, padding: str = ' ', *args: Any, **kwargs: Any) -> PyArrowComputeUDFExpr
Center strings in a field of given width.
- lpad(width: int, padding: str = ' ', *args: Any, **kwargs: Any) -> PyArrowComputeUDFExpr
Right-align strings by padding with a given character while respecting width.
If the string is longer than the specified width, it remains intact (no truncation occurs).
- rpad(width: int, padding: str = ' ', *args: Any, **kwargs: Any) -> PyArrowComputeUDFExpr
Left-align strings by padding with a given character while respecting width.
If the string is longer than the specified width, it remains intact (no truncation occurs).
- strip(characters: str | None = None) -> PyArrowComputeUDFExpr
Remove leading and trailing whitespace or specified characters.
- Parameters:
characters – Characters to remove. If None, removes whitespace.
- Returns:
Expression that strips characters from both ends.
- lstrip(characters: str | None = None) -> PyArrowComputeUDFExpr
Remove leading whitespace or specified characters.
- Parameters:
characters – Characters to remove. If None, removes whitespace.
- Returns:
Expression that strips characters from the left.
- rstrip(characters: str | None = None) -> PyArrowComputeUDFExpr
Remove trailing whitespace or specified characters.
- Parameters:
characters – Characters to remove. If None, removes whitespace.
- Returns:
Expression that strips characters from the right.
- pad(width: int, fillchar: str = ' ', side: Literal['left', 'right', 'both'] = 'right') -> PyArrowComputeUDFExpr
Pad strings to a specified width.
- Parameters:
width – Target width.
fillchar – Character to use for padding.
side – “left”, “right”, or “both” for padding side.
- Returns:
Expression that pads strings to the given width.
- class ray.data.expressions._StructNamespace(_expr: Expr)
Namespace for struct operations on expression columns.
This namespace provides methods for operating on struct-typed columns using PyArrow compute functions.
Example
>>> from ray.data.expressions import col
>>> # Access a field using method
>>> expr = col("user_record").struct.field("age")
>>> # Access a field using bracket notation
>>> expr = col("user_record").struct["age"]
>>> # Access nested field
>>> expr = col("user_record").struct["address"].struct["city"]
- class ray.data.expressions._ArrayNamespace(_expr: Expr)
Namespace for array operations on expression columns.
Example
>>> from ray.data.expressions import col
>>> # Convert fixed-size lists to variable-length lists
>>> expr = col("features").arr.to_list()
- class ray.data.expressions._DatetimeNamespace(_expr: Expr)
Datetime namespace for operations on datetime-typed expression columns.
- ceil(unit: TemporalUnit) -> PyArrowComputeUDFExpr
Ceil timestamps to the next multiple of the given unit.