Expressions API
Expressions provide a way to specify column-based operations on datasets.
Use col() to reference columns and lit() to create literal values.
You can combine these with operators to create complex expressions for filtering,
transformations, and computations.
Public API
Reference all columns from the input.
Reference an existing column by name.
Create a literal expression from a constant value.
Decorator to convert a UDF into an expression-compatible function.
Decorator for PyArrow compute functions with automatic format conversion.
Create a download expression that downloads content from URIs.
Expression Classes
These classes represent the structure of expressions. You typically don’t need to instantiate them directly, but you may encounter them when working with expressions.
Base class for all expression nodes.
Expression that references a column by name.
Expression that represents a constant scalar value.
Expression that represents a binary operation between two expressions.
Expression that represents a unary operation on a single expression.
Expression that represents a user-defined function call.
Expression that represents all columns from the input.
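To make the node types above more concrete, the following is a minimal, standard-library-only sketch of how an expression tree with column, literal, and binary-operation nodes could be built through operator overloading and then evaluated against one row. The names `ToyNode`, `ToyColumn`, `ToyLiteral`, `ToyBinary`, and `evaluate` are hypothetical and exist only for this illustration; this is not Ray's implementation.

```python
import operator
from dataclasses import dataclass
from typing import Any, Callable, Dict


class ToyNode:
    """Hypothetical base class: operators build new tree nodes instead of computing."""

    def _binary(self, op: Callable[[Any, Any], Any], other: Any) -> "ToyBinary":
        # Plain Python values are wrapped as literal nodes.
        right = other if isinstance(other, ToyNode) else ToyLiteral(other)
        return ToyBinary(op, self, right)

    def __add__(self, other: Any) -> "ToyBinary":
        return self._binary(operator.add, other)

    def __mul__(self, other: Any) -> "ToyBinary":
        return self._binary(operator.mul, other)

    def __gt__(self, other: Any) -> "ToyBinary":
        return self._binary(operator.gt, other)


@dataclass(frozen=True)
class ToyColumn(ToyNode):
    name: str


@dataclass(frozen=True)
class ToyLiteral(ToyNode):
    value: Any


@dataclass(frozen=True)
class ToyBinary(ToyNode):
    op: Callable[[Any, Any], Any]
    left: ToyNode
    right: ToyNode


def evaluate(node: ToyNode, row: Dict[str, Any]) -> Any:
    """Walk the tree recursively and compute its value for a single row."""
    if isinstance(node, ToyColumn):
        return row[node.name]
    if isinstance(node, ToyLiteral):
        return node.value
    if isinstance(node, ToyBinary):
        return node.op(evaluate(node.left, row), evaluate(node.right, row))
    raise TypeError(f"Unsupported node type: {type(node).__name__}")


# Building the expression performs no computation; it only records structure.
expr = (ToyColumn("price") * ToyColumn("qty")) > 100

# Made-up sample rows for the illustration.
rows = [{"price": 30, "qty": 4}, {"price": 10, "qty": 5}]
results = [evaluate(expr, row) for row in rows]
```

In this toy model, `results` is `[True, False]`. The point of the sketch is the separation between describing an operation (the tree) and running it (the evaluator), which is what lets an engine inspect an expression before executing it.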
Expression namespaces
These namespace classes provide specialized operations for list, string, and struct columns.
You access them through properties on expressions: .list, .str, and .struct.
The following example shows how to use the string namespace to transform text columns:
import ray
from ray.data.expressions import col

# Create a dataset with a text column
ds = ray.data.from_items([
    {"name": "alice"},
    {"name": "bob"},
    {"name": "charlie"},
])

# Use the string namespace to uppercase the names
ds = ds.with_column("upper_name", col("name").str.upper())
ds.show()

{'name': 'alice', 'upper_name': 'ALICE'}
{'name': 'bob', 'upper_name': 'BOB'}
{'name': 'charlie', 'upper_name': 'CHARLIE'}
The following example demonstrates using the list namespace to work with array columns:
import ray
from ray.data.expressions import col

# Create a dataset with list columns
ds = ray.data.from_items([
    {"scores": [85, 90, 78]},
    {"scores": [92, 88]},
    {"scores": [76, 82, 88, 91]},
])

# Use the list namespace to get the length of each list
ds = ds.with_column("num_scores", col("scores").list.len())
ds.show()

{'scores': [85, 90, 78], 'num_scores': 3}
{'scores': [92, 88], 'num_scores': 2}
{'scores': [76, 82, 88, 91], 'num_scores': 4}
The following example shows how to use the struct namespace to access nested fields:
import ray
from ray.data.expressions import col

# Create a dataset with struct columns
ds = ray.data.from_items([
    {"user": {"name": "alice", "age": 25}},
    {"user": {"name": "bob", "age": 30}},
    {"user": {"name": "charlie", "age": 35}},
])

# Use the struct namespace to extract a specific field
ds = ds.with_column("user_name", col("user").struct.field("name"))
ds.show()

{'user': {'name': 'alice', 'age': 25}, 'user_name': 'alice'}
{'user': {'name': 'bob', 'age': 30}, 'user_name': 'bob'}
{'user': {'name': 'charlie', 'age': 35}, 'user_name': 'charlie'}
- class ray.data.expressions._ListNamespace(_expr: Expr)
Namespace for list operations on expression columns.
This namespace provides methods for operating on list-typed columns using PyArrow compute functions.
Example
>>> from ray.data.expressions import col
>>> # Get length of list column
>>> expr = col("items").list.len()
>>> # Get first item using method
>>> expr = col("items").list.get(0)
>>> # Get first item using indexing
>>> expr = col("items").list[0]
>>> # Slice list
>>> expr = col("items").list[1:3]
- get(index: int) -> UDFExpr
Get element at the specified index from each list.
- Parameters:
index – The index of the element to retrieve. Negative indices are supported.
- Returns:
UDFExpr that extracts the element at the given index.
- slice(start: int | None = None, stop: int | None = None, step: int | None = None) -> UDFExpr
Slice each list.
- Parameters:
start – Start index (inclusive). Defaults to 0.
stop – Stop index (exclusive). Defaults to list length.
step – Step size. Defaults to 1.
- Returns:
UDFExpr that extracts a slice from each list.
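The parameter descriptions for `get` and `slice` (negative indices, inclusive start, exclusive stop, default step of 1) read like Python's own indexing rules. The sketch below models the per-row behavior with plain Python lists under that assumption. The helpers `toy_list_get` and `toy_list_slice` are hypothetical names, and the sketch says nothing about how the real methods treat out-of-range indices or null lists.

```python
from typing import Any, List, Optional


def toy_list_get(values: List[Any], index: int) -> Any:
    # Negative indices count from the end of the list, as in Python.
    return values[index]


def toy_list_slice(
    values: List[Any],
    start: Optional[int] = None,
    stop: Optional[int] = None,
    step: Optional[int] = None,
) -> List[Any]:
    # None falls back to the documented defaults: 0, list length, and 1.
    return values[start:stop:step]


# Same sample lists as the "scores" example earlier in this section.
rows = [[85, 90, 78], [92, 88], [76, 82, 88, 91]]

first = [toy_list_get(r, 0) for r in rows]        # like .list.get(0) or .list[0]
last = [toy_list_get(r, -1) for r in rows]        # like .list.get(-1)
middle = [toy_list_slice(r, 1, 3) for r in rows]  # like .list[1:3]
```

Here `first` is `[85, 92, 76]`, `last` is `[78, 88, 91]`, and `middle` is `[[90, 78], [88], [82, 88]]`. The second row shows that a slice extending past the end of a Python list is simply shorter; whether the real method behaves the same way is not stated in this reference.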
- class ray.data.expressions._StringNamespace(_expr: Expr)
Namespace for string operations on expression columns.
This namespace provides methods for operating on string-typed columns using PyArrow compute functions.
Example
>>> from ray.data.expressions import col
>>> # Convert to uppercase
>>> expr = col("name").str.upper()
>>> # Get string length
>>> expr = col("name").str.len()
>>> # Check if string starts with a prefix
>>> expr = col("name").str.starts_with("A")
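This namespace lists both `match`, described as matching a SQL LIKE pattern, and `match_regex`, which takes a regular expression. The two pattern languages are easy to confuse, so the following standard-library sketch translates a LIKE pattern into a regex using the conventional SQL wildcards (`%` for any run of characters, `_` for exactly one character). The helpers `like_to_regex` and `like_match` are hypothetical, escape sequences are deliberately not handled, and the sketch is not the code path the namespace uses.

```python
import re


def like_to_regex(pattern: str) -> str:
    """Translate a SQL LIKE pattern into a regular expression (no escape handling)."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")           # any run of characters, including none
        elif ch == "_":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # every other character is literal
    return "".join(parts)


def like_match(value: str, pattern: str) -> bool:
    # A LIKE pattern must cover the whole string, hence fullmatch, not search.
    return re.fullmatch(like_to_regex(pattern), value, flags=re.DOTALL) is not None


names = ["alice", "bob", "charlie"]
contains_li = [like_match(n, "%li%") for n in names]
three_letters = [like_match(n, "b_b") for n in names]
dot_is_literal = like_match("abc", "a.c")
```

In this sketch `contains_li` is `[True, False, True]`, `three_letters` is `[False, True, False]`, and `dot_is_literal` is `False` because `.` has no special meaning in a LIKE pattern, whereas the regex `a.c` would match `abc`.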
- starts_with(pattern: str, *args: Any, **kwargs: Any) -> UDFExpr
Check if strings start with a pattern.
- ends_with(pattern: str, *args: Any, **kwargs: Any) -> UDFExpr
Check if strings end with a pattern.
- contains(pattern: str, *args: Any, **kwargs: Any) -> UDFExpr
Check if strings contain a substring.
- match(pattern: str, *args: Any, **kwargs: Any) -> UDFExpr
Match strings against a SQL LIKE pattern.
- find(pattern: str, *args: Any, **kwargs: Any) -> UDFExpr
Find the first occurrence of a substring.
- find_regex(pattern: str, *args: Any, **kwargs: Any) -> UDFExpr
Find the first occurrence matching a regex pattern.
- count_regex(pattern: str, *args: Any, **kwargs: Any) -> UDFExpr
Count occurrences matching a regex pattern.
- match_regex(pattern: str, *args: Any, **kwargs: Any) -> UDFExpr
Check if strings match a regex pattern.
- replace(pattern: str, replacement: str, *args: Any, **kwargs: Any) -> UDFExpr
Replace occurrences of a substring.
- replace_regex(pattern: str, replacement: str, *args: Any, **kwargs: Any) -> UDFExpr
Replace occurrences matching a regex pattern.
- replace_slice(start: int, stop: int, replacement: str, *args: Any, **kwargs: Any) -> UDFExpr
Replace a slice with a string.
- split_regex(pattern: str, *args: Any, **kwargs: Any) -> UDFExpr
Split strings by a regex pattern.
- extract(pattern: str, *args: Any, **kwargs: Any) -> UDFExpr
Extract a substring matching a regex pattern.
- center(width: int, padding: str = ' ', *args: Any, **kwargs: Any) -> UDFExpr
Center strings in a field of given width.
- strip(characters: str | None = None) -> UDFExpr
Remove leading and trailing whitespace or specified characters.
- Parameters:
characters – Characters to remove. If None, removes whitespace.
- Returns:
UDFExpr that strips characters from both ends.
- lstrip(characters: str | None = None) -> UDFExpr
Remove leading whitespace or specified characters.
- Parameters:
characters – Characters to remove. If None, removes whitespace.
- Returns:
UDFExpr that strips characters from the left.
- rstrip(characters: str | None = None) -> UDFExpr
Remove trailing whitespace or specified characters.
- Parameters:
characters – Characters to remove. If None, removes whitespace.
- Returns:
UDFExpr that strips characters from the right.
- pad(width: int, fillchar: str = ' ', side: Literal['left', 'right', 'both'] = 'right') -> UDFExpr
Pad strings to a specified width.
- Parameters:
width – Target width.
fillchar – Character to use for padding.
side – “left”, “right”, or “both” for padding side.
- Returns:
UDFExpr that pads strings.
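The `strip` and `pad` parameters can be pictured with a per-string model in plain Python. The sketch below assumes that `side` names the side on which fill characters are added, so `"left"` pads on the left and right-aligns the text, and that `"both"` puts any odd leftover character on the right. Both points are assumptions for illustration, not statements about the real methods. `toy_strip` and `toy_pad` are hypothetical helpers.

```python
from typing import Optional


def toy_strip(value: str, characters: Optional[str] = None) -> str:
    # With characters=None, Python's str.strip removes whitespace, matching
    # the documented default; otherwise it removes any of the given characters.
    return value.strip(characters)


def toy_pad(value: str, width: int, fillchar: str = " ", side: str = "right") -> str:
    # Strings already at or beyond the target width are returned unchanged.
    missing = max(width - len(value), 0)
    if side == "left":
        return fillchar * missing + value
    if side == "right":
        return value + fillchar * missing
    if side == "both":
        left = missing // 2  # assumption: any odd remainder goes on the right
        return fillchar * left + value + fillchar * (missing - left)
    raise ValueError(f"side must be 'left', 'right', or 'both', got {side!r}")


stripped_ws = toy_strip("  alice  ")
stripped_x = toy_strip("xxalicexx", "x")
padded_left = toy_pad("bob", 5, "*", side="left")
padded_right = toy_pad("bob", 5, "*")
padded_both = toy_pad("bob", 7, "*", side="both")
```

In this model the results are `"alice"`, `"alice"`, `"**bob"`, `"bob**"`, and `"**bob**"` respectively.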
- class ray.data.expressions._StructNamespace(_expr: Expr)
Namespace for struct operations on expression columns.
This namespace provides methods for operating on struct-typed columns using PyArrow compute functions.
Example
>>> from ray.data.expressions import col
>>> # Access a field using method
>>> expr = col("user_record").struct.field("age")
>>> # Access a field using bracket notation
>>> expr = col("user_record").struct["age"]
>>> # Access nested field
>>> expr = col("user_record").struct["address"].struct["city"]
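Chained access such as `.struct["address"].struct["city"]` can be pictured as one lookup per nesting level. The sketch below models that with nested Python dictionaries. The helper `toy_struct_field` and the sample records are made up for this illustration, and the sketch does not cover how the real namespace handles missing fields or null structs.

```python
from typing import Any, Dict


def toy_struct_field(record: Dict[str, Any], *path: str) -> Any:
    """Follow a sequence of field names down through nested dictionaries."""
    value: Any = record
    for name in path:
        value = value[name]  # one lookup per nesting level
    return value


# Made-up sample rows with a nested "address" struct.
rows = [
    {"user_record": {"age": 25, "address": {"city": "Paris"}}},
    {"user_record": {"age": 30, "address": {"city": "Lima"}}},
]

# Roughly analogous to col("user_record").struct["age"]
ages = [toy_struct_field(r["user_record"], "age") for r in rows]

# Roughly analogous to col("user_record").struct["address"].struct["city"]
cities = [toy_struct_field(r["user_record"], "address", "city") for r in rows]
```

Here `ages` is `[25, 30]` and `cities` is `['Paris', 'Lima']`.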