Expressions API

Expressions provide a way to specify column-based operations on datasets. Use col() to reference columns and lit() to create literal values. You can combine these with operators to create complex expressions for filtering, transformations, and computations.

Public API

star

Reference all columns from the input.

col

Reference an existing column by name.

lit

Create a literal expression from a constant value.

udf

Decorator to convert a UDF into an expression-compatible function.

pyarrow_udf

Decorator for PyArrow compute functions with automatic format conversion.

download

Create a download expression that downloads content from URIs.

Expression Classes

These classes represent the structure of expressions. You typically don’t need to instantiate them directly, but you may encounter them when working with expressions.

Expr

Base class for all expression nodes.

ColumnExpr

Expression that references a column by name.

LiteralExpr

Expression that represents a constant scalar value.

BinaryExpr

Expression that represents a binary operation between two expressions.

UnaryExpr

Expression that represents a unary operation on a single expression.

UDFExpr

Expression that represents a user-defined function call.

StarExpr

Expression that represents all columns from the input.
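To make the node types above more concrete, the following toy sketch shows how operator overloading can build a tree of column, literal, and binary nodes, and how such a tree can be evaluated against a row. It is written in plain Python and does not use Ray. The names ToyExpr, ToyCol, ToyLit, ToyBinary, and evaluate are hypothetical, and the real Expr hierarchy may be structured differently.

```python
import operator
from dataclasses import dataclass
from typing import Any, Dict


class ToyExpr:
    """Hypothetical base node; NOT Ray's Expr class."""

    def __add__(self, other: Any) -> "ToyBinary":
        return ToyBinary("add", self, _wrap(other))

    def __mul__(self, other: Any) -> "ToyBinary":
        return ToyBinary("mul", self, _wrap(other))

    def __gt__(self, other: Any) -> "ToyBinary":
        return ToyBinary("gt", self, _wrap(other))


@dataclass(eq=False)
class ToyCol(ToyExpr):
    name: str  # column to look up in each row


@dataclass(eq=False)
class ToyLit(ToyExpr):
    value: Any  # constant scalar


@dataclass(eq=False)
class ToyBinary(ToyExpr):
    op: str
    left: ToyExpr
    right: ToyExpr


def _wrap(value: Any) -> ToyExpr:
    # Plain Python values become literal nodes automatically.
    return value if isinstance(value, ToyExpr) else ToyLit(value)


_OPS = {"add": operator.add, "mul": operator.mul, "gt": operator.gt}


def evaluate(expr: ToyExpr, row: Dict[str, Any]) -> Any:
    """Recursively evaluate a toy expression tree against one row."""
    if isinstance(expr, ToyCol):
        return row[expr.name]
    if isinstance(expr, ToyLit):
        return expr.value
    if isinstance(expr, ToyBinary):
        return _OPS[expr.op](evaluate(expr.left, row), evaluate(expr.right, row))
    raise TypeError(f"unsupported node: {expr!r}")


# Building the expression does no work; it only records the structure.
expr = (ToyCol("price") * ToyCol("qty")) > 100
rows = [{"price": 30, "qty": 2}, {"price": 60, "qty": 2}]
results = [evaluate(expr, row) for row in rows]
print(results)  # [False, True]
```

The point of the sketch is that an expression is a description of a computation, which is why you usually obtain these nodes by combining columns and literals with operators rather than by instantiating them directly.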

Expression namespaces

These namespace classes provide specialized operations for list, string, and struct columns. You access them through properties on expressions: .list, .str, and .struct.

The following example shows how to use the string namespace to transform text columns:

import ray
from ray.data.expressions import col

# Create a dataset with a text column
ds = ray.data.from_items([
    {"name": "alice"},
    {"name": "bob"},
    {"name": "charlie"}
])

# Use the string namespace to uppercase the names
ds = ds.with_column("upper_name", col("name").str.upper())
ds.show()
{'name': 'alice', 'upper_name': 'ALICE'}
{'name': 'bob', 'upper_name': 'BOB'}
{'name': 'charlie', 'upper_name': 'CHARLIE'}

The following example demonstrates using the list namespace to work with array columns:

import ray
from ray.data.expressions import col

# Create a dataset with list columns
ds = ray.data.from_items([
    {"scores": [85, 90, 78]},
    {"scores": [92, 88]},
    {"scores": [76, 82, 88, 91]}
])

# Use the list namespace to get the length of each list
ds = ds.with_column("num_scores", col("scores").list.len())
ds.show()
{'scores': [85, 90, 78], 'num_scores': 3}
{'scores': [92, 88], 'num_scores': 2}
{'scores': [76, 82, 88, 91], 'num_scores': 4}

The following example shows how to use the struct namespace to access nested fields:

import ray
from ray.data.expressions import col

# Create a dataset with struct columns
ds = ray.data.from_items([
    {"user": {"name": "alice", "age": 25}},
    {"user": {"name": "bob", "age": 30}},
    {"user": {"name": "charlie", "age": 35}}
])

# Use the struct namespace to extract a specific field
ds = ds.with_column("user_name", col("user").struct.field("name"))
ds.show()
{'user': {'name': 'alice', 'age': 25}, 'user_name': 'alice'}
{'user': {'name': 'bob', 'age': 30}, 'user_name': 'bob'}
{'user': {'name': 'charlie', 'age': 35}, 'user_name': 'charlie'}
class ray.data.expressions._ListNamespace(_expr: Expr)

Namespace for list operations on expression columns.

This namespace provides methods for operating on list-typed columns using PyArrow compute functions.

Example

>>> from ray.data.expressions import col
>>> # Get length of list column
>>> expr = col("items").list.len()
>>> # Get first item using method
>>> expr = col("items").list.get(0)
>>> # Get first item using indexing
>>> expr = col("items").list[0]
>>> # Slice list
>>> expr = col("items").list[1:3]
len() -> UDFExpr

Get the length of each list.

get(index: int) -> UDFExpr

Get element at the specified index from each list.

Parameters:

index – The index of the element to retrieve. Negative indices are supported.

Returns:

UDFExpr that extracts the element at the given index.

slice(start: int | None = None, stop: int | None = None, step: int | None = None) -> UDFExpr

Slice each list.

Parameters:
  • start – Start index (inclusive). Defaults to 0.

  • stop – Stop index (exclusive). Defaults to list length.

  • step – Step size. Defaults to 1.

Returns:

UDFExpr that extracts a slice from each list.
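As a rough mental model for get and slice, the sketch below applies ordinary Python indexing and slicing to each row. It assumes the per-row semantics match Python's (negative indices count from the end, stop is exclusive). Behavior for out-of-range indices is not covered. The helpers list_get and list_slice are hypothetical plain-Python stand-ins, not part of Ray.

```python
from typing import Any, List, Optional


def list_get(rows: List[List[Any]], index: int) -> List[Any]:
    """Hypothetical stand-in: pick one element from every row."""
    return [row[index] for row in rows]


def list_slice(
    rows: List[List[Any]],
    start: Optional[int] = None,
    stop: Optional[int] = None,
    step: Optional[int] = None,
) -> List[List[Any]]:
    """Hypothetical stand-in: slice every row with the same bounds."""
    return [row[start:stop:step] for row in rows]


rows = [[85, 90, 78], [92, 88], [76, 82, 88, 91]]

first = list_get(rows, 0)         # first element of each list
last = list_get(rows, -1)         # negative index counts from the end
middle = list_slice(rows, 1, 3)   # start inclusive, stop exclusive

print(first)   # [85, 92, 76]
print(last)    # [78, 88, 91]
print(middle)  # [[90, 78], [88], [82, 88]]
```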

class ray.data.expressions._StringNamespace(_expr: Expr)

Namespace for string operations on expression columns.

This namespace provides methods for operating on string-typed columns using PyArrow compute functions.

Example

>>> from ray.data.expressions import col
>>> # Convert to uppercase
>>> expr = col("name").str.upper()
>>> # Get string length
>>> expr = col("name").str.len()
>>> # Check if string starts with a prefix
>>> expr = col("name").str.starts_with("A")
len() -> UDFExpr

Get the length of each string in characters.

byte_len() -> UDFExpr

Get the length of each string in bytes.
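The difference between len and byte_len only shows up for non-ASCII text. The plain-Python sketch below illustrates it, assuming the strings are stored as UTF-8 (as Arrow string columns are). It does not call Ray.

```python
# "h\u00e9llo" is "hello" with an e-acute; "\u65e5\u672c" is two CJK characters.
names = ["alice", "h\u00e9llo", "\u65e5\u672c"]

# Length in characters (Unicode code points).
char_lens = [len(s) for s in names]

# Length in bytes once encoded as UTF-8.
byte_lens = [len(s.encode("utf-8")) for s in names]

print(char_lens)  # [5, 5, 2]
print(byte_lens)  # [5, 6, 6]
```

For ASCII-only data the two values are identical, so the choice matters mainly when sizing buffers or enforcing byte limits on multilingual text.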

upper() -> UDFExpr

Convert strings to uppercase.

lower() -> UDFExpr

Convert strings to lowercase.

capitalize() -> UDFExpr

Capitalize the first character of each string.

title() -> UDFExpr

Convert strings to title case.

swapcase() -> UDFExpr

Swap the case of each character.

is_alpha() -> UDFExpr

Check if strings contain only alphabetic characters.

is_alnum() -> UDFExpr

Check if strings contain only alphanumeric characters.

is_digit() -> UDFExpr

Check if strings contain only digits.

is_decimal() -> UDFExpr

Check if strings contain only decimal characters.

is_numeric() -> UDFExpr

Check if strings contain only numeric characters.
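The three predicates is_digit, is_decimal, and is_numeric sound alike, so it helps to see where such categories diverge. The sketch below uses Python's built-in str methods, which draw the same three-way distinction. Whether the PyArrow-backed methods agree with Python on every code point is an assumption worth verifying against your data.

```python
samples = {
    "plain five": "5",
    "superscript two": "\u00b2",
    "vulgar fraction one half": "\u00bd",
    "letter": "a",
}

# One (decimal, digit, numeric) triple per sample, using Python's str methods.
classified = {
    label: (s.isdecimal(), s.isdigit(), s.isnumeric())
    for label, s in samples.items()
}

for label, flags in classified.items():
    print(f"{label}: decimal={flags[0]} digit={flags[1]} numeric={flags[2]}")
```

In Python's classification, decimal characters are a subset of digits, and digits are a subset of numeric characters, which is why the superscript and the fraction pass progressively fewer checks.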

is_space() -> UDFExpr

Check if strings contain only whitespace.

is_lower() -> UDFExpr

Check if strings are lowercase.

is_upper() -> UDFExpr

Check if strings are uppercase.

is_title() -> UDFExpr

Check if strings are title-cased.

is_printable() -> UDFExpr

Check if strings contain only printable characters.

is_ascii() -> UDFExpr

Check if strings contain only ASCII characters.

starts_with(pattern: str, *args: Any, **kwargs: Any) -> UDFExpr

Check if strings start with a pattern.

ends_with(pattern: str, *args: Any, **kwargs: Any) -> UDFExpr

Check if strings end with a pattern.

contains(pattern: str, *args: Any, **kwargs: Any) -> UDFExpr

Check if strings contain a substring.

match(pattern: str, *args: Any, **kwargs: Any) -> UDFExpr

Match strings against a SQL LIKE pattern.
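For readers less familiar with SQL LIKE syntax: % matches any run of characters (including none), _ matches exactly one character, and the pattern must cover the whole string. The sketch below translates a LIKE pattern into a regular expression to show these rules in plain Python. The helpers like_to_regex and like_match are hypothetical, escape sequences are deliberately not handled, and this is not how the library implements match.

```python
import re


def like_to_regex(pattern: str) -> str:
    """Hypothetical helper: translate a LIKE pattern into a regex string."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # any run of characters, possibly empty
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return "".join(parts)


def like_match(value: str, pattern: str) -> bool:
    """Return True if the whole value matches the LIKE pattern."""
    return re.fullmatch(like_to_regex(pattern), value, flags=re.DOTALL) is not None


names = ["alice", "bob", "charlie"]
starts_with_a = [like_match(n, "a%") for n in names]
three_letters = [like_match(n, "___") for n in names]
contains_li = [like_match(n, "%li%") for n in names]

print(starts_with_a)  # [True, False, False]
print(three_letters)  # [False, True, False]
print(contains_li)    # [True, False, True]
```

Note that characters with special meaning in regular expressions, such as the period, are literal in a LIKE pattern, which is the main practical difference from match_regex.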

find(pattern: str, *args: Any, **kwargs: Any) -> UDFExpr

Find the first occurrence of a substring.

count(pattern: str, *args: Any, **kwargs: Any) -> UDFExpr

Count occurrences of a substring.

find_regex(pattern: str, *args: Any, **kwargs: Any) -> UDFExpr

Find the first occurrence matching a regex pattern.

count_regex(pattern: str, *args: Any, **kwargs: Any) -> UDFExpr

Count occurrences matching a regex pattern.

match_regex(pattern: str, *args: Any, **kwargs: Any) -> UDFExpr

Check if strings match a regex pattern.

reverse() -> UDFExpr

Reverse each string.

slice(*args: Any, **kwargs: Any) -> UDFExpr

Slice strings by code unit indices.

replace(pattern: str, replacement: str, *args: Any, **kwargs: Any) -> UDFExpr

Replace occurrences of a substring.

replace_regex(pattern: str, replacement: str, *args: Any, **kwargs: Any) -> UDFExpr

Replace occurrences matching a regex pattern.

replace_slice(start: int, stop: int, replacement: str, *args: Any, **kwargs: Any) -> UDFExpr

Replace a slice with a string.

split(pattern: str, *args: Any, **kwargs: Any) -> UDFExpr

Split strings by a pattern.

split_regex(pattern: str, *args: Any, **kwargs: Any) -> UDFExpr

Split strings by a regex pattern.

split_whitespace(*args: Any, **kwargs: Any) -> UDFExpr

Split strings on whitespace.

extract(pattern: str, *args: Any, **kwargs: Any) -> UDFExpr

Extract a substring matching a regex pattern.

repeat(n: int, *args: Any, **kwargs: Any) -> UDFExpr

Repeat each string n times.

center(width: int, padding: str = ' ', *args: Any, **kwargs: Any) -> UDFExpr

Center strings in a field of given width.

strip(characters: str | None = None) -> UDFExpr

Remove leading and trailing whitespace or specified characters.

Parameters:

characters – Characters to remove. If None, removes whitespace.

Returns:

UDFExpr that strips characters from both ends.

lstrip(characters: str | None = None) -> UDFExpr

Remove leading whitespace or specified characters.

Parameters:

characters – Characters to remove. If None, removes whitespace.

Returns:

UDFExpr that strips characters from the left.

rstrip(characters: str | None = None) -> UDFExpr

Remove trailing whitespace or specified characters.

Parameters:

characters – Characters to remove. If None, removes whitespace.

Returns:

UDFExpr that strips characters from the right.
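One detail that often surprises people is how the characters argument is interpreted. The sketch below uses Python's own str methods and assumes the expression methods follow the same convention, where characters is a set of individual characters to remove rather than a prefix or suffix to match. That assumption is not confirmed by the reference above.

```python
raw = ["xxhixy", "  padded  ", "yyxdatax"]

# With an explicit character set, any mix of 'x' and 'y' is removed.
both = [s.strip("xy") for s in raw]
left = [s.lstrip("xy") for s in raw]
right = [s.rstrip("xy") for s in raw]

# With no argument, only whitespace is removed.
whitespace = [s.strip() for s in raw]

print(both)        # ['hi', '  padded  ', 'data']
print(left)        # ['hixy', '  padded  ', 'datax']
print(right)       # ['xxhi', '  padded  ', 'yyxdata']
print(whitespace)  # ['xxhixy', 'padded', 'yyxdatax']
```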

pad(width: int, fillchar: str = ' ', side: Literal['left', 'right', 'both'] = 'right') -> UDFExpr

Pad strings to a specified width.

Parameters:
  • width – Target width.

  • fillchar – Character to use for padding.

  • side – Side on which to add padding: “left”, “right”, or “both”. Defaults to “right”.

Returns:

UDFExpr that pads strings.
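The side parameter is easiest to understand by example. The sketch below is a plain-Python stand-in that assumes side names the side on which fill characters are added (so the default "right" keeps the text left-aligned) and that strings already at or beyond width are returned unchanged. The helper pad_value is hypothetical and is not the library's implementation.

```python
def pad_value(value: str, width: int, fillchar: str = " ", side: str = "right") -> str:
    """Hypothetical stand-in for padding a single string."""
    if side == "left":
        return value.rjust(width, fillchar)   # fill added on the left
    if side == "right":
        return value.ljust(width, fillchar)   # fill added on the right
    if side == "both":
        return value.center(width, fillchar)  # fill split across both sides
    raise ValueError(f"unknown side: {side!r}")


padded_right = pad_value("ab", 4, "*")
padded_left = pad_value("ab", 4, "*", side="left")
padded_both = pad_value("ab", 6, "*", side="both")
unchanged = pad_value("abcdef", 4, "*")

print(padded_right)  # ab**
print(padded_left)   # **ab
print(padded_both)   # **ab**
print(unchanged)     # abcdef
```

When the amount of padding is odd, implementations can differ on which side receives the extra fill character, so the example sticks to an even split.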

class ray.data.expressions._StructNamespace(_expr: Expr)

Namespace for struct operations on expression columns.

This namespace provides methods for operating on struct-typed columns using PyArrow compute functions.

Example

>>> from ray.data.expressions import col
>>> # Access a field using method
>>> expr = col("user_record").struct.field("age")
>>> # Access a field using bracket notation
>>> expr = col("user_record").struct["age"]
>>> # Access nested field
>>> expr = col("user_record").struct["address"].struct["city"]
field(field_name: str) -> UDFExpr

Extract a field from a struct.

Parameters:

field_name – The name of the field to extract.

Returns:

UDFExpr that extracts the specified field from each struct.