ray.data.preprocessors.CountVectorizer

class ray.data.preprocessors.CountVectorizer(columns: List[str], tokenization_fn: Callable[[str], List[str]] | None = None, max_features: int | None = None, *, output_columns: List[str] | None = None)

Bases: Preprocessor

Count the frequency of tokens in a column of strings.

CountVectorizer operates on columns that contain strings. For example:

                corpus
0    I dislike Python
1       I like Python

This preprocessor creates a list column for each input column. Each list contains the frequency counts of tokens in order of their first appearance. For example:

            corpus
0    [1, 1, 1, 0]  # Counts for [I, dislike, Python, like]
1    [1, 0, 1, 1]  # Counts for [I, dislike, Python, like]
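
Conceptually, the counting amounts to building a shared vocabulary and then emitting one count vector per row. The snippet below only illustrates that idea with collections.Counter; it is not the preprocessor's actual implementation, and it assumes the default whitespace tokenizer.

>>> from collections import Counter
>>> corpus = ["I dislike Python", "I like Python"]
>>> tokens_per_row = [s.split(" ") for s in corpus]
>>> vocabulary = []
>>> for tokens in tokens_per_row:
...     for token in tokens:
...         if token not in vocabulary:
...             vocabulary.append(token)  # order of first appearance
>>> vocabulary
['I', 'dislike', 'Python', 'like']
>>> [[Counter(tokens)[token] for token in vocabulary] for tokens in tokens_per_row]
[[1, 1, 1, 0], [1, 0, 1, 1]]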

Examples

>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import CountVectorizer
>>>
>>> df = pd.DataFrame({
...     "corpus": [
...         "Jimmy likes volleyball",
...         "Bob likes volleyball too",
...         "Bob also likes fruit jerky"
...     ]
... })
>>> ds = ray.data.from_pandas(df)  
>>>
>>> vectorizer = CountVectorizer(["corpus"])
>>> vectorizer.fit_transform(ds).to_pandas()  
                     corpus
0  [1, 0, 1, 1, 0, 0, 0, 0]
1  [1, 1, 1, 0, 0, 0, 0, 1]
2  [1, 1, 0, 0, 1, 1, 1, 0]

You can limit the number of tokens in the vocabulary with max_features.

>>> vectorizer = CountVectorizer(["corpus"], max_features=3)
>>> vectorizer.fit_transform(ds).to_pandas()  
      corpus
0  [1, 0, 1]
1  [1, 1, 1]
2  [1, 1, 0]

CountVectorizer can also be used in append mode by providing output_columns, the names of the columns that will hold the encoded values.

>>> vectorizer = CountVectorizer(["corpus"], output_columns=["corpus_counts"])
>>> vectorizer.fit_transform(ds).to_pandas()  
                       corpus             corpus_counts
0      Jimmy likes volleyball  [1, 0, 1, 1, 0, 0, 0, 0]
1    Bob likes volleyball too  [1, 1, 1, 0, 0, 0, 0, 1]
2  Bob also likes fruit jerky  [1, 1, 0, 0, 1, 1, 1, 0]

Parameters:
  • columns – The columns to separately tokenize and count.

  • tokenization_fn – The function used to generate tokens. This function should accept a string as input and return a list of tokens as output. If unspecified, the tokenizer uses a function equivalent to lambda s: s.split(" "). An illustrative custom tokenizer is sketched after this parameter list.

  • max_features – The maximum number of tokens to encode in the transformed dataset. If specified, only the most frequent tokens are encoded.

  • output_columns – The names of the transformed columns. If None, the transformed columns will be the same as the input columns. If not None, the length of output_columns must match the length of columns; otherwise, an error is raised.
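
For example, a custom tokenizer can normalize text before counting. The helper below is purely illustrative (it is not part of the API) and assumes you want lowercased, punctuation-free whitespace tokens; it reuses the ds dataset from the examples above.

>>> import string
>>> def lowercase_tokenizer(text):
...     # Lowercase, strip punctuation, then split on whitespace.
...     cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
...     return cleaned.split()
>>> vectorizer = CountVectorizer(["corpus"], tokenization_fn=lowercase_tokenizer)
>>> transformed = vectorizer.fit_transform(ds)  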

PublicAPI (alpha): This API is in alpha and may change before becoming stable.

Methods

deserialize
    Load the original preprocessor serialized via self.serialize().

fit
    Fit this Preprocessor to the Dataset.

fit_transform
    Fit this Preprocessor to the Dataset and then transform the Dataset.

preferred_batch_format
    Batch format hint so upstream producers can try to yield the best block format.

serialize
    Return this preprocessor serialized as a string.

transform
    Transform the given dataset.

transform_batch
    Transform a single batch of data.
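
As a rough sketch of how fit, serialize, and deserialize work together (assuming deserialize accepts the string returned by serialize, and reusing the ds dataset from the examples above):

>>> vectorizer = CountVectorizer(["corpus"])
>>> fitted = vectorizer.fit(ds)  
>>> serialized = fitted.serialize()  
>>> restored = CountVectorizer.deserialize(serialized)  
>>> restored_df = restored.transform(ds).to_pandas()  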