ray.data.preprocessors.CountVectorizer

class ray.data.preprocessors.CountVectorizer(columns: List[str], tokenization_fn: Callable[[str], List[str]] | None = None, max_features: int | None = None, *, output_columns: List[str] | None = None)

Bases: Preprocessor

Count the frequency of tokens in a column of strings.

CountVectorizer operates on columns that contain strings. For example:

                corpus
0    I dislike Python
1       I like Python

This preprocessor creates a list column for each input column. Each list contains the frequency counts of tokens in order of their first appearance. For example:

            corpus
0    [1, 1, 1, 0]  # Counts for [I, dislike, Python, like]
1    [1, 0, 1, 1]  # Counts for [I, dislike, Python, like]
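
Conceptually, the counting amounts to building a shared vocabulary and then emitting one count vector per row. The snippet below only illustrates that idea with collections.Counter; it is not the preprocessor's actual implementation, and it assumes the default whitespace tokenizer.

>>> from collections import Counter
>>> corpus = ["I dislike Python", "I like Python"]
>>> tokens_per_row = [s.split(" ") for s in corpus]
>>> vocabulary = []
>>> for tokens in tokens_per_row:
...     for token in tokens:
...         if token not in vocabulary:
...             vocabulary.append(token)  # order of first appearance
>>> vocabulary
['I', 'dislike', 'Python', 'like']
>>> [[Counter(tokens)[token] for token in vocabulary] for tokens in tokens_per_row]
[[1, 1, 1, 0], [1, 0, 1, 1]]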

Examples

>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import CountVectorizer
>>>
>>> df = pd.DataFrame({
...     "corpus": [
...         "Jimmy likes volleyball",
...         "Bob likes volleyball too",
...         "Bob also likes fruit jerky"
...     ]
... })
>>> ds = ray.data.from_pandas(df)  
>>>
>>> vectorizer = CountVectorizer(["corpus"])
>>> vectorizer.fit_transform(ds).to_pandas()  
                     corpus
0  [1, 0, 1, 1, 0, 0, 0, 0]
1  [1, 1, 1, 0, 0, 0, 0, 1]
2  [1, 1, 0, 0, 1, 1, 1, 0]

You can limit the number of tokens in the vocabulary with max_features.

>>> vectorizer = CountVectorizer(["corpus"], max_features=3)
>>> vectorizer.fit_transform(ds).to_pandas()  
      corpus
0  [1, 0, 1]
1  [1, 1, 1]
2  [1, 1, 0]

CountVectorizer can also be used in append mode by providing output_columns, the names of the columns that will hold the encoded values.

>>> vectorizer = CountVectorizer(["corpus"], output_columns=["corpus_counts"])
>>> vectorizer.fit_transform(ds).to_pandas()  
                       corpus             corpus_counts
0      Jimmy likes volleyball  [1, 0, 1, 1, 0, 0, 0, 0]
1    Bob likes volleyball too  [1, 1, 1, 0, 0, 0, 0, 1]
2  Bob also likes fruit jerky  [1, 1, 0, 0, 1, 1, 1, 0]

Parameters:
  • columns – The columns to separately tokenize and count.

  • tokenization_fn – The function used to generate tokens. This function should accept a string as input and return a list of tokens as output. If unspecified, the tokenizer uses a function equivalent to lambda s: s.split(" "). An illustrative custom tokenizer is sketched after this parameter list.

  • max_features – The maximum number of tokens to encode in the transformed dataset. If specified, only the most frequent tokens are encoded.

  • output_columns – The names of the transformed columns. If None, the transformed columns will be the same as the input columns. If not None, the length of output_columns must match the length of columns; otherwise, an error is raised.
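
For example, a custom tokenizer can normalize text before counting. The helper below is purely illustrative (it is not part of the API) and assumes you want lowercased, punctuation-free whitespace tokens; it reuses the ds dataset from the examples above.

>>> import string
>>> def lowercase_tokenizer(text):
...     # Lowercase, strip punctuation, then split on whitespace.
...     cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
...     return cleaned.split()
>>> vectorizer = CountVectorizer(["corpus"], tokenization_fn=lowercase_tokenizer)
>>> transformed = vectorizer.fit_transform(ds)  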

PublicAPI (alpha): This API is in alpha and may change before becoming stable.

Methods

deserialize
    Load the original preprocessor serialized via self.serialize().

fit
    Fit this Preprocessor to the Dataset.

fit_transform
    Fit this Preprocessor to the Dataset and then transform the Dataset.

preferred_batch_format
    Batch format hint so upstream producers can try to yield the best block format.

serialize
    Return this preprocessor serialized as a string.

transform
    Transform the given dataset.

transform_batch
    Transform a single batch of data.
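
As a rough sketch of how fit, serialize, and deserialize work together (assuming deserialize accepts the string returned by serialize, and reusing the ds dataset from the examples above):

>>> vectorizer = CountVectorizer(["corpus"])
>>> fitted = vectorizer.fit(ds)  
>>> serialized = fitted.serialize()  
>>> restored = CountVectorizer.deserialize(serialized)  
>>> restored_df = restored.transform(ds).to_pandas()  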