ray.data.preprocessors.CountVectorizer

class ray.data.preprocessors.CountVectorizer(columns: List[str], tokenization_fn: Callable[[str], List[str]] | None = None, max_features: int | None = None, *, output_columns: List[str] | None = None)

Bases: Preprocessor
Count the frequency of tokens in a column of strings.
CountVectorizer operates on columns that contain strings. For example:

             corpus
0  I dislike Python
1     I like Python

This preprocessor creates a list column for each input column. Each list contains the frequency counts of tokens in order of their first appearance. For example:

         corpus
0  [1, 1, 1, 0]   # Counts for [I, dislike, Python, like]
1  [1, 0, 1, 1]   # Counts for [I, dislike, Python, like]
Examples
>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import CountVectorizer
>>>
>>> df = pd.DataFrame({
...     "corpus": [
...         "Jimmy likes volleyball",
...         "Bob likes volleyball too",
...         "Bob also likes fruit jerky"
...     ]
... })
>>> ds = ray.data.from_pandas(df)
>>>
>>> vectorizer = CountVectorizer(["corpus"])
>>> vectorizer.fit_transform(ds).to_pandas()
                     corpus
0  [1, 0, 1, 1, 0, 0, 0, 0]
1  [1, 1, 1, 0, 0, 0, 0, 1]
2  [1, 1, 0, 0, 1, 1, 1, 0]
You can limit the number of tokens in the vocabulary with max_features.

>>> vectorizer = CountVectorizer(["corpus"], max_features=3)
>>> vectorizer.fit_transform(ds).to_pandas()
      corpus
0  [1, 0, 1]
1  [1, 1, 1]
2  [1, 1, 0]
CountVectorizer can also be used in append mode by providing the names of the output_columns that should hold the encoded values.

>>> vectorizer = CountVectorizer(["corpus"], output_columns=["corpus_counts"])
>>> vectorizer.fit_transform(ds).to_pandas()
                       corpus             corpus_counts
0      Jimmy likes volleyball  [1, 0, 1, 1, 0, 0, 0, 0]
1    Bob likes volleyball too  [1, 1, 1, 0, 0, 0, 0, 1]
2  Bob also likes fruit jerky  [1, 1, 0, 0, 1, 1, 1, 0]
Parameters:

- columns – The columns to separately tokenize and count.
- tokenization_fn – The function used to generate tokens. This function should accept a string as input and return a list of tokens as output. If unspecified, the tokenizer uses a function equivalent to lambda s: s.split(" "). See the sketch after this list for a custom tokenizer.
- max_features – The maximum number of tokens to encode in the transformed dataset. If specified, only the most frequent tokens are encoded.
- output_columns – The names of the transformed columns. If None, the transformed columns are the same as the input columns. If not None, the length of output_columns must match the length of columns; otherwise an error is raised.
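For instance, tokenization_fn can normalize text before counting. A minimal sketch, assuming the ds from the examples above; the regex and helper name below are illustrative, not part of the Ray API:

>>> import re
>>> def lowercase_tokens(s: str) -> list:
...     # Hypothetical tokenizer: lowercase, then split on non-word characters.
...     return [t for t in re.split(r"\W+", s.lower()) if t]
>>> vectorizer = CountVectorizer(["corpus"], tokenization_fn=lowercase_tokens)
>>> counts = vectorizer.fit_transform(ds)  # tokens like "Bob" and "bob" now map to one entry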
PublicAPI (alpha): This API is in alpha and may change before becoming stable.
Methods
- deserialize – Load the original preprocessor serialized via self.serialize().
- fit – Fit this Preprocessor to the Dataset.
- fit_transform – Fit this Preprocessor to the Dataset and then transform the Dataset.
- preferred_batch_format – Batch format hint for upstream producers to try yielding best block format.
- serialize – Return this preprocessor serialized as a string.
- transform – Transform the given dataset.
- transform_batch – Transform a single batch of data.
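As a sketch of how these methods compose, continuing the examples above (the variable names are illustrative, and the exact call pattern of deserialize should be checked against the Preprocessor base class docs):

>>> fitted = CountVectorizer(["corpus"]).fit(ds)
>>> payload = fitted.serialize()                     # string payload, e.g. for checkpointing
>>> restored = CountVectorizer.deserialize(payload)  # reload the fitted preprocessor
>>> batch = df.head(2)                               # a small in-memory pandas batch
>>> encoded = restored.transform_batch(batch)        # transform one batch without a Dataset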