ray.data.preprocessors.CountVectorizer
- class ray.data.preprocessors.CountVectorizer(columns: List[str], tokenization_fn: Callable[[str], List[str]] | None = None, max_features: int | None = None, *, output_columns: List[str] | None = None)[source]

Bases: SerializablePreprocessorBase

Count the frequency of tokens in a column of strings.

CountVectorizer operates on columns that contain strings. For example:

             corpus
0  I dislike Python
1     I like Python
This preprocessor creates a list column for each input column. Each list contains the frequency counts of tokens in order of their first appearance. For example:
         corpus
0  [1, 1, 1, 0]   # Counts for [I, dislike, Python, like]
1  [1, 0, 1, 1]   # Counts for [I, dislike, Python, like]
Examples
>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import CountVectorizer
>>>
>>> df = pd.DataFrame({
...     "corpus": [
...         "Jimmy likes volleyball",
...         "Bob likes volleyball too",
...         "Bob also likes fruit jerky"
...     ]
... })
>>> ds = ray.data.from_pandas(df)
>>>
>>> vectorizer = CountVectorizer(["corpus"])
>>> vectorizer.fit_transform(ds).to_pandas()
                     corpus
0  [1, 0, 1, 1, 0, 0, 0, 0]
1  [1, 1, 1, 0, 0, 0, 0, 1]
2  [1, 1, 0, 0, 1, 1, 1, 0]
You can limit the number of tokens in the vocabulary with max_features.

>>> vectorizer = CountVectorizer(["corpus"], max_features=3)
>>> vectorizer.fit_transform(ds).to_pandas()
      corpus
0  [1, 0, 1]
1  [1, 1, 1]
2  [1, 1, 0]
CountVectorizer can also be used in append mode by providing the name of the output_columns that should hold the encoded values.

>>> vectorizer = CountVectorizer(["corpus"], output_columns=["corpus_counts"])
>>> vectorizer.fit_transform(ds).to_pandas()
                       corpus             corpus_counts
0      Jimmy likes volleyball  [1, 0, 1, 1, 0, 0, 0, 0]
1    Bob likes volleyball too  [1, 1, 1, 0, 0, 0, 0, 1]
2  Bob also likes fruit jerky  [1, 1, 0, 0, 1, 1, 1, 0]
- Parameters:
columns – The columns to separately tokenize and count.
tokenization_fn – The function used to generate tokens. This function should accept a string as input and return a list of tokens as output. If unspecified, the tokenizer uses a function equivalent to lambda s: s.split(" ").

max_features – The maximum number of tokens to encode in the transformed dataset. If specified, only the most frequent tokens are encoded.
output_columns – The names of the transformed columns. If None, the transformed columns will be the same as the input columns. If not None, the length of output_columns must match the length of columns, otherwise an error will be raised.
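If plain whitespace splitting is too coarse, you can pass a custom tokenization_fn. The sketch below is illustrative (the lowercase_tokens helper is not part of Ray): it lowercases each string before splitting, so tokens that differ only in case are counted together.

>>> def lowercase_tokens(s: str) -> list:
...     # Lowercase before splitting so "Bob" and "bob" count as one token.
...     return s.lower().split(" ")
>>>
>>> vectorizer = CountVectorizer(["corpus"], tokenization_fn=lowercase_tokens)
>>> counts = vectorizer.fit_transform(ds).to_pandas()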
PublicAPI (alpha): This API is in alpha and may change before becoming stable.
Methods
- Deserialize a preprocessor from serialized data.
- Fit this Preprocessor to the Dataset.
- Fit this Preprocessor to the Dataset and then transform the Dataset.
- Get the preprocessor class identifier for this preprocessor class.
- Get the version number for this preprocessor class.
- Batch format hint for upstream producers to try yielding best block format.
- Serialize this preprocessor to a string or bytes.
- Set the preprocessor class identifier for this preprocessor class.
- Set the version number for this preprocessor class.
- Transform the given dataset.
- Transform a single batch of data.
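Taken together, a typical round trip fits the vectorizer once, serializes it for storage, and restores it later for inference. A minimal sketch based on the method descriptions above (exact signatures, e.g. whether deserialize is a classmethod, may differ; check the stable API before relying on this):

>>> vectorizer = CountVectorizer(["corpus"]).fit(ds)
>>> serialized = vectorizer.serialize()
>>> restored = CountVectorizer.deserialize(serialized)
>>> transformed = restored.transform(ds)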