ray.data.preprocessors.CountVectorizer
class ray.data.preprocessors.CountVectorizer(columns: List[str], tokenization_fn: Optional[Callable[[str], List[str]]] = None, max_features: Optional[int] = None)
Bases: ray.data.preprocessor.Preprocessor
Count the frequency of tokens in a column of strings.
CountVectorizer operates on columns that contain strings. For example:

             corpus
0  I dislike Python
1     I like Python
This preprocessor creates a column named {column}_{token} for each unique token. These columns represent the frequency of token {token} in column {column}. For example:

   corpus_I  corpus_Python  corpus_dislike  corpus_like
0         1              1               1            0
1         1              1               0            1
Examples
>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import CountVectorizer
>>>
>>> df = pd.DataFrame({
...     "corpus": [
...         "Jimmy likes volleyball",
...         "Bob likes volleyball too",
...         "Bob also likes fruit jerky"
...     ]
... })
>>> ds = ray.data.from_pandas(df)
>>>
>>> vectorizer = CountVectorizer(["corpus"])
>>> vectorizer.fit_transform(ds).to_pandas()
   corpus_likes  corpus_volleyball  corpus_Bob  corpus_Jimmy  corpus_too  corpus_also  corpus_fruit  corpus_jerky
0             1                  1           0             1           0            0             0             0
1             1                  1           1             0           1            0             0             0
2             1                  0           1             0           0            1             1             1
You can limit the number of tokens in the vocabulary with max_features.

>>> vectorizer = CountVectorizer(["corpus"], max_features=3)
>>> vectorizer.fit_transform(ds).to_pandas()
   corpus_likes  corpus_volleyball  corpus_Bob
0             1                  1           0
1             1                  1           1
2             1                  0           1
Parameters
    columns – The columns to separately tokenize and count.
    tokenization_fn – The function used to generate tokens. This function should accept a string as input and return a list of tokens as output. If unspecified, the tokenizer uses a function equivalent to lambda s: s.split(" ").
    max_features – The maximum number of tokens to encode in the transformed dataset. If specified, only the most frequent tokens are encoded.
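If your text needs normalization before counting, you can pass a custom tokenization_fn. Here is a minimal sketch, assuming a hypothetical helper named lowercase_tokenizer (not part of Ray) that lowercases text and strips punctuation before splitting on spaces:

>>> import string
>>>
>>> def lowercase_tokenizer(s: str) -> list:
...     # Hypothetical helper: lowercase, drop punctuation, then split on spaces.
...     cleaned = s.lower().translate(str.maketrans("", "", string.punctuation))
...     return cleaned.split(" ")
...
>>> vectorizer = CountVectorizer(["corpus"], tokenization_fn=lowercase_tokenizer)
>>> vectorizer.fit_transform(ds).to_pandas()  # doctest: +SKIP

With this tokenizer, "Jimmy" and "jimmy" would map to the same column (e.g. corpus_jimmy) instead of two separate ones.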
PublicAPI (alpha): This API is in alpha and may change before becoming stable.