ray.data.preprocessors.HashingVectorizer
- class ray.data.preprocessors.HashingVectorizer(columns: List[str], num_features: int, tokenization_fn: Optional[Callable[[str], List[str]]] = None)
Bases: ray.data.preprocessor.Preprocessor
Count the frequency of tokens using the hashing trick.
This preprocessor creates num_features columns named like hash_{column_name}_{index}. If num_features is large enough relative to the size of your vocabulary, then each column approximately corresponds to the frequency of a unique token.

HashingVectorizer is memory efficient and quick to pickle. However, given a transformed column, you can't know which tokens correspond to it. This might make it hard to determine which tokens are important to your model.

Note
This preprocessor transforms each input column to a document-term matrix.
A document-term matrix is a table that describes the frequency of tokens in a collection of documents. For example, the strings "I like Python" and "I dislike Python" might have the document-term matrix below:

   corpus_I  corpus_Python  corpus_dislike  corpus_like
0         1              1               0            1
1         1              1               1            0
To generate the matrix, you typically map each token to a unique index. For example:

     token  index
0        I      0
1   Python      1
2  dislike      2
3     like      3
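For concreteness, here is a minimal sketch in plain Python (not Ray's implementation) that builds this vocabulary index and the document-term matrix for the two example strings:

docs = ["I like Python", "I dislike Python"]

# Map each unique token to a unique column index (sorted for a stable order).
vocab = sorted({token for doc in docs for token in doc.split(" ")})
index = {token: i for i, token in enumerate(vocab)}

# Count token frequencies per document.
matrix = []
for doc in docs:
    row = [0] * len(vocab)
    for token in doc.split(" "):
        row[index[token]] += 1
    matrix.append(row)

print(vocab)   # ['I', 'Python', 'dislike', 'like']
print(matrix)  # [[1, 1, 0, 1], [1, 1, 1, 0]]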
The problem with this approach is that memory use scales linearly with the size of your vocabulary. HashingVectorizer circumvents this problem by computing indices with a hash function: \(\texttt{index} = \mathrm{hash}(\texttt{token})\). Because there are only num_features columns, the hashed value is reduced modulo num_features to select a column.
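To illustrate the idea, here is a hedged sketch of the hashing trick using Python's built-in hash. It shows the mechanism only; it is not Ray's hash function (the built-in hash is salted per process, so these indices vary between runs):

from collections import Counter

def hashed_counts(document: str, num_features: int) -> Counter:
    """Count tokens by hashed column index instead of by vocabulary index."""
    counts = Counter()
    for token in document.split(" "):
        # Reduce the hash modulo num_features so the index always falls in
        # [0, num_features). Distinct tokens can collide on the same index;
        # a larger num_features makes collisions less likely.
        counts[hash(token) % num_features] += 1
    return counts

print(hashed_counts("Bob likes volleyball too", num_features=8))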
Warning

Sparse matrices aren’t currently supported. If you use a large num_features, this preprocessor might behave poorly.

Examples
>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import HashingVectorizer
>>>
>>> df = pd.DataFrame({
...     "corpus": [
...         "Jimmy likes volleyball",
...         "Bob likes volleyball too",
...         "Bob also likes fruit jerky"
...     ]
... })
>>> ds = ray.data.from_pandas(df)
>>>
>>> vectorizer = HashingVectorizer(["corpus"], num_features=8)
>>> vectorizer.fit_transform(ds).to_pandas()
   hash_corpus_0  hash_corpus_1  hash_corpus_2  hash_corpus_3  hash_corpus_4  hash_corpus_5  hash_corpus_6  hash_corpus_7
0              1              0              1              0              0              0              0              1
1              1              0              1              0              0              0              1              1
2              0              0              1              1              0              2              1              0
- Parameters
    columns – The columns to separately tokenize and count.
    num_features – The number of features used to represent the vocabulary. You should choose a value large enough to prevent hash collisions between distinct tokens.
    tokenization_fn – The function used to generate tokens. This function should accept a string as input and return a list of tokens as output. If unspecified, the tokenizer uses a function equivalent to lambda s: s.split(" "). A custom tokenizer can be supplied as in the sketch below.
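For instance, a hypothetical custom tokenization_fn (the tokenize helper below is an assumption for illustration, not part of Ray) could lowercase text and split on non-word characters before hashing:

import re

from ray.data.preprocessors import HashingVectorizer

# Hypothetical tokenizer: lowercase the text and split on runs of
# non-alphanumeric characters instead of the default split on spaces.
def tokenize(text: str) -> list:
    return [token for token in re.split(r"\W+", text.lower()) if token]

vectorizer = HashingVectorizer(["corpus"], num_features=8, tokenization_fn=tokenize)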
See also

CountVectorizer
    Another method for counting token frequencies. Unlike HashingVectorizer, CountVectorizer creates a feature for each unique token. This enables you to compute the inverse transformation.

FeatureHasher
    This preprocessor is similar to HashingVectorizer, except it expects a table describing token frequencies. In contrast, HashingVectorizer expects a column containing documents.
PublicAPI (alpha): This API is in alpha and may change before becoming stable.