ray.data.preprocessors.HashingVectorizer

- class ray.data.preprocessors.HashingVectorizer(columns: List[str], num_features: int, tokenization_fn: Callable[[str], List[str]] | None = None, *, output_columns: List[str] | None = None)
Bases: Preprocessor
Count the frequency of tokens using the hashing trick.
This preprocessor creates a list column for each input column. For each row, the list contains the frequency counts of tokens, bucketed by hash value, and has length num_features. If num_features is large enough relative to the size of your vocabulary, then each index approximately corresponds to the frequency of a unique token.

HashingVectorizer is memory efficient and quick to pickle. However, given a transformed column, you can’t know which tokens produced each count. This might make it hard to determine which tokens are important to your model.

Note
This preprocessor transforms each input column to a document-term matrix.
A document-term matrix is a table that describes the frequency of tokens in a collection of documents. For example, the strings "I like Python" and "I dislike Python" might have the document-term matrix below:

```
   corpus_I  corpus_Python  corpus_dislike  corpus_like
0         1              1               1            0
1         1              1               0            1
```

To generate the matrix, you typically map each token to a unique index. For example:

```
     token  index
0        I      0
1   Python      1
2  dislike      2
3     like      3
```
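As a plain-Python illustration (not part of the Ray API), here is how such an explicit token-to-index mapping and the resulting count rows could be built:

```python
>>> corpus = ["I like Python", "I dislike Python"]
>>> # Assign each unique token an index in sorted order.
>>> vocab = {tok: i for i, tok in enumerate(sorted({t for doc in corpus for t in doc.split()}))}
>>> vocab
{'I': 0, 'Python': 1, 'dislike': 2, 'like': 3}
>>> rows = []
>>> for doc in corpus:
...     row = [0] * len(vocab)  # one count slot per vocabulary entry
...     for tok in doc.split():
...         row[vocab[tok]] += 1
...     rows.append(row)
>>> rows
[[1, 1, 0, 1], [1, 1, 1, 0]]
```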
The problem with this approach is that memory use scales linearly with the size of your vocabulary. HashingVectorizer circumvents this problem by computing indices with a hash function: \(\texttt{index} = hash(\texttt{token}) \bmod \texttt{num\_features}\).
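Conceptually, the bucketing works like the short sketch below. This is illustrative plain Python only: Ray’s actual hash function differs, and Python’s built-in hash is randomized between processes, so real outputs won’t match across runs.

```python
>>> def hashed_counts(tokens, num_features):
...     """Count tokens into num_features buckets chosen by hash value."""
...     counts = [0] * num_features
...     for tok in tokens:
...         # Distinct tokens may collide in the same bucket, which is why
...         # num_features should be large relative to the vocabulary size.
...         counts[hash(tok) % num_features] += 1
...     return counts
```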
Warning

Sparse matrices aren’t currently supported. If you use a large num_features, this preprocessor might behave poorly.

Examples
```python
>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import HashingVectorizer
>>>
>>> df = pd.DataFrame({
...     "corpus": [
...         "Jimmy likes volleyball",
...         "Bob likes volleyball too",
...         "Bob also likes fruit jerky"
...     ]
... })
>>> ds = ray.data.from_pandas(df)
>>>
>>> vectorizer = HashingVectorizer(["corpus"], num_features=8)
>>> vectorizer.fit_transform(ds).to_pandas()
                     corpus
0  [1, 0, 1, 0, 0, 0, 0, 1]
1  [1, 0, 1, 0, 0, 0, 1, 1]
2  [0, 0, 1, 1, 0, 2, 1, 0]
```
HashingVectorizer can also be used in append mode by providing output_columns, the names of the columns that should hold the encoded values.

```python
>>> vectorizer = HashingVectorizer(["corpus"], num_features=8, output_columns=["corpus_hashed"])
>>> vectorizer.fit_transform(ds).to_pandas()
                       corpus             corpus_hashed
0      Jimmy likes volleyball  [1, 0, 1, 0, 0, 0, 0, 1]
1    Bob likes volleyball too  [1, 0, 1, 0, 0, 0, 1, 1]
2  Bob also likes fruit jerky  [0, 0, 1, 1, 0, 2, 1, 0]
```
- Parameters:
  - columns – The columns to separately tokenize and count.
  - num_features – The number of features used to represent the vocabulary. You should choose a value large enough to prevent hash collisions between distinct tokens.
  - tokenization_fn – The function used to generate tokens. This function should accept a string as input and return a list of tokens as output. If unspecified, the tokenizer uses a function equivalent to lambda s: s.split(" "). See the sketch after this list for a custom tokenizer.
  - output_columns – The names of the transformed columns. If None, the transformed columns will be the same as the input columns. If not None, the length of output_columns must match the length of columns, otherwise an error is raised.
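For example, a custom tokenization_fn might normalize case and split on punctuation rather than only on spaces. A minimal sketch (the tokenize helper below is hypothetical, not part of Ray):

```python
>>> import re
>>>
>>> def tokenize(s):
...     # Lowercase, then split on runs of non-word characters,
...     # so "fruit-jerky!" yields ["fruit", "jerky"].
...     return [tok for tok in re.split(r"\W+", s.lower()) if tok]
>>>
>>> vectorizer = HashingVectorizer(["corpus"], num_features=8, tokenization_fn=tokenize)
```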
See also

CountVectorizer
    Another method for counting token frequencies. Unlike HashingVectorizer, CountVectorizer creates a feature for each unique token. This enables you to compute the inverse transformation.

FeatureHasher
    This preprocessor is similar to HashingVectorizer, except it expects a table describing token frequencies. In contrast, HashingVectorizer expects a column containing documents.
PublicAPI (alpha): This API is in alpha and may change before becoming stable.
Methods

deserialize – Load the original preprocessor serialized via self.serialize().
fit – Fit this Preprocessor to the Dataset.
fit_transform – Fit this Preprocessor to the Dataset and then transform the Dataset.
preferred_batch_format – Batch format hint for upstream producers to try yielding best block format.
serialize – Return this preprocessor serialized as a string.
transform – Transform the given dataset.
transform_batch – Transform a single batch of data.
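The methods compose roughly as in the sketch below, reusing ds and df from the examples above. This assumes the generic Preprocessor workflow; the hashing trick needs no fitted state, so fitting is effectively a no-op for HashingVectorizer.

```python
>>> vectorizer = HashingVectorizer(["corpus"], num_features=8)
>>> transformed = vectorizer.fit_transform(ds)   # fit, then transform the Dataset
>>> batch = vectorizer.transform_batch(df)       # transform an in-memory pandas batch
>>> payload = vectorizer.serialize()             # preprocessor serialized as a string
>>> restored = HashingVectorizer.deserialize(payload)
```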