ray.data.preprocessors.HashingVectorizer#

class ray.data.preprocessors.HashingVectorizer(columns: List[str], num_features: int, tokenization_fn: Optional[Callable[[str], List[str]]] = None)[source]#

Bases: ray.data.preprocessor.Preprocessor

Count the frequency of tokens using the hashing trick.

This preprocessor creates num_features columns named like hash_{column_name}_{index}. If num_features is large enough relative to the size of your vocabulary, then each column approximately corresponds to the frequency of a unique token.

HashingVectorizer is memory efficient and quick to pickle. However, given a transformed column, you can’t know which tokens correspond to it. This might make it hard to determine which tokens are important to your model.

Note

This preprocessor transforms each input column to a document-term matrix.

A document-term matrix is a table that describes the frequency of tokens in a collection of documents. For example, the strings "I like Python" and "I dislike Python" might have the document-term matrix below:

   corpus_I  corpus_Python  corpus_dislike  corpus_like
0         1              1               0            1
1         1              1               1            0
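
A minimal sketch of how such a matrix could be computed by hand with pandas and the standard library (for illustration only; this isn't how HashingVectorizer works internally):

>>> from collections import Counter
>>> import pandas as pd
>>>
>>> corpus = ["I like Python", "I dislike Python"]
>>> counts = [Counter(document.split(" ")) for document in corpus]
>>> vocabulary = sorted({token for count in counts for token in count})
>>> pd.DataFrame(
...     [[count[token] for token in vocabulary] for count in counts],
...     columns=[f"corpus_{token}" for token in vocabulary],
... )
   corpus_I  corpus_Python  corpus_dislike  corpus_like
0         1              1               0            1
1         1              1               1            0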

To generate the matrix, you typically map each token to a unique index. For example:

     token  index
0        I      0
1   Python      1
2  dislike      2
3     like      3
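
In code, that mapping is just an enumeration of the vocabulary:

>>> vocabulary = ["I", "Python", "dislike", "like"]
>>> {token: index for index, token in enumerate(vocabulary)}
{'I': 0, 'Python': 1, 'dislike': 2, 'like': 3}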

The problem with this approach is that memory use scales linearly with the size of your vocabulary. HashingVectorizer circumvents this problem by computing indices with a hash function: \(\texttt{index} = \operatorname{hash}(\texttt{token}) \bmod \texttt{num\_features}\).
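
A minimal sketch of the trick, using Python's built-in hash as a stand-in for the hash function Ray actually uses; because the index is computed from the token itself, no token-to-index table needs to be stored:

>>> def hashed_index(token: str, num_features: int) -> int:
...     # Python's built-in hash stands in for Ray's internal hash function.
...     # String hashing is randomized per process, so the exact bucket
...     # assignments vary from run to run.
...     return hash(token) % num_features
>>>
>>> counts = [0] * 8
>>> for token in "I like Python".split(" "):
...     counts[hashed_index(token, num_features=8)] += 1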

Warning

Sparse matrices aren’t currently supported: the output is dense, so each input column produces num_features output columns. If you use a large num_features, this preprocessor might use a prohibitive amount of memory.

Examples

>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import HashingVectorizer
>>>
>>> df = pd.DataFrame({
...     "corpus": [
...         "Jimmy likes volleyball",
...         "Bob likes volleyball too",
...         "Bob also likes fruit jerky"
...     ]
... })
>>> ds = ray.data.from_pandas(df)  
>>>
>>> vectorizer = HashingVectorizer(["corpus"], num_features=8)
>>> vectorizer.fit_transform(ds).to_pandas()  
   hash_corpus_0  hash_corpus_1  hash_corpus_2  hash_corpus_3  hash_corpus_4  hash_corpus_5  hash_corpus_6  hash_corpus_7
0              1              0              1              0              0              0              0              1
1              1              0              1              0              0              0              1              1
2              0              0              1              1              0              2              1              0
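
In the last row, "Bob also likes fruit jerky" contains five distinct tokens but only four nonzero columns: two tokens hash to index 5, which is why hash_corpus_5 is 2. Collisions like this become less likely as num_features grows.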

Parameters
  • columns – The columns to separately tokenize and count.

  • num_features – The number of features used to represent the vocabulary. You should choose a value large enough to prevent hash collisions between distinct tokens.

  • tokenization_fn – The function used to generate tokens. This function should accept a string as input and return a list of tokens as output. If unspecified, the tokenizer uses a function equivalent to lambda s: s.split(" "). A custom tokenizer is sketched below.
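
For example, a custom tokenizer that lowercases text and splits on non-alphanumeric characters could be passed like this (a sketch; lowercase_tokenizer is an illustrative name, not part of the API):

>>> import re
>>> from ray.data.preprocessors import HashingVectorizer
>>>
>>> def lowercase_tokenizer(text: str) -> list:
...     # Lowercase, then split on runs of non-alphanumeric characters.
...     return [token for token in re.split(r"\W+", text.lower()) if token]
>>>
>>> vectorizer = HashingVectorizer(
...     ["corpus"], num_features=8, tokenization_fn=lowercase_tokenizer
... )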

See also

CountVectorizer

Another method for counting token frequencies. Unlike HashingVectorizer, CountVectorizer creates a feature for each unique token. This enables you to compute the inverse transformation.

FeatureHasher

This preprocessor is similar to HashingVectorizer, except it expects a table describing token frequencies. In contrast, HashingVectorizer expects a column containing documents.

PublicAPI (alpha): This API is in alpha and may change before becoming stable.