ray.data.preprocessors.HashingVectorizer#

class ray.data.preprocessors.HashingVectorizer(columns: List[str], num_features: int, tokenization_fn: Callable[[str], List[str]] | None = None, *, output_columns: List[str] | None = None)[source]#

Bases: Preprocessor

Count the frequency of tokens using the hashing trick.

This preprocessor creates a list column for each input column. For each row, the list contains token frequency counts bucketed by hash value and has length num_features. If num_features is large enough relative to the size of your vocabulary, then each index approximately corresponds to the frequency of a unique token.

HashingVectorizer is memory efficient and quick to pickle. However, given a transformed column, you can’t know which token each index corresponds to. This can make it hard to determine which tokens are important to your model.

Note

This preprocessor transforms each input column to a document-term matrix.

A document-term matrix is a table that describes the frequency of tokens in a collection of documents. For example, the strings "I like Python" and "I dislike Python" might have the document-term matrix below:

    corpus_I  corpus_Python  corpus_dislike  corpus_like
0         1              1               1            0
1         1              1               0            1

To generate the matrix, you typically map each token to a unique index. For example:

        token  index
0        I      0
1   Python      1
2  dislike      2
3     like      3
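
For illustration, here is a minimal, non-Ray sketch in plain Python of how those two tables combine: each document's row is just the count of each token at its assigned index.

>>> docs = ["I like Python", "I dislike Python"]
>>> vocab = sorted({token for doc in docs for token in doc.split(" ")})
>>> vocab  # a token's index is its position in this list
['I', 'Python', 'dislike', 'like']
>>> {doc: [doc.split(" ").count(token) for token in vocab] for doc in docs}
{'I like Python': [1, 1, 0, 1], 'I dislike Python': [1, 1, 1, 0]}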

The problem with this approach is that memory use scales linearly with the size of your vocabulary. HashingVectorizer circumvents this problem by computing indices with a hash function: \(\texttt{index} = \texttt{hash}(\texttt{token}) \bmod \texttt{num\_features}\).
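
As a toy illustration of the hashing trick (Python's built-in hash stands in for whatever hash function the preprocessor actually uses, so the exact bucket each token lands in will differ):

>>> def hashed_counts(doc, num_features):
...     """Toy hashing trick: bucket token counts by hash(token) % num_features."""
...     counts = [0] * num_features
...     for token in doc.split(" "):
...         counts[hash(token) % num_features] += 1
...     return counts
...
>>> counts = hashed_counts("Jimmy likes volleyball", num_features=8)
>>> len(counts), sum(counts)  # fixed-length output; the total equals the number of tokens
(8, 3)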

Warning

Sparse matrices aren’t currently supported. If you use a large num_features, this preprocessor might behave poorly.

Examples

>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import HashingVectorizer
>>>
>>> df = pd.DataFrame({
...     "corpus": [
...         "Jimmy likes volleyball",
...         "Bob likes volleyball too",
...         "Bob also likes fruit jerky"
...     ]
... })
>>> ds = ray.data.from_pandas(df)  
>>>
>>> vectorizer = HashingVectorizer(["corpus"], num_features=8)
>>> vectorizer.fit_transform(ds).to_pandas()  
                     corpus
0  [1, 0, 1, 0, 0, 0, 0, 1]
1  [1, 0, 1, 0, 0, 0, 1, 1]
2  [0, 0, 1, 1, 0, 2, 1, 0]

HashingVectorizer can also be used in append mode by providing the names of the output_columns that should hold the encoded values.

>>> vectorizer = HashingVectorizer(["corpus"], num_features=8, output_columns=["corpus_hashed"])
>>> vectorizer.fit_transform(ds).to_pandas()  
                       corpus             corpus_hashed
0      Jimmy likes volleyball  [1, 0, 1, 0, 0, 0, 0, 1]
1    Bob likes volleyball too  [1, 0, 1, 0, 0, 0, 1, 1]
2  Bob also likes fruit jerky  [0, 0, 1, 1, 0, 2, 1, 0]
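
You can also pass a custom tokenization_fn: any callable that takes a string and returns a list of tokens works. The lowercasing and punctuation stripping below is only an illustrative choice, not default behavior.

>>> import string
>>>
>>> def tokenize(text: str) -> list:
...     """Hypothetical tokenizer: lowercase, drop punctuation, split on whitespace."""
...     cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
...     return cleaned.split()
...
>>> vectorizer = HashingVectorizer(["corpus"], num_features=8, tokenization_fn=tokenize)
>>> transformed = vectorizer.fit_transform(ds)  # bucket values depend on the hash function
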
Parameters:
  • columns – The columns to separately tokenize and count.

  • num_features – The number of features used to represent the vocabulary. You should choose a value large enough to prevent hash collisions between distinct tokens.

  • tokenization_fn – The function used to generate tokens. This function should accept a string as input and return a list of tokens as output. If unspecified, the tokenizer uses a function equivalent to lambda s: s.split(" ").

  • output_columns – The names of the transformed columns. If None, the transformed columns will be the same as the input columns. If not None, the length of output_columns must match the length of columns; otherwise, an error is raised.

See also

CountVectorizer

Another method for counting token frequencies. Unlike HashingVectorizer, CountVectorizer creates a feature for each unique token. This enables you to compute the inverse transformation.

FeatureHasher

This preprocessor is similar to HashingVectorizer, except it expects a table describing token frequencies. In contrast, HashingVectorizer expects a column containing documents.

PublicAPI (alpha): This API is in alpha and may change before becoming stable.

Methods

deserialize

Load the original preprocessor serialized via self.serialize().

fit

Fit this Preprocessor to the Dataset.

fit_transform

Fit this Preprocessor to the Dataset and then transform the Dataset.

preferred_batch_format

Batch format hint for upstream producers to try yielding the best block format.

serialize

Return this preprocessor serialized as a string.

transform

Transform the given dataset.

transform_batch

Transform a single batch of data.
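
For example, transform_batch can be applied to a single in-memory pandas batch. This is a minimal sketch that assumes no prior fit call is required, since the hashing trick keeps no fitted state:

>>> import pandas as pd
>>> from ray.data.preprocessors import HashingVectorizer
>>>
>>> vectorizer = HashingVectorizer(["corpus"], num_features=8)
>>> batch = pd.DataFrame({"corpus": ["Bob likes volleyball too"]})
>>> hashed = vectorizer.transform_batch(batch)  # a pandas batch with "corpus" replaced by hashed counts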