ray.data.preprocessors.FeatureHasher

class ray.data.preprocessors.FeatureHasher(columns: List[str], num_features: int, output_column: str)

Bases: Preprocessor

Apply the hashing trick to a table that describes token frequencies.

FeatureHasher produces one vector of length num_features per row and stores it in output_column. Position index of the vector, where index ranges from \(0\) to \(\texttt{num\_features} - 1\), holds the total frequency of the tokens that hash to index.

Distinct tokens can correspond to the same index. However, if num_features is large enough, each index probably corresponds to a single token.
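The mapping from tokens to indices can be sketched in plain Python. This is a minimal illustration, not Ray's implementation: the helper names `bucket_index` and `hash_counts` are hypothetical, and MD5 is assumed only because it gives the same result on every run (Python's built-in `hash` is salted per process for strings).

```python
import hashlib
from typing import Dict, List


def bucket_index(token: str, num_features: int) -> int:
    """Map a token to a bucket in [0, num_features). MD5 is an assumption."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_features


def hash_counts(token_counts: Dict[str, int], num_features: int) -> List[int]:
    """Sum the counts of all tokens that land in the same bucket."""
    vector = [0] * num_features
    for token, count in token_counts.items():
        vector[bucket_index(token, num_features)] += count
    return vector


# Token frequencies for "I like Python".
row = {"I": 1, "like": 1, "dislike": 0, "Python": 1}
vector = hash_counts(row, num_features=8)
print(vector)
```

The output always has exactly num_features entries, and its entries sum to the total token count of the row, no matter how many distinct tokens the input has.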

This preprocessor is memory efficient and quick to pickle. However, given a transformed column, you can’t know which tokens correspond to it. This might make it hard to determine which tokens are important to your model.

Warning

Sparse matrices aren’t supported. If you use a large num_features, this preprocessor might behave poorly.
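To get a feel for why a large num_features can be a problem, here is a rough size estimate for a dense output. It assumes 8-byte counts and ignores per-row overhead; both are assumptions for illustration, and `dense_size_bytes` is a hypothetical helper, not part of Ray.

```python
def dense_size_bytes(num_rows: int, num_features: int, bytes_per_value: int = 8) -> int:
    """Approximate size of a dense num_rows x num_features matrix."""
    return num_rows * num_features * bytes_per_value


# One million rows hashed into 2**18 buckets.
size = dense_size_bytes(num_rows=1_000_000, num_features=2**18)
print(f"{size / 1e12:.1f} TB")  # prints "2.1 TB"
```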

Examples

>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import FeatureHasher

The data below describes the frequencies of tokens in "I like Python" and "I dislike Python".

>>> df = pd.DataFrame({
...     "I": [1, 1],
...     "like": [1, 0],
...     "dislike": [0, 1],
...     "Python": [1, 1]
... })
>>> ds = ray.data.from_pandas(df)  

FeatureHasher hashes each token to determine its index. For example, the index of "I" is \(\mathrm{hash}(\texttt{"I"}) \bmod 8 = 5\).

>>> hasher = FeatureHasher(columns=["I", "like", "dislike", "Python"], num_features=8, output_column="hashed")
>>> import numpy as np
>>> np.stack(hasher.fit_transform(ds).to_pandas()["hashed"])
array([[0, 0, 0, 2, 0, 1, 0, 0],
       [0, 0, 0, 1, 0, 1, 1, 0]])

Notice the hash collision: both "like" and "Python" correspond to index \(3\). You can make collisions like this less likely by increasing num_features.
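How quickly collisions become unlikely can be estimated with a birthday-problem calculation. The sketch below assumes the hash spreads tokens uniformly and independently over the buckets, which is an idealization; `collision_probability` is a hypothetical helper, not part of Ray.

```python
def collision_probability(num_tokens: int, num_features: int) -> float:
    """Probability that at least two tokens share a bucket, assuming uniform hashing."""
    p_distinct = 1.0
    for i in range(num_tokens):
        # The (i + 1)-th token must avoid the i buckets already taken.
        p_distinct *= max(0.0, 1.0 - i / num_features)
    return 1.0 - p_distinct


for n in (8, 64, 1024):
    print(n, collision_probability(4, n))
```

Under this assumption, four tokens in eight buckets collide about 59% of the time, so the collision in the example above is not surprising.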

Parameters:

columns – The columns to apply the hashing trick to. Each column should describe the frequency of a token.
num_features – The number of features used to represent the vocabulary. You should choose a value large enough to make hash collisions between distinct tokens unlikely.
output_column – The name of the column that contains the hashed features.

See also

CountVectorizer: Use this preprocessor to generate inputs for FeatureHasher.
ray.data.preprocessors.HashingVectorizer: If your input data describes documents rather than token frequencies, use HashingVectorizer.

PublicAPI (alpha): This API is in alpha and may change before becoming stable.

Methods

`deserialize`	Load the original preprocessor serialized via `self.serialize()`.
`fit`	Fit this Preprocessor to the Dataset.
`fit_transform`	Fit this Preprocessor to the Dataset and then transform the Dataset.
`preferred_batch_format`	Batch format hint for upstream producers to try yielding best block format.
`serialize`	Return this preprocessor serialized as a string.
`transform`	Transform the given dataset.
`transform_batch`	Transform a single batch of data.