ray.data.preprocessors.FeatureHasher
- class ray.data.preprocessors.FeatureHasher(columns: List[str], num_features: int)
Bases: ray.data.preprocessor.Preprocessor
Apply the hashing trick to a table that describes token frequencies.
FeatureHasher creates num_features columns named hash_{index}, where index ranges from \(0\) to \(\texttt{num\_features} - 1\). The column hash_{index} describes the frequency of tokens that hash to index.

Distinct tokens can correspond to the same index. However, if num_features is large enough, then each column probably corresponds to a unique token.

This preprocessor is memory efficient and quick to pickle. However, given a transformed column, you can’t know which tokens correspond to it. This might make it hard to determine which tokens are important to your model.
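The bucketing described above can be pictured with a small, stdlib-only sketch. It is an illustration under stated assumptions, not FeatureHasher's implementation: the helper names bucket_index and hash_row are hypothetical, and MD5 is chosen only because it is stable across runs, so the bucket indices it produces need not match the ones Ray computes.

```python
import hashlib
from typing import Dict, List


def bucket_index(token: str, num_features: int) -> int:
    """Map a token to a bucket in [0, num_features) with a stable hash.

    Assumption: MD5 is used here only because it is deterministic across
    runs; the hash that FeatureHasher actually uses may differ.
    """
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_features


def hash_row(token_counts: Dict[str, int], num_features: int) -> List[int]:
    """Fold one row of token frequencies into num_features buckets."""
    hashed = [0] * num_features
    for token, count in token_counts.items():
        # Tokens that share a bucket have their counts added together.
        hashed[bucket_index(token, num_features)] += count
    return hashed


# Two rows of token frequencies, one dict per document.
rows = [
    {"I": 1, "like": 1, "dislike": 0, "Python": 1},
    {"I": 1, "like": 0, "dislike": 1, "Python": 1},
]
hashed_rows = [hash_row(row, num_features=8) for row in rows]
for hashed in hashed_rows:
    print(hashed)
```

Each output row always has exactly num_features entries, and the total count per row is preserved. What is lost is the mapping from a bucket back to the tokens that landed in it.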
Warning
Sparse matrices aren’t supported. If you use a large num_features, this preprocessor might behave poorly.

Examples

>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import FeatureHasher
The data below describes the frequencies of tokens in "I like Python" and "I dislike Python".

>>> df = pd.DataFrame({
...     "I": [1, 1],
...     "like": [1, 0],
...     "dislike": [0, 1],
...     "Python": [1, 1]
... })
>>> ds = ray.data.from_pandas(df)
FeatureHasher hashes each token to determine its index. For example, the index of "I" is \(hash(\texttt{"I"}) \pmod 8 = 5\).

>>> hasher = FeatureHasher(columns=["I", "like", "dislike", "Python"], num_features=8)
>>> hasher.fit_transform(ds).to_pandas().to_numpy()
array([[0, 0, 0, 2, 0, 1, 0, 0],
       [0, 0, 0, 1, 0, 1, 1, 0]])
Notice the hash collision: both "like" and "Python" correspond to index \(3\). You can avoid hash collisions like these by increasing num_features.

- Parameters
columns – The columns to apply the hashing trick to. Each column should describe the frequency of a token.
num_features – The number of features used to represent the vocabulary. You should choose a value large enough to prevent hash collisions between distinct tokens.
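One way to judge whether a candidate num_features is large enough is to count which tokens in a known vocabulary would share a bucket. The sketch below is a rough illustration: stable_bucket and find_collisions are hypothetical helpers, and the MD5-based hash is an assumption that may not match the hash FeatureHasher uses, so treat the result as an indication of collision rates rather than a prediction of the exact colliding tokens.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def stable_bucket(token: str, num_features: int) -> int:
    """Assumed hash: MD5 digest modulo num_features (may differ from Ray's)."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_features


def find_collisions(tokens: List[str], num_features: int) -> Dict[int, List[str]]:
    """Return only the buckets that receive more than one distinct token."""
    buckets: Dict[int, List[str]] = defaultdict(list)
    for token in tokens:
        buckets[stable_bucket(token, num_features)].append(token)
    return {index: toks for index, toks in buckets.items() if len(toks) > 1}


vocabulary = ["I", "like", "dislike", "Python"]
for num_features in (2, 8, 64, 1024):
    collisions = find_collisions(vocabulary, num_features)
    print(f"num_features={num_features}: {len(collisions)} colliding bucket(s)")
```

With fewer buckets than tokens, at least one collision is unavoidable. As num_features grows well beyond the vocabulary size, collisions become increasingly unlikely, at the cost of a wider dense output.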
See also
CountVectorizer
Use this preprocessor to generate inputs for FeatureHasher.

ray.data.preprocessors.HashingVectorizer
If your input data describes documents rather than token frequencies, use HashingVectorizer.
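To make the distinction between documents and token frequencies concrete, the following stdlib-only sketch turns raw sentences into a token-frequency table of the kind FeatureHasher expects. It is not how CountVectorizer is implemented: token_frequency_table is a hypothetical helper, and splitting on whitespace is an assumed tokenization.

```python
from collections import Counter
from typing import Dict, List


def token_frequency_table(documents: List[str]) -> List[Dict[str, int]]:
    """Count whitespace-separated tokens per document over a shared vocabulary."""
    counts = [Counter(document.split()) for document in documents]

    # Build the vocabulary in first-seen order so every row has the same keys.
    vocabulary: List[str] = []
    for document_counts in counts:
        for token in document_counts:
            if token not in vocabulary:
                vocabulary.append(token)

    # A Counter returns 0 for tokens that are absent from a document.
    return [
        {token: document_counts[token] for token in vocabulary}
        for document_counts in counts
    ]


table = token_frequency_table(["I like Python", "I dislike Python"])
for row in table:
    print(row)
```

Each dictionary corresponds to one row of a token-frequency table, with one key per token in the shared vocabulary.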
PublicAPI (alpha): This API is in alpha and may change before becoming stable.