ray.data.preprocessors.FeatureHasher

- class ray.data.preprocessors.FeatureHasher(columns: List[str], num_features: int, output_column: str)
Bases: `Preprocessor`

Apply the hashing trick to a table that describes token frequencies.

`FeatureHasher` creates `num_features` columns named `hash_{index}`, where `index` ranges from \(0\) to `num_features` \(- 1\). The column `hash_{index}` describes the frequency of tokens that hash to `index`.

Distinct tokens can correspond to the same index. However, if `num_features` is large enough, then each column probably corresponds to a unique token.

This preprocessor is memory efficient and quick to pickle. However, given a transformed column, you can't know which tokens correspond to it. This might make it hard to determine which tokens are important to your model.
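To make the hashing trick concrete, here is a minimal standalone sketch of the idea. It uses `md5` as a stand-in deterministic hash, so the bucket indices it produces may differ from Ray's; the helper names `hash_token` and `hash_row` are illustrative, not part of the Ray API.

```python
import hashlib

def hash_token(token: str, num_features: int) -> int:
    # Stand-in deterministic hash for illustration; Ray's FeatureHasher
    # uses its own internal hash, so indices may differ.
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_features

def hash_row(token_counts: dict, num_features: int) -> list:
    # Each token's frequency is added to the bucket it hashes to,
    # producing the hash_0 ... hash_{num_features - 1} columns.
    features = [0] * num_features
    for token, count in token_counts.items():
        features[hash_token(token, num_features)] += count
    return features

print(hash_row({"I": 1, "like": 1, "dislike": 0, "Python": 1}, 8))
```

Note that hashing is one-way: after the transform you only have bucket indices, which is why the original tokens can't be recovered.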
Warning

Sparse matrices aren't supported. If you use a large `num_features`, this preprocessor might behave poorly.

Examples
```python
>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import FeatureHasher
```
The data below describes the frequencies of tokens in `"I like Python"` and `"I dislike Python"`.

```python
>>> df = pd.DataFrame({
...     "I": [1, 1],
...     "like": [1, 0],
...     "dislike": [0, 1],
...     "Python": [1, 1]
... })
>>> ds = ray.data.from_pandas(df)
```
`FeatureHasher` hashes each token to determine its index. For example, the index of `"I"` is \(hash(\texttt{"I"}) \bmod 8 = 5\).

```python
>>> hasher = FeatureHasher(columns=["I", "like", "dislike", "Python"], num_features=8, output_column="hashed")
>>> hasher.fit_transform(ds)["hashed"].to_pandas().to_numpy()
array([[0, 0, 0, 2, 0, 1, 0, 0],
       [0, 0, 0, 1, 0, 1, 1, 0]])
```
Notice the hash collision: both `"like"` and `"Python"` correspond to index \(3\). You can avoid hash collisions like these by increasing `num_features`.

- Parameters:
  - `columns` – The columns to apply the hashing trick to. Each column should describe the frequency of a token.
  - `num_features` – The number of features used to represent the vocabulary. You should choose a value large enough to prevent hash collisions between distinct tokens.
  - `output_column` – The name of the column that contains the hashed features.
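To see why a larger `num_features` reduces collisions, the sketch below compares how many distinct bucket indices four tokens occupy at two bucket counts. It uses `md5` as a stand-in hash (not Ray's actual hash function), and the `bucket` helper is illustrative, not part of the Ray API.

```python
import hashlib

def bucket(token: str, num_features: int) -> int:
    # Stand-in deterministic hash for illustration; Ray's actual
    # hash function may assign different indices.
    return int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % num_features

tokens = ["I", "like", "dislike", "Python"]
for n in (8, 4096):
    indices = [bucket(t, n) for t in tokens]
    # With more buckets, distinct tokens are less likely to share an index.
    print(n, "buckets ->", len(set(indices)), "distinct indices for", len(tokens), "tokens")
```

The tradeoff is memory: every row carries `num_features` columns, which is why the warning above cautions against very large values without sparse-matrix support.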
See also

`CountVectorizer`
    Use this preprocessor to generate inputs for `FeatureHasher`.

`ray.data.preprocessors.HashingVectorizer`
    If your input data describes documents rather than token frequencies, use `HashingVectorizer`.
PublicAPI (alpha): This API is in alpha and may change before becoming stable.
Methods

| Method | Description |
| --- | --- |
| `deserialize` | Load the original preprocessor serialized via `self.serialize()`. |
| `fit` | Fit this Preprocessor to the Dataset. |
| `fit_transform` | Fit this Preprocessor to the Dataset and then transform the Dataset. |
| `preferred_batch_format` | Batch format hint for upstream producers to try yielding best block format. |
| `serialize` | Return this preprocessor serialized as a string. |
| `transform` | Transform the given dataset. |
| `transform_batch` | Transform a single batch of data. |