ray.data.preprocessors.FeatureHasher

class ray.data.preprocessors.FeatureHasher(columns: List[str], num_features: int, output_column: str)

Bases: SerializablePreprocessorBase

Apply the hashing trick to a table that describes token frequencies.

FeatureHasher computes num_features hashed features, indexed from \(0\) to num_features \(- 1\), and stores them together in output_column. The feature at position index describes the total frequency of the tokens that hash to index.

Distinct tokens can hash to the same index. However, if num_features is large enough, each index probably corresponds to a single token.
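The hashing trick described above can be sketched in plain Python. This is a minimal illustration, not Ray's implementation: the helper names stable_hash and hash_frequencies are hypothetical, and SHA-1 is only a stand-in for whatever hash function FeatureHasher uses internally, so the indices it produces may differ from Ray's.

```python
import hashlib
from typing import Dict, List


def stable_hash(token: str, num_features: int) -> int:
    """Map a token to a bucket index in [0, num_features).

    Assumption: SHA-1 is a stand-in hash chosen because it is deterministic
    across processes (unlike Python's built-in hash() for strings).
    """
    digest = hashlib.sha1(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_features


def hash_frequencies(frequencies: Dict[str, int], num_features: int) -> List[int]:
    """Sum the frequency of every token into the bucket it hashes to."""
    hashed = [0] * num_features
    for token, count in frequencies.items():
        # Tokens that collide share a bucket, so their counts add up.
        hashed[stable_hash(token, num_features)] += count
    return hashed


# Token frequencies for "I like Python".
row = {"I": 1, "like": 1, "dislike": 0, "Python": 1}
features = hash_frequencies(row, num_features=8)
print(features)
```

The result always has num_features entries and preserves the total token count, but the mapping from bucket back to token is lost.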

This preprocessor is memory efficient and quick to pickle. However, given a transformed column, you can’t know which tokens correspond to it. This might make it hard to determine which tokens are important to your model.

Warning

Sparse matrices aren’t supported, so the hashed features are stored densely. If you use a large num_features, this preprocessor might use a lot of memory and run slowly.
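To get a feel for why a large num_features can be a problem, here is a rough back-of-the-envelope estimate. It assumes the output is a dense array with one 8-byte value per feature per row; the helper dense_output_bytes is hypothetical, and the actual dtype and overhead in Ray may differ.

```python
def dense_output_bytes(num_rows: int, num_features: int, bytes_per_value: int = 8) -> int:
    """Estimate the size of a dense (num_rows x num_features) output.

    Assumption: every value takes bytes_per_value bytes (8 for int64).
    """
    return num_rows * num_features * bytes_per_value


rows = 1_000_000
for num_features in (8, 2**10, 2**20):
    gib = dense_output_bytes(rows, num_features) / 2**30
    print(f"num_features={num_features}: about {gib:.2f} GiB")
```

Under these assumptions the footprint grows linearly with num_features, regardless of how many of the values are zero.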

Examples

>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import FeatureHasher

The data below describes the frequencies of tokens in "I like Python" and "I dislike Python".

>>> df = pd.DataFrame({
...     "I": [1, 1],
...     "like": [1, 0],
...     "dislike": [0, 1],
...     "Python": [1, 1]
... })
>>> ds = ray.data.from_pandas(df)  

FeatureHasher hashes each token to determine its index. For example, the index of "I" is \(hash(\texttt{"I"}) \pmod 8 = 5\).

>>> hasher = FeatureHasher(
...     columns=["I", "like", "dislike", "Python"],
...     num_features=8,
...     output_column="hashed",
... )
>>> hasher.fit_transform(ds)["hashed"].to_pandas().to_numpy()
array([[0, 0, 0, 2, 0, 1, 0, 0],
       [0, 0, 0, 1, 0, 1, 1, 0]])

Notice the hash collision: both "like" and "Python" hash to index \(3\), so their frequencies are summed there (the first row has a \(2\) at that position). You can make collisions like this less likely by increasing num_features.
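As a rough guide to how num_features affects collisions, the sketch below estimates the chance that at least two tokens share an index. It assumes an idealized hash that assigns each token to an index uniformly and independently (the classic birthday-problem model); the function name prob_no_collision is hypothetical, and a real hash function only approximates this behavior.

```python
def prob_no_collision(num_tokens: int, num_features: int) -> float:
    """Probability that num_tokens distinct tokens land in distinct buckets.

    Assumption: each token is hashed uniformly and independently.
    """
    if num_tokens > num_features:
        # Pigeonhole principle: a collision is certain.
        return 0.0
    probability = 1.0
    for already_placed in range(num_tokens):
        # The next token must avoid every bucket that is already occupied.
        probability *= (num_features - already_placed) / num_features
    return probability


for num_features in (8, 64, 1024):
    collision = 1 - prob_no_collision(4, num_features)
    print(f"num_features={num_features}: P(collision) = {collision:.4f}")
```

With four tokens and num_features=8, as in the example above, this model puts the chance of at least one collision at roughly 59%, so the collision seen there is not surprising.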

Parameters:
  • columns – The columns to apply the hashing trick to. Each column should describe the frequency of a token.

  • num_features – The number of features used to represent the vocabulary. Choose a value large enough to make hash collisions between distinct tokens unlikely.

  • output_column – The name of the column that contains the hashed features.

See also

CountVectorizer

Use this preprocessor to generate inputs for FeatureHasher.

ray.data.preprocessors.HashingVectorizer

If your input data describes documents rather than token frequencies, use HashingVectorizer.

PublicAPI (alpha): This API is in alpha and may change before becoming stable.

Methods

  • deserialize – Deserialize a preprocessor from serialized data.

  • fit – Fit this Preprocessor to the Dataset.

  • fit_transform – Fit this Preprocessor to the Dataset and then transform the Dataset.

  • get_preprocessor_class_id – Get the preprocessor class identifier for this preprocessor class.

  • get_version – Get the version number for this preprocessor class.

  • preferred_batch_format – Batch format hint for upstream producers to try yielding best block format.

  • serialize – Serialize this preprocessor to a string or bytes.

  • set_preprocessor_class_id – Set the preprocessor class identifier for this preprocessor class.

  • set_version – Set the version number for this preprocessor class.

  • transform – Transform the given dataset.

  • transform_batch – Transform a single batch of data.

Attributes

  • MAGIC_CLOUDPICKLE

  • SERIALIZER_FORMAT_VERSION

  • columns

  • num_features

  • output_column

  • stat_computation_plan