ray.data.preprocessors.Tokenizer

class ray.data.preprocessors.Tokenizer(columns: List[str], tokenization_fn: Optional[Callable[[str], List[str]]] = None)

Bases: ray.data.preprocessor.Preprocessor

Replace each string with a list of tokens.

Examples

>>> import pandas as pd
>>> import ray
>>> df = pd.DataFrame({"text": ["Hello, world!", "foo bar\nbaz"]})
>>> ds = ray.data.from_pandas(df)  

The default tokenization_fn splits strings on the space character only, so punctuation stays attached to tokens and other whitespace such as \n isn't treated as a delimiter.

>>> from ray.data.preprocessors import Tokenizer
>>> tokenizer = Tokenizer(columns=["text"])
>>> tokenizer.transform(ds).to_pandas()  
               text
0  [Hello,, world!]
1   [foo, bar\nbaz]

If the default logic isn’t adequate for your use case, you can specify a custom tokenization_fn.

>>> import string
>>> def tokenization_fn(s):
...     # Strip punctuation, then split on any whitespace.
...     for character in string.punctuation:
...         s = s.replace(character, "")
...     return s.split()
>>> tokenizer = Tokenizer(columns=["text"], tokenization_fn=tokenization_fn)
>>> tokenizer.transform(ds).to_pandas()  
              text
0   [Hello, world]
1  [foo, bar, baz]
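
For instance, to split on any run of whitespace (including the \n in "bar\nbaz") rather than only on spaces, you can pass a function that calls str.split with no arguments. This is a minimal sketch rather than a built-in option of Tokenizer, and the output is shown for illustration.

>>> tokenizer = Tokenizer(columns=["text"], tokenization_fn=lambda s: s.split())
>>> tokenizer.transform(ds).to_pandas()  
               text
0  [Hello,, world!]
1   [foo, bar, baz]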

Parameters
  • columns – The columns to tokenize.

  • tokenization_fn – The function used to generate tokens. This function should accept a string as input and return a list of tokens as output. If unspecified, the tokenizer uses a function equivalent to lambda s: s.split(" ").

PublicAPI (alpha): This API is in alpha and may change before becoming stable.