Working with Text ================= With Ray Data, you can easily read and transform large amounts of text data. This guide shows you how to: * :ref:`Read text files ` * :ref:`Transform text data ` * :ref:`Perform inference on text data ` * :ref:`Save text data ` .. _reading-text-files: Reading text files ------------------ Ray Data can read lines of text and JSONL. Alternatively, you can read raw binary files and manually decode data. .. tab-set:: .. tab-item:: Text lines To read lines of text, call :func:`~ray.data.read_text`. Ray Data creates a row for each line of text. In the schema, the column name defaults to "text". .. testcode:: import ray ds = ray.data.read_text("s3://anonymous@ray-example-data/this.txt") ds.show(3) .. testoutput:: {'text': 'The Zen of Python, by Tim Peters'} {'text': 'Beautiful is better than ugly.'} {'text': 'Explicit is better than implicit.'} .. tab-item:: JSON Lines `JSON Lines `_ is a text format for structured data. It's typically used to process data one record at a time. To read JSON Lines files, call :func:`~ray.data.read_json`. Ray Data creates a row for each JSON object. .. testcode:: import ray ds = ray.data.read_json("s3://anonymous@ray-example-data/logs.json") ds.show(3) .. testoutput:: {'timestamp': datetime.datetime(2022, 2, 8, 15, 43, 41), 'size': 48261360} {'timestamp': datetime.datetime(2011, 12, 29, 0, 19, 10), 'size': 519523} {'timestamp': datetime.datetime(2028, 9, 9, 5, 6, 7), 'size': 2163626} .. tab-item:: Other formats To read other text formats, call :func:`~ray.data.read_binary_files`. Then, call :meth:`~ray.data.Dataset.map` to decode your data. .. testcode:: from typing import Any, Dict from bs4 import BeautifulSoup import ray def parse_html(row: Dict[str, Any]) -> Dict[str, Any]: html = row["bytes"].decode("utf-8") soup = BeautifulSoup(html, features="html.parser") return {"text": soup.get_text().strip()} ds = ( ray.data.read_binary_files("s3://anonymous@ray-example-data/index.html") .map(parse_html) ) ds.show() .. testoutput:: {'text': 'Batoidea\nBatoidea is a superorder of cartilaginous fishes...'} For more information on reading files, see :ref:`Loading data `. .. _transforming-text: Transforming text ----------------- To transform text, implement your transformation in a function or callable class. Then, call :meth:`Dataset.map() ` or :meth:`Dataset.map_batches() `. Ray Data transforms your text in parallel. .. testcode:: from typing import Any, Dict import ray def to_lower(row: Dict[str, Any]) -> Dict[str, Any]: row["text"] = row["text"].lower() return row ds = ( ray.data.read_text("s3://anonymous@ray-example-data/this.txt") .map(to_lower) ) ds.show(3) .. testoutput:: {'text': 'the zen of python, by tim peters'} {'text': 'beautiful is better than ugly.'} {'text': 'explicit is better than implicit.'} For more information on transforming data, see :ref:`Transforming data `. .. _performing-inference-on-text: Performing inference on text ---------------------------- To perform inference with a pre-trained model on text data, implement a callable class that sets up and invokes a model. Then, call :meth:`Dataset.map_batches() `. .. testcode:: from typing import Dict import numpy as np from transformers import pipeline import ray class TextClassifier: def __init__(self): self.model = pipeline("text-classification") def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, list]: predictions = self.model(list(batch["text"])) batch["label"] = [prediction["label"] for prediction in predictions] return batch ds = ( ray.data.read_text("s3://anonymous@ray-example-data/this.txt") .map_batches(TextClassifier, compute=ray.data.ActorPoolStrategy(size=2), batch_size="auto") ) ds.show(3) .. testoutput:: {'text': 'The Zen of Python, by Tim Peters', 'label': 'POSITIVE'} {'text': 'Beautiful is better than ugly.', 'label': 'POSITIVE'} {'text': 'Explicit is better than implicit.', 'label': 'POSITIVE'} For more information on handling large language models, see :ref:`Working with LLMs `. For more information on performing inference, see :ref:`End-to-end: Offline Batch Inference ` and :ref:`Stateful Transforms `. .. _saving-text: Saving text ----------- To save text, call a method like :meth:`~ray.data.Dataset.write_parquet`. Ray Data can save text in many formats. To view the full list of supported file formats, see the :ref:`Saving Data API `. .. testcode:: import ray ds = ray.data.read_text("s3://anonymous@ray-example-data/this.txt") ds.write_parquet("local:///tmp/results") For more information on saving data, see :ref:`Saving data `.