ray.data.datasource.RowBasedFileDatasink#

class ray.data.datasource.RowBasedFileDatasink(path: str, *, filesystem: pyarrow.fs.FileSystem | None = None, try_create_dir: bool = True, open_stream_args: Dict[str, Any] | None = None, filename_provider: FilenameProvider | None = None, dataset_uuid: str | None = None, file_format: str | None = None)[source]#

Bases: _FileDatasink

A datasink that writes one row to each file.

Subclasses must implement write_row_to_file and call the superclass constructor.

Examples

import io
from typing import Any, Dict

import pyarrow
from PIL import Image

from ray.data.datasource import RowBasedFileDatasink

class ImageDatasink(RowBasedFileDatasink):
    def __init__(self, path: str, *, column: str, file_format: str = "png"):
        super().__init__(path, file_format=file_format)
        self._file_format = file_format
        self._column = column

    def write_row_to_file(self, row: Dict[str, Any], file: "pyarrow.NativeFile"):
        image = Image.fromarray(row[self._column])
        buffer = io.BytesIO()
        image.save(buffer, format=self._file_format)
        file.write(buffer.getvalue())

DeveloperAPI: This API may change across minor Ray releases.

Methods

__init__

Initialize this datasink.

get_name

Return a human-readable name for this datasink.

on_write_failed

Callback for when a write job fails.

on_write_start

Create a directory to write files to.

write_row_to_file

Write a row to a file.

Attributes

num_rows_per_write

The target number of rows to pass to each write() call.

supports_distributed_writes

If False, only launch write tasks on the driver's node.