ray.data.datasource.FilenameProvider.get_filename_for_block

FilenameProvider.get_filename_for_block(block: pyarrow.Table | pandas.DataFrame | None, write_uuid: str, task_index: int, block_index: int) → str

Generate a filename for a block of data.

Note

Filenames must be unique and deterministic for a given write UUID and task index. Do NOT depend on block content or block_index.

Checkpointing requires predicting the output filename BEFORE writing data. This enables two-phase commit: if a write fails after creating the file but before committing the checkpoint, recovery can use the predicted filename to delete the orphaned file and retry cleanly. If filenames depend on block content, the name cannot be known until the data has been read, so this prediction is impossible and checkpointing cannot guarantee exactly-once semantics.
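The recovery step described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Ray's checkpointing implementation: predict_filename, recover_and_retry, the naming scheme, and the in-memory committed set are all hypothetical stand-ins.

```python
import os
import tempfile


def predict_filename(write_uuid: str, task_index: int) -> str:
    # Hypothetical naming scheme: depends only on (write_uuid, task_index),
    # so the name is known before any data is written.
    return f"{write_uuid}_{task_index:06}.parquet"


def recover_and_retry(out_dir: str, write_uuid: str, task_index: int, committed: set) -> str:
    """Delete an orphaned file from a failed attempt, then write and commit."""
    name = predict_filename(write_uuid, task_index)
    path = os.path.join(out_dir, name)
    if name not in committed and os.path.exists(path):
        # The file exists but was never committed: it is an orphan.
        os.remove(path)
    with open(path, "w") as f:
        f.write("complete data")
    # Commit only after the write has succeeded.
    committed.add(name)
    return name


with tempfile.TemporaryDirectory() as out_dir:
    committed = set()  # stand-in for a durable checkpoint store
    # Simulate a failed attempt: the file was created, but no commit happened.
    orphan_path = os.path.join(out_dir, predict_filename("w1", 3))
    with open(orphan_path, "w") as f:
        f.write("partial data")

    name = recover_and_retry(out_dir, "w1", 3, committed)
    with open(os.path.join(out_dir, name)) as f:
        recovered_content = f.read()
    files_after_retry = sorted(os.listdir(out_dir))
```

Because the retry computes the same name as the failed attempt, the orphan is replaced rather than left beside a second copy of the data.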

Parameters:
  • block – Deprecated, unused. Do not depend on block content.

  • write_uuid – The UUID of the write operation.

  • task_index – The index of the write task.

  • block_index – Deprecated, always 0. Do not depend on this value.
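A minimal sketch of an implementation that follows these parameter rules is shown here. The class name TaskScopedFilenameProvider, the file_format argument, and the naming scheme are assumptions made for illustration. The class is a plain stand-in so the snippet runs without Ray installed; in real use it would presumably subclass ray.data.datasource.FilenameProvider.

```python
from typing import Any, Optional


class TaskScopedFilenameProvider:
    """Hypothetical stand-in for a FilenameProvider subclass."""

    def __init__(self, file_format: str = "parquet"):
        self.file_format = file_format

    def get_filename_for_block(
        self,
        block: Optional[Any],
        write_uuid: str,
        task_index: int,
        block_index: int,
    ) -> str:
        # block and block_index are deliberately ignored: the name is derived
        # from (write_uuid, task_index) alone.
        return f"{write_uuid}_{task_index:06}.{self.file_format}"


provider = TaskScopedFilenameProvider()
first = provider.get_filename_for_block(None, "w1", 0, 0)
retry = provider.get_filename_for_block(None, "w1", 0, 0)
other_task = provider.get_filename_for_block(None, "w1", 1, 0)
```

Calling the method twice with the same write_uuid and task_index yields the same name, while a different task_index yields a different one, which covers both the determinism and the uniqueness requirement within a single write.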

Warning

DEPRECATED: This API may be removed in a future Ray release. Use get_filename_for_task() instead. The block and block_index parameters are unused in practice because datasinks merge all blocks into one before writing, and both parameters will be removed in a future release. Do not depend on block content or block_index in your FilenameProvider implementation: filenames must be deterministic from (write_uuid, task_index) alone to ensure checkpointing correctness.