ray.data.checkpoint.interfaces.CheckpointConfig

class ray.data.checkpoint.interfaces.CheckpointConfig(id_column: str | None = None, checkpoint_path: str | None = None, *, delete_checkpoint_on_success: bool = True, override_filesystem: pyarrow.fs.FileSystem | None = None, override_backend: CheckpointBackend | None = None, filter_num_threads: int = 3, write_num_threads: int = 3, checkpoint_path_partition_filter: PathPartitionFilter | None = None)

Configuration for checkpointing.

Parameters:
  • id_column – Name of the ID column in the input dataset. ID values must be unique across all rows in the dataset, and the column must be preserved through every operator.

  • checkpoint_path – Path where checkpoint data is stored. It can be a cloud object storage path (e.g. s3://bucket/path) or a file system path. A file system path must be on a network-mounted file system (e.g. /mnt/cluster_storage/) that is accessible to the entire cluster. If not set, defaults to RAY_DATA_CHECKPOINT_PATH_BUCKET/ray_data_checkpoint.

  • delete_checkpoint_on_success – If true, automatically delete checkpoint data when the dataset execution succeeds. Currently supported only for the batch-based backend.

  • override_filesystem – Override the pyarrow.fs.FileSystem object used to read/write checkpoint data. Use this when you want to use custom credentials.

  • override_backend – Override the CheckpointBackend object used to access the checkpoint backend storage.

  • filter_num_threads – Number of threads used to filter checkpointed rows.

  • write_num_threads – Number of threads used to write checkpoint files for completed rows.

  • checkpoint_path_partition_filter – Filter for checkpoint files to load during restoration when reading from checkpoint_path.
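The parameters above might be put together roughly as follows. This is a minimal sketch, not a definitive recipe: the column name "id", the ray_data_checkpoint subdirectory under /mnt/cluster_storage/, and the helper build_checkpoint_config are illustrative assumptions. Only the keyword names and defaults come from the signature above. The Ray import is deferred into the helper so the snippet loads even where this beta module is unavailable.

```python
# Keyword arguments mirroring the documented signature.
# All values here are illustrative assumptions, not recommended settings.
checkpoint_kwargs = {
    "id_column": "id",  # assumed name of a unique, preserved ID column
    "checkpoint_path": "/mnt/cluster_storage/ray_data_checkpoint",  # assumed shared mount
    "delete_checkpoint_on_success": False,  # keep checkpoint data for inspection
    "filter_num_threads": 3,  # documented default
    "write_num_threads": 3,  # documented default
}


def build_checkpoint_config(kwargs):
    """Hypothetical helper: construct a CheckpointConfig from keyword arguments.

    The import happens here, at call time, so defining this function does not
    require a Ray installation that ships the beta checkpoint module.
    """
    from ray.data.checkpoint.interfaces import CheckpointConfig

    return CheckpointConfig(**kwargs)
```

On a cluster where the module is available, build_checkpoint_config(checkpoint_kwargs) would return the config object. How that object is then attached to a dataset run is not covered in this section.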

PublicAPI (beta): This API is in beta and may change before becoming stable.
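Because id_column values must be unique across all rows, it can help to check a sample of the input before enabling checkpointing. The sketch below is plain Python and not part of the Ray API. The function find_duplicate_ids and the sample rows are hypothetical, and rows are assumed to be dictionaries keyed by column name.

```python
from collections import Counter


def find_duplicate_ids(rows, id_column):
    """Return ID values that occur more than once, in first-seen order."""
    counts = Counter(row[id_column] for row in rows)
    return [value for value, n in counts.items() if n > 1]


# Hypothetical sample rows; the ID 1 appears twice.
sample_rows = [
    {"id": 1, "text": "a"},
    {"id": 2, "text": "b"},
    {"id": 1, "text": "c"},
]

duplicates = find_duplicate_ids(sample_rows, "id")
print(duplicates)  # [1]
```

An empty result on a representative sample does not prove uniqueness for the full dataset. It only catches obvious violations early.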