ray.train.CheckpointConfig

class ray.train.CheckpointConfig(num_to_keep: int | None = None, checkpoint_score_attribute: str | None = None, checkpoint_score_order: str | None = 'max', checkpoint_frequency: int | None = 0, checkpoint_at_end: bool | None = None, _checkpoint_keep_all_ranks: bool | None = 'DEPRECATED', _checkpoint_upload_from_workers: bool | None = 'DEPRECATED')

Configurable parameters for defining the checkpointing strategy.

The default behavior is to persist all checkpoints to disk. If num_to_keep is set, the default retention policy is to keep the checkpoints with the maximum timestamp, i.e., the most recent checkpoints.
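For example, a minimal sketch of configuring checkpoint retention (the metric name "loss" is a placeholder; it must match a numerical key that the training code reports alongside each checkpoint):

    from ray.train import CheckpointConfig, RunConfig

    # Keep only the 2 checkpoints with the lowest reported "loss" value;
    # other checkpoints are deleted as new ones are persisted.
    checkpoint_config = CheckpointConfig(
        num_to_keep=2,
        checkpoint_score_attribute="loss",
        checkpoint_score_order="min",
    )

    run_config = RunConfig(checkpoint_config=checkpoint_config)

The resulting RunConfig is then passed to a trainer, for example ray.train.torch.TorchTrainer(train_loop_per_worker, run_config=run_config, ...).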

Parameters:
  • num_to_keep – The number of checkpoints to keep on disk for this run. If a checkpoint is persisted to disk after there are already this many checkpoints, then an existing checkpoint will be deleted. If this is None then checkpoints will not be deleted. Must be >= 1.

  • checkpoint_score_attribute – The attribute that will be used to score checkpoints to determine which checkpoints should be kept on disk when there are more than num_to_keep checkpoints. This attribute must be a key from the checkpoint dictionary that has a numerical value. By default, the most recent checkpoints are kept.

  • checkpoint_score_order – Either “max” or “min”. If “max”, then checkpoints with highest values of checkpoint_score_attribute will be kept. If “min”, then checkpoints with lowest values of checkpoint_score_attribute will be kept.

  • checkpoint_frequency – Number of iterations between checkpoints. If 0, checkpointing is disabled. Note that most trainers will still save one checkpoint at the end of training. This attribute is only supported by trainers that don’t take in custom training loops.

  • checkpoint_at_end – If True, will save a checkpoint at the end of training. This attribute is only supported by trainers that don’t take in custom training loops. Defaults to True for trainers that support it and False for generic function trainables.

  • _checkpoint_keep_all_ranks – This experimental config is deprecated. This behavior is now controlled by reporting checkpoint=None from the workers that shouldn’t persist a checkpoint. For example, if you only want the rank 0 worker to persist a checkpoint (e.g., in standard data-parallel training), save and report a checkpoint if ray.train.get_context().get_world_rank() == 0 and report None otherwise, as shown in the sketch after this parameter list.

  • _checkpoint_upload_from_workers – This experimental config is deprecated. Uploading checkpoints directly from the workers is now the default behavior.
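A minimal sketch of the rank-0-only reporting pattern described for _checkpoint_keep_all_ranks above, assuming standard data-parallel training (the "loss" value and checkpoint contents are placeholders; a function like this is passed to a trainer such as TorchTrainer):

    import os
    import tempfile

    import ray.train
    from ray.train import Checkpoint

    def train_loop_per_worker(config):
        for epoch in range(3):
            loss = 1.0 / (epoch + 1)  # placeholder for a real training step

            with tempfile.TemporaryDirectory() as tmpdir:
                if ray.train.get_context().get_world_rank() == 0:
                    # Only rank 0 saves model state and attaches a checkpoint.
                    with open(os.path.join(tmpdir, "model.pt"), "wb") as f:
                        f.write(b"placeholder model state")
                    checkpoint = Checkpoint.from_directory(tmpdir)
                else:
                    # All other ranks report the same metrics with checkpoint=None.
                    checkpoint = None

                ray.train.report({"loss": loss}, checkpoint=checkpoint)

All workers report each round, but only the rank 0 worker attaches a checkpoint, so only one copy is persisted per report.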

Attributes

checkpoint_at_end

checkpoint_frequency

checkpoint_score_attribute

checkpoint_score_order

num_to_keep