- class ray.train.CheckpointConfig(num_to_keep: int | None = None, checkpoint_score_attribute: str | None = None, checkpoint_score_order: str | None = 'max', checkpoint_frequency: int | None = 0, checkpoint_at_end: bool | None = None, _checkpoint_keep_all_ranks: bool | None = 'DEPRECATED', _checkpoint_upload_from_workers: bool | None = 'DEPRECATED')#
Configurable parameters for defining the checkpointing strategy.
Default behavior is to persist all checkpoints to disk. If `num_to_keep` is set, the default retention policy is to keep the checkpoints with maximum timestamp, i.e., the most recent checkpoints.
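As a minimal usage sketch, a `CheckpointConfig` is typically passed to a `RunConfig`; the `"loss"` metric key here is a hypothetical example of something reported from a training loop:

```python
from ray.train import CheckpointConfig, RunConfig

# Keep only the two best checkpoints, ranked by a reported "loss"
# metric (hypothetical key), where lower is better.
checkpoint_config = CheckpointConfig(
    num_to_keep=2,
    checkpoint_score_attribute="loss",
    checkpoint_score_order="min",
)
run_config = RunConfig(checkpoint_config=checkpoint_config)
```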
num_to_keep – The number of checkpoints to keep on disk for this run. If a checkpoint is persisted to disk after there are already this many checkpoints, then an existing checkpoint will be deleted. If this is `None`, then checkpoints will not be deleted. Must be >= 1.
checkpoint_score_attribute – The attribute used to score checkpoints when deciding which ones to keep on disk once there are more than `num_to_keep` checkpoints. This attribute must be a key from the checkpoint dictionary with a numerical value. By default, the most recent checkpoints are kept.
checkpoint_score_order – Either “max” or “min”. If “max”, checkpoints with the highest values of `checkpoint_score_attribute` will be kept. If “min”, checkpoints with the lowest values of `checkpoint_score_attribute` will be kept.
checkpoint_frequency – Number of iterations between checkpoints. If 0, checkpointing is disabled. Note that most trainers will still save one checkpoint at the end of training. This attribute is only supported by trainers that don’t take in custom training loops.
checkpoint_at_end – If True, will save a checkpoint at the end of training. This attribute is only supported by trainers that don’t take in custom training loops. Defaults to True for trainers that support it and False for generic function trainables.
_checkpoint_keep_all_ranks – This experimental config is deprecated. This behavior is now controlled by reporting `checkpoint=None` in the workers that shouldn’t persist a checkpoint. For example, if you only want the rank 0 worker to persist a checkpoint (e.g., in standard data parallel training), then you should save and report a checkpoint if `ray.train.get_context().get_world_rank() == 0` and report `checkpoint=None` otherwise.
_checkpoint_upload_from_workers – This experimental config is deprecated. Uploading checkpoints directly from workers is now the default behavior.
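The retention policy described above (keep the `num_to_keep` checkpoints with the best `checkpoint_score_attribute`, ordered by `checkpoint_score_order`) can be illustrated with a small standalone sketch. This is not Ray’s implementation, just a self-contained mimic of the documented behavior; `retained_checkpoints` and its arguments are hypothetical names:

```python
def retained_checkpoints(checkpoints, num_to_keep, score_attribute, score_order="max"):
    """Return the subset of checkpoint dicts that would be kept on disk.

    Mimics the documented policy: if num_to_keep is None, nothing is
    deleted; otherwise keep the num_to_keep checkpoints with the best
    score, where "best" means highest ("max") or lowest ("min").
    """
    if num_to_keep is None:
        return list(checkpoints)  # None disables deletion entirely
    reverse = score_order == "max"  # highest scores first for "max"
    ranked = sorted(checkpoints, key=lambda c: c[score_attribute], reverse=reverse)
    return ranked[:num_to_keep]


# Example: keep the 2 checkpoints with the lowest loss.
history = [
    {"iter": 1, "loss": 0.9},
    {"iter": 2, "loss": 0.5},
    {"iter": 3, "loss": 0.7},
]
kept = retained_checkpoints(history, 2, "loss", "min")
# The checkpoints from iterations 2 and 3 survive; iteration 1 is deleted.
```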