ray.train.CheckpointConfig
- class ray.train.CheckpointConfig(num_to_keep: int | None = None, checkpoint_score_attribute: str | None = None, checkpoint_score_order: str | None = 'max', checkpoint_frequency: int | None = 0, checkpoint_at_end: bool | None = None, _checkpoint_keep_all_ranks: bool | None = 'DEPRECATED', _checkpoint_upload_from_workers: bool | None = 'DEPRECATED')
Configurable parameters for defining the checkpointing strategy.
Default behavior is to persist all checkpoints to disk. If num_to_keep is set, the default retention policy is to keep the checkpoints with the maximum timestamp, i.e. the most recent checkpoints.
- Parameters:
  - num_to_keep – The number of checkpoints to keep on disk for this run. If a checkpoint is persisted to disk after there are already this many checkpoints, an existing checkpoint is deleted. If this is None, checkpoints are never deleted. Must be >= 1.
  - checkpoint_score_attribute – The attribute used to score checkpoints when deciding which checkpoints to keep on disk once there are more than num_to_keep checkpoints. This attribute must be a key from the checkpoint dictionary with a numerical value. By default, the most recent checkpoints are kept.
  - checkpoint_score_order – Either "max" or "min". If "max", checkpoints with the highest values of checkpoint_score_attribute are kept. If "min", checkpoints with the lowest values of checkpoint_score_attribute are kept.
  - checkpoint_frequency – Number of iterations between checkpoints. If 0, checkpointing is disabled. Note that most trainers still save one checkpoint at the end of training. This attribute is only supported by trainers that don't take in custom training loops.
  - checkpoint_at_end – If True, a checkpoint is saved at the end of training. This attribute is only supported by trainers that don't take in custom training loops. Defaults to True for trainers that support it and False for generic function trainables.
  - _checkpoint_keep_all_ranks – This experimental config is deprecated. This behavior is now controlled by reporting checkpoint=None from the workers that shouldn't persist a checkpoint. For example, if you only want the rank 0 worker to persist a checkpoint (e.g., in standard data parallel training), save and report a checkpoint if ray.train.get_context().get_world_rank() == 0 and None otherwise (see the sketch after this parameter list).
  - _checkpoint_upload_from_workers – This experimental config is deprecated. Uploading checkpoints directly from workers is now the default behavior.
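For example, a minimal usage sketch that keeps only the two best-scoring checkpoints; the "mean_accuracy" metric name and the experiment name are placeholder assumptions, not part of the API:

```python
from ray.train import CheckpointConfig, RunConfig

# Keep only the 2 highest-scoring checkpoints, ranked by the reported
# "mean_accuracy" metric (higher is better).
checkpoint_config = CheckpointConfig(
    num_to_keep=2,
    checkpoint_score_attribute="mean_accuracy",
    checkpoint_score_order="max",
)

# Attach the checkpointing strategy to a run via RunConfig.
run_config = RunConfig(
    name="my_experiment",  # placeholder experiment name
    checkpoint_config=checkpoint_config,
)
```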
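The per-rank reporting pattern referenced under _checkpoint_keep_all_ranks looks roughly like the following sketch, assuming a data-parallel trainer drives this function on each worker; the loop body, metrics, and checkpoint contents are placeholders:

```python
import os
import tempfile

import ray.train
from ray.train import Checkpoint


def train_loop_per_worker(config):
    for epoch in range(config.get("num_epochs", 3)):
        # ... run one epoch of training on this worker ...
        metrics = {"epoch": epoch}  # placeholder metrics

        if ray.train.get_context().get_world_rank() == 0:
            # Only the rank 0 worker writes and reports a checkpoint.
            with tempfile.TemporaryDirectory() as tmpdir:
                with open(os.path.join(tmpdir, "model.txt"), "w") as f:
                    f.write(f"state after epoch {epoch}")  # placeholder state
                ray.train.report(
                    metrics, checkpoint=Checkpoint.from_directory(tmpdir)
                )
        else:
            # Every other worker reports the same metrics with checkpoint=None.
            ray.train.report(metrics, checkpoint=None)
```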