Syncing (tune.SyncConfig, tune.Syncer)
SyncConfig
- class ray.tune.syncer.SyncConfig(upload_dir: Optional[str] = None, syncer: Optional[Union[str, ray.tune.syncer.Syncer]] = 'auto', sync_on_checkpoint: bool = True, sync_period: int = 300, sync_timeout: int = 1800)
Configuration object for syncing.
If an upload_dir is specified, both experiment and trial checkpoints will be stored on remote (cloud) storage. Synchronization then only happens via this remote storage.
- Parameters
upload_dir – Optional URI to sync training results and checkpoints to (e.g. s3://bucket, gs://bucket or hdfs://path). Specifying this will enable cloud-based checkpointing.
syncer – Syncer class to use for synchronizing checkpoints to/from cloud storage. If set to None, no syncing will take place. Defaults to "auto" (auto detect).
sync_on_checkpoint – Force sync-down of trial checkpoint to driver (only applies when cloud storage is not used). If set to False, checkpoint syncing from worker to driver is asynchronous and best-effort. This does not affect persistent storage syncing. Defaults to True.
sync_period – Minimum interval, in seconds, between syncs between nodes.
sync_timeout – Timeout after which running sync processes are aborted. Currently only affects trial-to-cloud syncing.
PublicAPI: This API is stable across Ray releases.
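As an illustration of the parameters above, a SyncConfig might be constructed as follows. This is a minimal sketch: the bucket URI is a placeholder, and the remaining values simply restate the defaults from the signature. How the object is then passed to an experiment is not covered in this section.

```python
from ray import tune

# Assumption: "s3://my-bucket/tune-results" is a placeholder URI, not a real bucket.
sync_config = tune.SyncConfig(
    upload_dir="s3://my-bucket/tune-results",  # enables cloud-based checkpointing
    syncer="auto",                             # auto-detect the syncer (default)
    sync_on_checkpoint=True,                   # default
    sync_period=300,                           # default
    sync_timeout=1800,                         # default
)
```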
Syncer
- class ray.tune.syncer.Syncer(sync_period: float = 300.0)
Syncer class for synchronizing data between Ray nodes and external storage.
This class handles data transfer for two cases:
Synchronizing data from the driver to external storage. This affects experiment-level checkpoints and trial-level checkpoints if no cloud storage is used.
Synchronizing data from remote trainables to external storage.
Synchronizing tasks are usually asynchronous and can be awaited using wait(). The base class implements a wait_or_retry() API that will retry a failed sync command.
The base class also exposes an API to only kick off syncs every sync_period seconds.
DeveloperAPI: This API may change across minor Ray releases.
- abstract sync_up(local_dir: str, remote_dir: str, exclude: Optional[List] = None) -> bool
Synchronize local directory to remote directory.
This function can spawn an asynchronous process that can be awaited in wait().
- Parameters
local_dir – Local directory to sync from.
remote_dir – Remote directory to sync up to. This is a URI (protocol://remote/path).
exclude – Pattern of files to exclude, e.g. ["*/checkpoint_*"] to exclude trial checkpoints.
- Returns
True if sync process has been spawned, False otherwise.
- abstract sync_down(remote_dir: str, local_dir: str, exclude: Optional[List] = None) -> bool
Synchronize remote directory to local directory.
This function can spawn an asynchronous process that can be awaited in wait().
- Parameters
remote_dir – Remote directory to sync down from. This is a URI (protocol://remote/path).
local_dir – Local directory to sync to.
exclude – Pattern of files to exclude, e.g. ["*/checkpoint_*"] to exclude trial checkpoints.
- Returns
True if sync process has been spawned, False otherwise.
- abstract delete(remote_dir: str) -> bool
Delete directory on remote storage.
This function can spawn an asynchronous process that can be awaited in wait().
- Parameters
remote_dir – Remote directory to delete. This is a URI (protocol://remote/path).
- Returns
True if sync process has been spawned, False otherwise.
- retry()
Retry the last sync up, sync down, or delete command.
You should implement this method if you spawn asynchronous syncing processes.
- wait()
Wait for asynchronous sync command to finish.
You should implement this method if you spawn asynchronous syncing processes.
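The interplay of sync_up(), sync_down(), delete(), wait() and retry() can be sketched with the standard library alone. The class below is a hypothetical stand-in, not Ray's implementation: LocalDirSyncer and its private helpers are invented names, the "remote" side is just another local directory rather than a URI, and an actual custom syncer would subclass ray.tune.syncer.Syncer instead. Matching exclude patterns against relative paths with fnmatch is likewise an assumption made for illustration.

```python
import fnmatch
import os
import shutil
import threading
from typing import Callable, List, Optional


class LocalDirSyncer:
    """Hypothetical stand-in mirroring the Syncer method names (not Ray code)."""

    def __init__(self) -> None:
        self._thread: Optional[threading.Thread] = None
        self._last_command: Optional[Callable[[], None]] = None
        self._error: Optional[BaseException] = None

    def _copy(self, src: str, dst: str, exclude: Optional[List[str]]) -> None:
        patterns = exclude or []
        for root, _dirs, files in os.walk(src):
            for name in files:
                rel = os.path.relpath(os.path.join(root, name), src)
                # Skip files whose relative path matches an exclude pattern.
                if any(fnmatch.fnmatch(rel, p) for p in patterns):
                    continue
                target = os.path.join(dst, rel)
                os.makedirs(os.path.dirname(target), exist_ok=True)
                shutil.copy2(os.path.join(root, name), target)

    def _launch(self, command: Callable[[], None]) -> bool:
        if self._thread is not None and self._thread.is_alive():
            return False  # A previous command is still running; nothing spawned.
        self._last_command = command
        self._error = None

        def run() -> None:
            try:
                command()
            except Exception as exc:  # Recorded here, re-raised in wait().
                self._error = exc

        self._thread = threading.Thread(target=run, daemon=True)
        self._thread.start()
        return True

    def sync_up(self, local_dir: str, remote_dir: str,
                exclude: Optional[List[str]] = None) -> bool:
        return self._launch(lambda: self._copy(local_dir, remote_dir, exclude))

    def sync_down(self, remote_dir: str, local_dir: str,
                  exclude: Optional[List[str]] = None) -> bool:
        return self._launch(lambda: self._copy(remote_dir, local_dir, exclude))

    def delete(self, remote_dir: str) -> bool:
        return self._launch(lambda: shutil.rmtree(remote_dir))

    def wait(self) -> None:
        # Block until the background command finishes, then surface its error.
        if self._thread is not None:
            self._thread.join()
            self._thread = None
        if self._error is not None:
            error, self._error = self._error, None
            raise error

    def retry(self) -> bool:
        # Re-run the most recent sync up, sync down, or delete command.
        if self._last_command is None:
            raise RuntimeError("No sync command to retry.")
        return self._launch(self._last_command)
```

Each command returns True only if a background thread was actually started, which matches the "True if sync process has been spawned" contract described for the abstract methods. Remembering the last command as a callable is one simple way to make retry() possible without re-passing arguments.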
- sync_up_if_needed(local_dir: str, remote_dir: str, exclude: Optional[List] = None) -> bool
Syncs up if time since last sync up is greater than sync_period.
- Parameters
local_dir – Local directory to sync from.
remote_dir – Remote directory to sync up to. This is a URI (protocol://remote/path).
exclude – Pattern of files to exclude, e.g. ["*/checkpoint_*"] to exclude trial checkpoints.
- sync_down_if_needed(remote_dir: str, local_dir: str, exclude: Optional[List] = None)
Syncs down if time since last sync down is greater than sync_period.
- Parameters
remote_dir – Remote directory to sync down from. This is a URI (protocol://remote/path).
local_dir – Local directory to sync to.
exclude – Pattern of files to exclude, e.g. ["*/checkpoint_*"] to exclude trial checkpoints.
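The "if needed" behaviour of sync_up_if_needed() and sync_down_if_needed() amounts to rate-limiting an action by elapsed time. The sketch below shows that idea in isolation. PeriodicTrigger, run_if_needed and the injectable clock are hypothetical names introduced here, and the exact comparison and bookkeeping used by Ray may differ.

```python
import time
from typing import Callable, Optional


class PeriodicTrigger:
    """Hypothetical helper: runs an action only if more than `period` seconds
    have passed since the last time the action was started."""

    def __init__(self, period: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.period = period
        self._clock = clock  # Injectable so the logic can be tested without sleeping.
        self._last_run: Optional[float] = None

    def run_if_needed(self, action: Callable[[], bool]) -> bool:
        now = self._clock()
        if self._last_run is not None and now - self._last_run <= self.period:
            return False  # Not enough time has passed; skip this round.
        started = action()
        if started:
            # Only record the time when the action was actually started.
            self._last_run = now
        return started
```

A caller would wrap a sync command in the action, for example `trigger.run_if_needed(lambda: syncer.sync_up(local_dir, remote_dir))`, where `trigger` and `syncer` are instances created by the caller.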