Ray Train Configuration User Guide
This guide gives an overview of how to configure scale-out, run options, and fault tolerance for Ray Train. For details on how to configure data ingest, also see Configuring Training Datasets.
Scaling Configurations in Train (ScalingConfig)
The scaling configuration specifies distributed training properties like the number of workers or the resources per worker.
The properties of the scaling configuration are tunable.
from ray.air import ScalingConfig

scaling_config = ScalingConfig(
    # Number of distributed workers.
    num_workers=2,
    # Turn on/off GPU.
    use_gpu=True,
    # Specify resources used for the trainer.
    trainer_resources={"CPU": 1},
    # Try to schedule workers on different nodes.
    placement_strategy="SPREAD",
)
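As a minimal sketch of how this is used (assuming the PyTorch integration and a user-defined train_loop_per_worker function, which are not part of the example above), the scaling configuration is passed to the trainer constructor:

from ray.train.torch import TorchTrainer

def train_loop_per_worker():
    # Per-worker training logic goes here (placeholder).
    ...

trainer = TorchTrainer(
    train_loop_per_worker,
    # The scaling configuration defined above.
    scaling_config=scaling_config,
)
result = trainer.fit()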
Run Configuration in Train (RunConfig)
The run configuration specifies properties of the run itself, such as the experiment name, the directory where results are stored, and the verbosity level.
The properties of the run configuration are not tunable.
from ray.air import RunConfig

run_config = RunConfig(
    # Name of the training run (directory name).
    name="my_train_run",
    # Directory to store results in (will be local_dir/name).
    local_dir="~/ray_results",
    # Low training verbosity.
    verbose=1,
)
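Like the scaling configuration, the run configuration is passed to the trainer. A minimal sketch, again assuming a TorchTrainer and a user-defined train_loop_per_worker:

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=scaling_config,
    # The run configuration defined above.
    run_config=run_config,
)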
Failure Configurations in Train (FailureConfig)
The failure configuration specifies how training failures should be dealt with.
As part of the RunConfig, the properties of the failure configuration are not tunable.
from ray.air import RunConfig, FailureConfig

run_config = RunConfig(
    failure_config=FailureConfig(
        # Tries to recover a run up to this many times.
        max_failures=2
    )
)
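For reference, max_failures=0 disables retries and max_failures=-1 keeps retrying indefinitely; recovery resumes from the latest available checkpoint. A sketch of an unlimited-retry setup:

from ray.air import RunConfig, FailureConfig

run_config = RunConfig(
    failure_config=FailureConfig(
        # -1 retries indefinitely; 0 disables retries.
        max_failures=-1
    )
)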
Sync Configurations in Train (SyncConfig)
The sync configuration specifies how to synchronize checkpoints between the Ray cluster and remote storage.
As part of the RunConfig, the properties of the sync configuration are not tunable.
from ray.air import RunConfig
from ray.tune import SyncConfig

run_config = RunConfig(
    sync_config=SyncConfig(
        # This will store checkpoints on S3.
        upload_dir="s3://remote-bucket/location"
    )
)
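As a rough sketch of how the sync and run settings combine (assuming the run name from the earlier RunConfig example), results and checkpoints for the run are synced under <upload_dir>/<name> in remote storage:

from ray.air import RunConfig
from ray.tune import SyncConfig

run_config = RunConfig(
    # With the upload_dir below, this run's results are expected to sync to
    # s3://remote-bucket/location/my_train_run.
    name="my_train_run",
    sync_config=SyncConfig(
        upload_dir="s3://remote-bucket/location"
    ),
)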
Checkpoint Configurations in Train (CheckpointConfig)
The checkpoint configuration specifies how often to checkpoint training state and how many checkpoints to keep.
As part of the RunConfig, the properties of the checkpoint configuration are not tunable.
See the CheckpointConfig API reference for the full set of options.
from ray.air import RunConfig, CheckpointConfig

run_config = RunConfig(
    checkpoint_config=CheckpointConfig(
        # Only keep this many checkpoints.
        num_to_keep=2
    )
)
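The num_to_keep limit applies to checkpoints reported from the training loop. A minimal sketch, assuming the ray.air session API and a simple dict-based checkpoint:

from ray.air import session, Checkpoint

def train_loop_per_worker():
    for epoch in range(10):
        # ... train one epoch here ...
        # Report metrics together with a checkpoint; with num_to_keep=2,
        # only the two most recent checkpoints are retained.
        session.report(
            {"epoch": epoch},
            checkpoint=Checkpoint.from_dict({"epoch": epoch}),
        )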