Ray Train Configuration User Guide

This guide gives an overview of how to configure scale-out, run options, and fault tolerance for Ray Train. For details on configuring data ingest, also refer to Configuring Training Datasets.

Scaling Configurations in Train (ScalingConfig)

The scaling configuration specifies distributed training properties like the number of workers or the resources per worker.

The properties of the scaling configuration are tunable: they can be treated as hyperparameters and searched over with Ray Tune.

ScalingConfig API reference

from ray.air import ScalingConfig

scaling_config = ScalingConfig(
    # Number of distributed workers.
    num_workers=2,
    # Whether each worker should use a GPU (one GPU per worker by default).
    use_gpu=True,
    # Resources reserved for the trainer actor itself, separate from the workers.
    trainer_resources={"CPU": 1},
    # Try to schedule workers on different nodes.
    placement_strategy="SPREAD",
)
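
A scaling configuration takes effect when it is passed to a trainer. The following is a minimal sketch assuming a PyTorch setup; train_loop is a placeholder for your own per-worker training function:

from ray.train.torch import TorchTrainer

def train_loop(config):
    # Per-worker training logic goes here.
    ...

trainer = TorchTrainer(
    train_loop_per_worker=train_loop,
    scaling_config=scaling_config,
)
result = trainer.fit()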

Run Configuration in Train (RunConfig)

The run configuration specifies runtime properties of the training run, such as the experiment name, the directory to store results in, and the verbosity level. It is also where the failure, sync, and checkpoint configurations described below are set.

The properties of the run configuration are not tunable.

RunConfig API reference

from ray.air import RunConfig

run_config = RunConfig(
    # Name of the training run (directory name).
    name="my_train_run",
    # Directory to store results in (will be local_dir/name).
    local_dir="~/ray_results",
    # Training verbosity: 0 = silent, 1 = status updates only,
    # 2 = brief results, 3 = detailed results.
    verbose=1,
)
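
With this configuration, results are written to ~/ray_results/my_train_run. Like the scaling configuration, the run configuration takes effect when passed to a trainer; a minimal sketch, reusing scaling_config and the placeholder train_loop from above:

from ray.train.torch import TorchTrainer

trainer = TorchTrainer(
    train_loop_per_worker=train_loop,
    scaling_config=scaling_config,
    run_config=run_config,
)
result = trainer.fit()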

Failure Configurations in Train (FailureConfig)

The failure configuration specifies how to handle failures during training.

As part of the RunConfig, the properties of the failure configuration are not tunable.

FailureConfig API reference

from ray.air import RunConfig, FailureConfig

run_config = RunConfig(
    failure_config=FailureConfig(
        # Tries to recover a run up to this many times.
        max_failures=2
    )
)
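
max_failures also accepts sentinel values: per the FailureConfig documentation, 0 disables retries and -1 retries indefinitely. A short sketch:

from ray.air import RunConfig, FailureConfig

# Disable recovery entirely.
run_config = RunConfig(failure_config=FailureConfig(max_failures=0))

# Keep retrying until the run succeeds.
run_config = RunConfig(failure_config=FailureConfig(max_failures=-1))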

Sync Configurations in Train (SyncConfig)

The sync configuration specifies how to synchronize checkpoints between the Ray cluster and remote storage.

As part of the RunConfig, the properties of the sync configuration are not tunable.

SyncConfig API reference

from ray.air import RunConfig
from ray.tune import SyncConfig

run_config = RunConfig(
    sync_config=SyncConfig(
        # This will store checkpoints on S3.
        upload_dir="s3://remote-bucket/location"
    )
)
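
SyncConfig exposes a few more knobs beyond upload_dir. A sketch assuming Ray 2.x, where sync_period controls how often (in seconds) results are synced to remote storage:

from ray.air import RunConfig
from ray.tune import SyncConfig

run_config = RunConfig(
    sync_config=SyncConfig(
        # Store checkpoints on S3.
        upload_dir="s3://remote-bucket/location",
        # Sync to remote storage at most every 300 seconds (the default).
        sync_period=300,
    )
)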

Checkpoint Configurations in Train (CheckpointConfig)

The checkpoint configuration specifies how often to checkpoint training state and how many checkpoints to keep.

As part of the RunConfig, the properties of the checkpoint configuration are not tunable.

CheckpointConfig API reference

from ray.air import RunConfig, CheckpointConfig

run_config = RunConfig(
    checkpoint_config=CheckpointConfig(
        # Only keep this many checkpoints.
        num_to_keep=2
    )
)
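
num_to_keep can also be combined with a score so that the best checkpoints are retained rather than the most recent ones. A sketch, assuming the training loop reports a metric named "loss" (a placeholder name):

from ray.air import RunConfig, CheckpointConfig

run_config = RunConfig(
    checkpoint_config=CheckpointConfig(
        # Keep the two best checkpoints...
        num_to_keep=2,
        # ...ranked by the reported "loss" metric, where lower is better.
        checkpoint_score_attribute="loss",
        checkpoint_score_order="min",
    )
)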