Configuring Scale and GPUs#

Increasing the scale of a Ray Train training run is simple and can be done in a few lines of code. The main interface for this is the ScalingConfig, which configures the number of workers and the resources they should use.

In this guide, a worker refers to a Ray Train distributed training worker, which is a Ray Actor that runs your training function.

Increasing the number of workers#

The main interface to control parallelism in your training code is to set the number of workers. You can do this by setting the num_workers attribute of the ScalingConfig:

from ray.train import ScalingConfig

scaling_config = ScalingConfig(
    num_workers=8
)
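
You then pass the ScalingConfig to a trainer. Below is a minimal sketch of how this fits together, assuming a placeholder train_func and the TorchTrainer (both illustrative, not part of the original snippet):

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_func():
    # Your distributed training logic runs on each of the 8 workers.
    ...

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=8),
)
trainer.fit()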

Using GPUs#

To use GPUs, pass use_gpu=True to the ScalingConfig. This will request one GPU per training worker. In the example below, training will run on 8 GPUs (8 workers, each using one GPU).

from ray.train import ScalingConfig

scaling_config = ScalingConfig(
    num_workers=8,
    use_gpu=True
)

Using GPUs in the training function#

When use_gpu=True is set, Ray Train will automatically set up environment variables in your training function so that the GPUs can be detected and used (e.g. CUDA_VISIBLE_DEVICES).

You can get the associated devices with ray.train.torch.get_device().

import torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, get_device


def train_func():
    assert torch.cuda.is_available()

    device = get_device()
    assert device == torch.device("cuda:0")

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(
        num_workers=1,
        use_gpu=True
    )
)
trainer.fit()
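
In a real training function you would typically move your model and data onto this device. A minimal sketch, assuming a placeholder model and batch (both illustrative, not from the original example):

import torch
from ray.train.torch import get_device


def train_func():
    device = get_device()

    # Hypothetical model and batch, moved onto the worker's assigned GPU.
    model = torch.nn.Linear(10, 1).to(device)
    batch = torch.randn(32, 10).to(device)
    output = model(batch)

Ray Train's prepare_model utility (ray.train.torch.prepare_model) can also handle device placement for your model.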

Assigning multiple GPUs to a worker#

Sometimes you might want to allocate multiple GPUs to a worker. For example, you can specify resources_per_worker={"GPU": 2} in the ScalingConfig if you want to assign 2 GPUs to each worker.

You can get a list of associated devices with ray.train.torch.get_devices().

import torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, get_device, get_devices


def train_func():
    assert torch.cuda.is_available()

    device = get_device()
    devices = get_devices()
    assert device == torch.device("cuda:0")
    assert devices == [torch.device("cuda:0"), torch.device("cuda:1")]

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(
        num_workers=1,
        use_gpu=True,
        resources_per_worker={"GPU": 2}
    )
)
trainer.fit()
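
A worker that owns two GPUs can then spread work across them. A minimal sketch that places one tensor on each assigned device (the workload is illustrative, not from the original example):

import torch
from ray.train.torch import get_devices


def train_func():
    devices = get_devices()

    # Hypothetical workload: put one tensor shard on each of the worker's GPUs.
    shards = [torch.randn(32, 10).to(device) for device in devices]

Pass this train_func to a TorchTrainer with resources_per_worker={"GPU": 2}, as shown above.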

Setting the GPU type#

Ray Train allows you to specify the accelerator type for each worker. This is useful if you want to run model training on a specific accelerator type. In a heterogeneous Ray cluster, this means that your training workers run only on the specified GPU type, rather than on an arbitrary GPU node. Refer to the available accelerator types for the list of supported accelerator_type values.

For example, you can specify accelerator_type="A100" in the ScalingConfig if you want to assign each worker an NVIDIA A100 GPU.

Tip

Ensure that your cluster has instances with the specified accelerator type or is able to autoscale to fulfill the request.

ScalingConfig(
    num_workers=1,
    use_gpu=True,
    accelerator_type="A100"
)
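
Ray advertises detected accelerator types as node resources, which you can use to check whether the cluster can satisfy the request. A small sketch, assuming Ray's accelerator_type:<type> resource naming convention:

import ray

ray.init()

# Nodes with A100 GPUs are assumed to advertise an "accelerator_type:A100" resource.
print(ray.cluster_resources().get("accelerator_type:A100", 0))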

(PyTorch) Setting the communication backend#

PyTorch Distributed supports multiple backends for communicating tensors across workers. By default, Ray Train uses NCCL when use_gpu=True and Gloo otherwise.

If you want to explicitly override this setting, you can configure a TorchConfig and pass it into the TorchTrainer.

from ray.train import ScalingConfig
from ray.train.torch import TorchConfig, TorchTrainer

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(
        num_workers=num_training_workers,
        use_gpu=True, # Defaults to NCCL
    ),
    torch_config=TorchConfig(backend="gloo"),
)

(NCCL) Setting the communication network interface#

When using NCCL for distributed training, you can configure the network interface cards that are used for communicating between GPUs by setting the NCCL_SOCKET_IFNAME environment variable.

To ensure that the environment variable is set for all training workers, you can pass it in a Ray runtime environment:

import ray

runtime_env = {"env_vars": {"NCCL_SOCKET_IFNAME": "ens5"}}
ray.init(runtime_env=runtime_env)

trainer = TorchTrainer(...)

Setting the resources per worker#

If you want to allocate more than one CPU or GPU per training worker, or if you defined custom cluster resources, set the resources_per_worker attribute:

from ray.train import ScalingConfig

scaling_config = ScalingConfig(
    num_workers=8,
    resources_per_worker={
        "CPU": 4,
        "GPU": 2,
    },
    use_gpu=True,
)

Note

If you specify GPUs in resources_per_worker, you also need to set use_gpu=True.

You can also instruct Ray Train to use fractional GPUs. In that case, multiple workers will be assigned the same CUDA device.

from ray.train import ScalingConfig

scaling_config = ScalingConfig(
    num_workers=8,
    resources_per_worker={
        "CPU": 4,
        "GPU": 0.5,
    },
    use_gpu=True,
)
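
With a fractional value such as 0.5, pairs of workers share a GPU. The following sketch makes the sharing visible by logging each worker's rank and device (the logging is illustrative, not from the original example):

import ray.train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, get_device


def train_func():
    rank = ray.train.get_context().get_world_rank()
    # With "GPU": 0.5, two ranks report the same CUDA device.
    print(f"rank={rank} device={get_device()}")

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(
        num_workers=8,
        use_gpu=True,
        resources_per_worker={"GPU": 0.5},
    ),
)
trainer.fit()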

Trainer resources#

So far we’ve configured resources for each training worker. Technically, each training worker is a Ray Actor. Ray Train also schedules an actor for the Trainer object when you call Trainer.fit().

This object often only manages lightweight communication between the training workers. By default, the trainer uses 1 CPU. If you have a cluster with 8 CPUs and want to start 4 training workers with 2 CPUs each, this will not work, because the total number of required CPUs would be 9 (4 * 2 + 1). In that case, you can configure the trainer resources to use 0 CPUs:

from ray.train import ScalingConfig

scaling_config = ScalingConfig(
    num_workers=4,
    resources_per_worker={
        "CPU": 2,
    },
    trainer_resources={
        "CPU": 0,
    }
)
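
With trainer_resources={"CPU": 0}, the total requirement drops to 8 CPUs (4 * 2 + 0), which fits on the 8-CPU cluster. A minimal end-to-end sketch, assuming a placeholder train_func:

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_func():
    # Training logic for each of the 4 workers goes here.
    ...

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(
        num_workers=4,
        resources_per_worker={"CPU": 2},
        trainer_resources={"CPU": 0},
    ),
)
trainer.fit()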