ray.train.ScalingConfig

class ray.train.ScalingConfig(trainer_resources: dict | None = None, num_workers: int | Domain | Dict[str, List] = 1, use_gpu: bool | Domain | Dict[str, List] = False, resources_per_worker: Dict | Domain | Dict[str, List] | None = None, placement_strategy: str | Domain | Dict[str, List] = 'PACK', accelerator_type: str | None = None, use_tpu: bool = False, topology: str | None = None)

Bases: ScalingConfig

Configuration for scaling training.

Parameters:
  • num_workers – The number of workers (Ray actors) to launch. Each worker will reserve 1 CPU by default. The number of CPUs reserved by each worker can be overridden with the resources_per_worker argument.

  • use_gpu – If True, training will be done on GPUs (1 per worker). Defaults to False. The number of GPUs reserved by each worker can be overridden with the resources_per_worker argument.

  • resources_per_worker – If specified, the resources defined in this Dict are reserved for each worker. Define the "CPU" and "GPU" keys (case-sensitive) to override the number of CPUs or GPUs used by each worker.

  • placement_strategy – The placement strategy to use for the placement group of the Ray actors. See Placement Group Strategies for the possible options.

  • accelerator_type – [Experimental] If specified, Ray Train will launch the training coordinator and workers on the nodes with the specified type of accelerators. See the available accelerator types. Ensure that your cluster has instances with the specified accelerator type or is able to autoscale to fulfill the request. This field is required when use_tpu is True and num_workers is greater than 1.

  • use_tpu – [Experimental] If True, training will be done on TPUs (1 TPU VM per worker). Defaults to False. The number of TPUs reserved by each worker can be overridden with the resources_per_worker argument. This arg enables SPMD execution of the training workload.

  • topology – [Experimental] If specified, Ray Train will launch the training coordinator and workers on nodes with the specified topology. Topology is auto-detected for TPUs and added as Ray node labels. This arg enables SPMD execution of the training workload. This field is required when use_tpu is True and num_workers is greater than 1.

Example

from ray.train import ScalingConfig
scaling_config = ScalingConfig(
    # Number of distributed workers.
    num_workers=2,
    # Turn on/off GPU.
    use_gpu=True,
)
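
A fuller sketch using the remaining parameters documented above; the specific resource counts and the "SPREAD" strategy are illustrative assumptions, not recommendations:

from ray.train import ScalingConfig
scaling_config = ScalingConfig(
    # Number of distributed workers.
    num_workers=4,
    # One GPU per worker.
    use_gpu=True,
    # Override the default of 1 CPU (and 1 GPU) reserved per worker.
    resources_per_worker={"CPU": 4, "GPU": 1},
    # Spread workers across nodes rather than packing them together.
    placement_strategy="SPREAD",
)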

Methods

as_placement_group_factory

Returns a PlacementGroupFactory to specify resources for Tune.

from_placement_group_factory

Create a ScalingConfig from a Tune PlacementGroupFactory.
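
A minimal sketch of both methods, assuming a scaling_config built as in the example above; the round trip through Tune's PlacementGroupFactory is shown only for illustration:

from ray.train import ScalingConfig
scaling_config = ScalingConfig(num_workers=2, use_gpu=True)
# Express the trainer and worker resources as a Tune PlacementGroupFactory.
pgf = scaling_config.as_placement_group_factory()
# Reconstruct an equivalent ScalingConfig from that factory.
restored = ScalingConfig.from_placement_group_factory(pgf)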

Attributes

accelerator_type

additional_resources_per_worker

Resources per worker, not including CPU or GPU resources.

num_cpus_per_worker

The number of CPUs to set per worker.

num_gpus_per_worker

The number of GPUs to set per worker.

num_tpus_per_worker

The number of TPUs to set per worker.

num_workers

placement_strategy

resources_per_worker

topology

total_resources

Map of total resources required for the trainer.

trainer_resources

use_gpu

use_tpu
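
A minimal sketch of reading the derived attributes listed above; the exact contents of total_resources depend on the Ray version's defaults (for example, trainer_resources), so no outputs are asserted:

from ray.train import ScalingConfig
scaling_config = ScalingConfig(num_workers=2, use_gpu=True)
# 1 CPU per worker unless overridden via resources_per_worker.
print(scaling_config.num_cpus_per_worker)
# 1 GPU per worker because use_gpu=True.
print(scaling_config.num_gpus_per_worker)
# Aggregate resource map covering the trainer and all workers.
print(scaling_config.total_resources)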