ray.train.ScalingConfig
- class ray.train.ScalingConfig(trainer_resources: Dict | Domain | Dict[str, List] | None = None, num_workers: int | Domain | Dict[str, List] = 1, use_gpu: bool | Domain | Dict[str, List] = False, resources_per_worker: Dict | Domain | Dict[str, List] | None = None, placement_strategy: str | Domain | Dict[str, List] = 'PACK', accelerator_type: str | None = None)
Configuration for scaling training.
- Parameters:
trainer_resources – Resources to allocate for the training coordinator. The training coordinator launches the worker group and executes the training function per worker, and this process does NOT require GPUs. The coordinator is always scheduled on the same node as the rank 0 worker, so one example use case is to set a minimum amount of resources (e.g. CPU memory) required by the rank 0 node. By default, this assigns 1 CPU to the training coordinator.
num_workers – The number of workers (Ray actors) to launch. Each worker will reserve 1 CPU by default. The number of CPUs reserved by each worker can be overridden with the resources_per_worker argument.
use_gpu – If True, training will be done on GPUs (1 per worker). Defaults to False. The number of GPUs reserved by each worker can be overridden with the resources_per_worker argument.
resources_per_worker – If specified, the resources defined in this Dict are reserved for each worker. Define the "CPU" and "GPU" keys (case-sensitive) to override the number of CPUs or GPUs used by each worker.
placement_strategy – The placement strategy to use for the placement group of the Ray actors. See Placement Group Strategies for the possible options.
accelerator_type – [Experimental] If specified, Ray Train will launch the training coordinator and workers on the nodes with the specified type of accelerators. See the available accelerator types. Ensure that your cluster has instances with the specified accelerator type or is able to autoscale to fulfill the request.
Example
from ray.train import ScalingConfig

scaling_config = ScalingConfig(
    # Number of distributed workers.
    num_workers=2,
    # Turn on/off GPU.
    use_gpu=True,
    # Specify resources used for trainer.
    trainer_resources={"CPU": 1},
    # Try to schedule workers on different nodes.
    placement_strategy="SPREAD",
)
Methods
as_placement_group_factory() – Returns a PlacementGroupFactory to specify resources for Tune.
from_placement_group_factory(pgf) – Creates a ScalingConfig from a Tune PlacementGroupFactory.
Attributes
additional_resources_per_worker – Resources per worker, not including CPU or GPU resources.
num_cpus_per_worker – The number of CPUs to set per worker.
num_gpus_per_worker – The number of GPUs to set per worker.
total_resources – Map of total resources required for the trainer.