Resources#
Ray allows you to seamlessly scale your applications from a laptop to a cluster without code change. Ray resources are key to this capability. They abstract away physical machines and let you express your computation in terms of resources, while the system manages scheduling and autoscaling based on resource requests.
A resource in Ray is a key-value pair where the key denotes a resource name, and the value is a float quantity. For convenience, Ray has native support for CPU, GPU, and memory resource types; CPU, GPU and memory are called pre-defined resources. Besides those, Ray also supports custom resources.
Physical Resources and Logical Resources#
Physical resources are resources that a machine physically has such as physical CPUs and GPUs and logical resources are virtual resources defined by a system.
Ray resources are logical and don’t need to have 1-to-1 mapping with physical resources.
For example, you can start a Ray head node with 0 logical CPUs via ray start --head --num-cpus=0
even if it physically has eight
(This signals the Ray scheduler to not schedule any tasks or actors that require logical CPU resources
on the head node, mainly to reserve the head node for running Ray system processes.).
They are mainly used for admission control during scheduling.
The fact that resources are logical has several implications:
Resource requirements of tasks or actors do NOT impose limits on actual physical resource usage. For example, Ray doesn’t prevent a
num_cpus=1
task from launching multiple threads and using multiple physical CPUs. It’s your responsibility to make sure tasks or actors use no more resources than specified via resource requirements.Ray doesn’t provide CPU isolation for tasks or actors. For example, Ray won’t reserve a physical CPU exclusively and pin a
num_cpus=1
task to it. Ray will let the operating system schedule and run the task instead. If needed, you can use operating system APIs likesched_setaffinity
to pin a task to a physical CPU.Ray does provide GPU isolation in the form of visible devices by automatically setting the
CUDA_VISIBLE_DEVICES
environment variable, which most ML frameworks will respect for purposes of GPU assignment.
Note
Ray sets the environment variable OMP_NUM_THREADS=<num_cpus>
if num_cpus
is set on
the task/actor via ray.remote()
and task.options()
/actor.options()
.
Ray sets OMP_NUM_THREADS=1
if num_cpus
is not specified; this
is done to avoid performance degradation with many workers (issue #6998). You can
also override this by explicitly setting OMP_NUM_THREADS
to override anything Ray sets by default.
OMP_NUM_THREADS
is commonly used in numpy, PyTorch, and Tensorflow to perform multi-threaded
linear algebra. In multi-worker setting, we want one thread per worker instead of many threads
per worker to avoid contention. Some other libraries may have their own way to configure
parallelism. For example, if you’re using OpenCV, you should manually set the number of
threads using cv2.setNumThreads(num_threads) (set to 0 to disable multi-threading).
Custom Resources#
Besides pre-defined resources, you can also specify a Ray node’s custom resources and request them in your tasks or actors. Some use cases for custom resources:
Your node has special hardware and you can represent it as a custom resource. Then your tasks or actors can request the custom resource via
@ray.remote(resources={"special_hardware": 1})
and Ray will schedule the tasks or actors to the node that has the custom resource.You can use custom resources as labels to tag nodes and you can achieve label based affinity scheduling. For example, you can do
ray.remote(resources={"custom_label": 0.001})
to schedule tasks or actors to nodes withcustom_label
custom resource. For this use case, the actual quantity doesn’t matter, and the convention is to specify a tiny number so that the label resource is not the limiting factor for parallelism.
Specifying Node Resources#
By default, Ray nodes start with pre-defined CPU, GPU, and memory resources. The quantities of these logical resources on each node are set to the physical quantities auto detected by Ray. By default, logical resources are configured by the following rule.
Warning
Ray does not permit dynamic updates of resource capacities after Ray has been started on a node.
Number of logical CPUs (``num_cpus``): Set to the number of CPUs of the machine/container.
Number of logical GPUs (``num_gpus``): Set to the number of GPUs of the machine/container.
Memory (``memory``): Set to 70% of “available memory” when ray runtime starts.
Object Store Memory (``object_store_memory``): Set to 30% of “available memory” when ray runtime starts. Note that the object store memory is not logical resource, and users cannot use it for scheduling.
However, you can always override that by manually specifying the quantities of pre-defined resources and adding custom resources. There are several ways to do that depending on how you start the Ray cluster:
If you are using ray.init()
to start a single node Ray cluster, you can do the following to manually specify node resources:
# This will start a Ray node with 3 logical cpus, 4 logical gpus,
# 1 special_hardware resource and 1 custom_label resource.
ray.init(num_cpus=3, num_gpus=4, resources={"special_hardware": 1, "custom_label": 1})
If you are using ray start to start a Ray node, you can run:
ray start --head --num-cpus=3 --num-gpus=4 --resources='{"special_hardware": 1, "custom_label": 1}'
If you are using ray up to start a Ray cluster, you can set the resources field in the yaml file:
available_node_types:
head:
...
resources:
CPU: 3
GPU: 4
special_hardware: 1
custom_label: 1
If you are using KubeRay to start a Ray cluster, you can set the rayStartParams field in the yaml file:
headGroupSpec:
rayStartParams:
num-cpus: "3"
num-gpus: "4"
resources: '"{\"special_hardware\": 1, \"custom_label\": 1}"'
Specifying Task or Actor Resource Requirements#
Ray allows specifying a task or actor’s logical resource requirements (e.g., CPU, GPU, and custom resources). The task or actor will only run on a node if there are enough required logical resources available to execute the task or actor.
By default, Ray tasks use 1 logical CPU resource and Ray actors use 1 logical CPU for scheduling, and 0 logical CPU for running.
(This means, by default, actors cannot get scheduled on a zero-cpu node, but an infinite number of them can run on any non-zero cpu node.
The default resource requirements for actors was chosen for historical reasons.
It’s recommended to always explicitly set num_cpus
for actors to avoid any surprises.
If resources are specified explicitly, they are required both at schedule time and at execution time.)
You can also explicitly specify a task’s or actor’s logical resource requirements (for example, one task may require a GPU) instead of using default ones via ray.remote()
and task.options()
/actor.options()
.
# Specify the default resource requirements for this remote function.
@ray.remote(num_cpus=2, num_gpus=2, resources={"special_hardware": 1})
def func():
return 1
# You can override the default resource requirements.
func.options(num_cpus=3, num_gpus=1, resources={"special_hardware": 0}).remote()
@ray.remote(num_cpus=0, num_gpus=1)
class Actor:
pass
# You can override the default resource requirements for actors as well.
actor = Actor.options(num_cpus=1, num_gpus=0).remote()
// Specify required resources.
Ray.task(MyRayApp::myFunction).setResource("CPU", 1.0).setResource("GPU", 1.0).setResource("special_hardware", 1.0).remote();
Ray.actor(Counter::new).setResource("CPU", 2.0).setResource("GPU", 1.0).remote();
// Specify required resources.
ray::Task(MyFunction).SetResource("CPU", 1.0).SetResource("GPU", 1.0).SetResource("special_hardware", 1.0).Remote();
ray::Actor(CreateCounter).SetResource("CPU", 2.0).SetResource("GPU", 1.0).Remote();
Task and actor resource requirements have implications for the Ray’s scheduling concurrency. In particular, the sum of the logical resource requirements of all of the concurrently executing tasks and actors on a given node cannot exceed the node’s total logical resources. This property can be used to limit the number of concurrently running tasks or actors to avoid issues like OOM.
Fractional Resource Requirements#
Ray supports fractional resource requirements.
For example, if your task or actor is IO bound and has low CPU usage, you can specify fractional CPU num_cpus=0.5
or even zero CPU num_cpus=0
.
The precision of the fractional resource requirement is 0.0001 so you should avoid specifying a double that’s beyond that precision.
@ray.remote(num_cpus=0.5)
def io_bound_task():
import time
time.sleep(1)
return 2
io_bound_task.remote()
@ray.remote(num_gpus=0.5)
class IOActor:
def ping(self):
import os
print(f"CUDA_VISIBLE_DEVICES: {os.environ['CUDA_VISIBLE_DEVICES']}")
# Two actors can share the same GPU.
io_actor1 = IOActor.remote()
io_actor2 = IOActor.remote()
ray.get(io_actor1.ping.remote())
ray.get(io_actor2.ping.remote())
# Output:
# (IOActor pid=96328) CUDA_VISIBLE_DEVICES: 1
# (IOActor pid=96329) CUDA_VISIBLE_DEVICES: 1
Note
GPU, TPU, and neuron_cores resource requirements that are greater than 1, need to be whole numbers. For example, num_gpus=1.5
is invalid.
Tip
Besides resource requirements, you can also specify an environment for a task or actor to run in, which can include Python packages, local files, environment variables, and more. See Runtime Environments for details.