Resource Allocation#
This guide helps you configure Ray Serve to:
Scale your deployments horizontally by specifying a number of replicas
Scale up and down automatically to react to changing traffic
Allocate hardware resources (CPUs, GPUs, other accelerators, etc.) for each deployment
Resource management (CPUs, GPUs, accelerators)#
You may want to specify a deployment's resource requirements to reserve cluster resources like GPUs or other accelerators. To assign hardware resources per replica, you can pass resource requirements to ray_actor_options.
By default, each replica reserves one CPU.
To learn about options to pass in, take a look at the Resources with Actors guide.
For example, to create a deployment where each replica uses a single GPU, you can do the following:
@serve.deployment(ray_actor_options={"num_gpus": 1})
def func(*args):
    return do_something_with_my_gpu()
Or if you want to create a deployment where each replica uses another type of accelerator such as an HPU, follow the example below:
@serve.deployment(ray_actor_options={"resources": {"HPU": 1}})
def func(*args):
    return do_something_with_my_hpu()
Fractional CPUs and fractional GPUs#
Suppose you have two models and each doesn't fully saturate a GPU. You might want to have them share a GPU by allocating 0.5 GPUs each. To do this, the resources specified in ray_actor_options can be fractional:
@serve.deployment(ray_actor_options={"num_gpus": 0.5})
def func_1(*args):
    return do_something_with_my_gpu()

@serve.deployment(ray_actor_options={"num_gpus": 0.5})
def func_2(*args):
    return do_something_with_my_gpu()
In this example, each replica of each deployment is allocated 0.5 GPUs. The same can be done to multiplex over CPUs, using "num_cpus".
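For example, here's a minimal sketch of two lightweight deployments sharing a single CPU core by reserving 0.5 CPUs each (the function names and bodies below are placeholders):
@serve.deployment(ray_actor_options={"num_cpus": 0.5})
def light_func_1(*args):
    # Placeholder for CPU-light work that doesn't need a full core.
    return do_something_lightweight()

@serve.deployment(ray_actor_options={"num_cpus": 0.5})
def light_func_2(*args):
    return do_something_lightweight()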
Custom resources, accelerator types, and more#
You can also specify custom resources in ray_actor_options, for example to ensure that a deployment is scheduled on a specific node. If a deployment requires 2 units of the "custom_resource" resource, you can specify it like this:
@serve.deployment(ray_actor_options={"resources": {"custom_resource": 2}})
def func(*args):
    return do_something_with_my_custom_resource()
You can also specify accelerator types via the accelerator_type parameter in ray_actor_options.
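For example, here's a hedged sketch that targets nodes with an NVIDIA Tesla V100 GPU; the NVIDIA_TESLA_V100 constant comes from ray.util.accelerators, and you should check the Ray Core accelerator types reference for the values available on your cluster:
from ray.util.accelerators import NVIDIA_TESLA_V100

@serve.deployment(
    ray_actor_options={"num_gpus": 1, "accelerator_type": NVIDIA_TESLA_V100}
)
def func(*args):
    # Replicas are scheduled only on nodes with the requested accelerator type.
    return do_something_with_my_gpu()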
Below is the full list of supported options in ray_actor_options; please see the relevant Ray Core documentation for more details about each option:
accelerator_type
memory
num_cpus
num_gpus
object_store_memory
resources
runtime_env
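As an illustration, here's a sketch combining several of these options on a single deployment; the specific values are placeholders, not recommendations:
@serve.deployment(
    ray_actor_options={
        "num_cpus": 2,
        "memory": 2 * 1024 * 1024 * 1024,  # request 2 GiB of memory (in bytes)
        "runtime_env": {"pip": ["torch"]},  # per-replica runtime environment
    }
)
def func(*args):
    return do_something()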
Configuring parallelism with OMP_NUM_THREADS#
Deep learning frameworks like PyTorch and TensorFlow often use multithreading when performing inference. The number of CPUs they use is controlled by the OMP_NUM_THREADS environment variable. Ray sets OMP_NUM_THREADS=<num_cpus> by default. If num_cpus is not specified on a task or actor, Ray sets OMP_NUM_THREADS=1 to reduce contention between actors and tasks that each run in a single thread.
If you do want to enable this parallelism in your Serve deployment, set num_cpus (recommended) to the desired value, or manually set the OMP_NUM_THREADS environment variable when starting Ray or in your function or class definition.
OMP_NUM_THREADS=12 ray start --head
OMP_NUM_THREADS=12 ray start --address=$HEAD_NODE_ADDRESS
import os

from ray import serve

@serve.deployment
class MyDeployment:
    def __init__(self, parallelism: str):
        os.environ["OMP_NUM_THREADS"] = parallelism
        # Download model weights, initialize model, etc.

    def __call__(self):
        pass

serve.run(MyDeployment.bind("12"))
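If you prefer the recommended num_cpus route instead of setting the environment variable yourself, a sketch like the following reserves 12 CPUs per replica, and Ray sets OMP_NUM_THREADS to match (the value 12 and the class name are just examples):
@serve.deployment(ray_actor_options={"num_cpus": 12})
class MyParallelDeployment:
    def __call__(self):
        # OMP_NUM_THREADS is set to 12 for this replica by default.
        pass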
Note
Some other libraries may not respect OMP_NUM_THREADS and have their own way to configure parallelism. For example, if you're using OpenCV, you'll need to manually set the number of threads using cv2.setNumThreads(num_threads) (set to 0 to disable multi-threading). You can check the configuration using cv2.getNumThreads() and cv2.getNumberOfCPUs().
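For instance, here's a small sketch (the deployment name and thread count are arbitrary) that caps OpenCV's thread pool inside a replica's constructor:
import cv2

from ray import serve

@serve.deployment(ray_actor_options={"num_cpus": 1})
class OpenCVDeployment:
    def __init__(self):
        # OpenCV uses its own threading configuration, so cap it explicitly;
        # pass 0 to disable OpenCV multi-threading entirely.
        cv2.setNumThreads(1)

    def __call__(self):
        return cv2.getNumThreads(), cv2.getNumberOfCPUs()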