Develop and Deploy an ML Application#

The flow for developing a Ray Serve application locally and deploying it in production covers the following steps:

  • Converting a Machine Learning model into a Ray Serve application

  • Testing the application locally

  • Building Serve config files for production deployment

  • Deploying applications using a config file

Convert a model into a Ray Serve application#

This example uses a text-translation model:

# File name: model.py
from transformers import pipeline


class Translator:
    def __init__(self):
        # Load model
        self.model = pipeline("translation_en_to_fr", model="t5-small")

    def translate(self, text: str) -> str:
        # Run inference
        model_output = self.model(text)

        # Post-process output to return only the translation text
        translation = model_output[0]["translation_text"]

        return translation


translator = Translator()

translation = translator.translate("Hello world!")
print(translation)

The Python file, called model.py, uses the Translator class to translate English text to French.

  • The self.model variable inside the Translator’s __init__ method stores a function that uses the t5-small model to translate text.

  • When self.model is called on English text, it returns the translated French text inside a list of dictionaries formatted as [{"translation_text": "..."}], as the snippet after this list shows.

  • The Translator’s translate method extracts the translated text by indexing into this output.
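
For reference, here is what the raw pipeline output looks like before the post-processing step. This is a minimal sketch that assumes the same transformers environment as model.py:

from transformers import pipeline

model = pipeline("translation_en_to_fr", model="t5-small")

# The pipeline returns a list with one dictionary per input text.
print(model("Hello world!"))  # [{'translation_text': 'Bonjour Monde!'}]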

Copy and paste the script and run it locally. It translates "Hello world!" into "Bonjour Monde!".

$ python model.py

Bonjour Monde!

Converting this model into a Ray Serve application with FastAPI requires three changes:

  1. Import Ray Serve and FastAPI dependencies

  2. Add decorators for Serve deployment with FastAPI: @serve.deployment and @serve.ingress(app)

  3. Bind the Translator deployment to the arguments that are passed into its constructor

For other HTTP options, see Set Up FastAPI and HTTP.

# File name: model.py
import ray
from ray import serve
from fastapi import FastAPI

from transformers import pipeline

app = FastAPI()


@serve.deployment(num_replicas=2, ray_actor_options={"num_cpus": 0.2, "num_gpus": 0})
@serve.ingress(app)
class Translator:
    def __init__(self):
        # Load model
        self.model = pipeline("translation_en_to_fr", model="t5-small")

    @app.post("/")
    def translate(self, text: str) -> str:
        # Run inference
        model_output = self.model(text)

        # Post-process output to return only the translation text
        translation = model_output[0]["translation_text"]

        return translation


translator_app = Translator.bind()

Note that the code configures parameters for the deployment, such as num_replicas and ray_actor_options. These parameters control the number of replicas of the deployment and the resource requirements for each replica. In this case, the deployment runs 2 replicas of the model that each request 0.2 CPUs and 0 GPUs. For a complete guide to the configurable parameters of a deployment, see Configure a Serve deployment.
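
If you want to experiment with different settings without editing the decorator, you can also override these parameters when binding the deployment. The values below are purely illustrative:

# Hypothetical override of the decorator's settings at bind time.
translator_app = Translator.options(
    num_replicas=4,
    ray_actor_options={"num_cpus": 1, "num_gpus": 0},
).bind()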

Test a Ray Serve application locally#

To test locally, run the script with the serve run CLI command. This command takes in an import path formatted as module:application. Run the command from a directory containing a local copy of the script saved as model.py, so it can import the application:

$ serve run model:translator_app

This command runs the translator_app application and then blocks, streaming logs to the console. You can kill it with Ctrl-C, which tears down the application.

Now test the model over HTTP. Reach it at the following default URL:

http://127.0.0.1:8000/

Send a POST request with the English text as a query parameter. This client script requests a translation for “Hello world!”:

# File name: model_client.py
import requests

response = requests.post("http://127.0.0.1:8000/", params={"text": "Hello world!"})
french_text = response.json()

print(french_text)
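
Equivalently, you can send the same request from the command line with curl, assuming the default host and port and URL-encoding the space in the text:

$ curl -X POST 'http://127.0.0.1:8000/?text=Hello%20world!'

"Bonjour Monde!"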

While a Ray Serve application is deployed, use the serve status CLI command to check the status of the application and deployment. For more details on the output format of serve status, see Inspect Serve in production.

$ serve status
proxies:
  a85af35da5fcea04e13375bdc7d2c83c7d3915e290f1b25643c55f3a: HEALTHY
applications:
  default:
    status: RUNNING
    message: ''
    last_deployed_time_s: 1693428451.894696
    deployments:
      Translator:
        status: HEALTHY
        replica_states:
          RUNNING: 2
        message: ''

Build Serve config files for production deployment#

To deploy Serve applications in production, you need to generate a Serve config YAML file. A Serve config file is the single source of truth for the cluster, allowing you to specify system-level configuration and your applications in one place. It also allows you to declaratively update your applications. The serve build CLI command takes the import path as input and writes the generated config to the output file specified with the -o flag. You can specify all deployment parameters in the Serve config files.

$ serve build model:translator_app -o config.yaml

The serve build command adds a default application name that can be modified. The resulting Serve config file is:

proxy_location: EveryNode

http_options:
  host: 0.0.0.0
  port: 8000

grpc_options:
  port: 9000
  grpc_servicer_functions: []

applications:

- name: app1
  route_prefix: /
  import_path: model:translator_app
  runtime_env: {}
  deployments:
  - name: Translator
    num_replicas: 2
    ray_actor_options:
      num_cpus: 0.2
      num_gpus: 0.0

You can also use the Serve config file with serve run for local testing. For example:

$ serve run config.yaml
$ serve status
proxies:
  1894261b372d34854163ac5ec88405328302eb4e46ac3a2bdcaf8d18: HEALTHY
applications:
  app1:
    status: RUNNING
    message: ''
    last_deployed_time_s: 1693430474.873806
    deployments:
      Translator:
        status: HEALTHY
        replica_states:
          RUNNING: 2
        message: ''
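
Because the config file is declarative, you can update a running application by editing the file and submitting it again. For example, to scale out you might change num_replicas in config.yaml and apply the change with serve deploy, which targets the Serve REST API of a running Ray cluster at its default address:

$ serve deploy config.yaml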

For more details, see Serve Config Files.

Deploy Ray Serve in production#

Deploy the Ray Serve application in production on Kubernetes using the KubeRay operator. Copy the Serve config file generated in the previous step directly into the Kubernetes configuration. KubeRay supports zero-downtime upgrades, status reporting, and fault tolerance for your production application. See Deploying on Kubernetes for more information. For production usage, consider implementing the recommended practice of setting up head node fault tolerance.
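
As a rough sketch of where the generated config goes, a RayService manifest embeds it as a string under the serveConfigV2 field. The resource name, cluster spec, and container image below are illustrative:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: translator-service
spec:
  # Paste the Serve config generated by serve build here.
  serveConfigV2: |
    applications:
    - name: app1
      route_prefix: /
      import_path: model:translator_app
      runtime_env: {}
      deployments:
      - name: Translator
        num_replicas: 2
        ray_actor_options:
          num_cpus: 0.2
          num_gpus: 0.0
  rayClusterConfig:
    headGroupSpec:
      rayStartParams:
        dashboard-host: "0.0.0.0"
      template:
        spec:
          containers:
          - name: ray-head
            # Hypothetical image that bundles Ray, model.py, and its dependencies.
            image: your-registry/ray-translator:latest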

Monitor Ray Serve#

Use the Ray Dashboard to get a high-level overview of your Ray Cluster and Ray Serve application’s states. The Ray Dashboard is available both during local testing and on a remote cluster in production. Ray Serve provides built-in metrics and logging as well as utilities for adding custom metrics and logs to your application. For production deployments, exporting logs and metrics to your observability platform is recommended. See Monitoring for more details.
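
As an illustration of the custom-metrics utilities, a deployment can record its own counter with ray.serve.metrics. This is a minimal sketch; the deployment, metric, and tag names are made up for this example:

from ray import serve
from ray.serve import metrics


@serve.deployment
class InstrumentedTranslator:
    def __init__(self):
        # Hypothetical counter tracking how many translation requests are served.
        self.request_counter = metrics.Counter(
            "translation_requests",
            description="Number of translation requests served.",
            tag_keys=("model",),
        )
        self.request_counter.set_default_tags({"model": "t5-small"})

    def translate(self, text: str) -> str:
        # Increment the counter on every call; model inference is omitted for brevity.
        self.request_counter.inc()
        return text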