Develop and Deploy an ML Application#
The flow for developing a Ray Serve application locally and deploying it in production covers the following steps:
Converting a Machine Learning model into a Ray Serve application
Testing the application locally
Building Serve config files for production deployment
Deploying applications using a config file
Convert a model into a Ray Serve application#
This example uses a text-translation model:
# File name: model.py
from transformers import pipeline


class Translator:
    def __init__(self):
        # Load model
        self.model = pipeline("translation_en_to_fr", model="t5-small")

    def translate(self, text: str) -> str:
        # Run inference
        model_output = self.model(text)
        # Post-process output to return only the translation text
        translation = model_output[0]["translation_text"]
        return translation


translator = Translator()
translation = translator.translate("Hello world!")
print(translation)
The Python file, called model.py, uses the Translator class to translate English text to French.
The self.model variable inside the Translator’s __init__ method stores a function that uses the t5-small model to translate text.
When self.model is called on English text, it returns translated French text inside a dictionary formatted as [{"translation_text": "..."}].
The Translator’s translate method extracts the translated text by indexing into the dictionary.
Copy and paste the script and run it locally. It translates "Hello world!" into "Bonjour Monde!".
$ python model.py
Bonjour Monde!
Converting this model into a Ray Serve application with FastAPI requires three changes:
Import Ray Serve and FastAPI dependencies
Add decorators for Serve deployment with FastAPI: @serve.deployment and @serve.ingress(app)
Bind the Translator deployment to the arguments that are passed into its constructor
For other HTTP options, see Set Up FastAPI and HTTP.
import ray
from ray import serve
from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()


@serve.deployment(num_replicas=2, ray_actor_options={"num_cpus": 0.2, "num_gpus": 0})
@serve.ingress(app)
class Translator:
    def __init__(self):
        # Load model
        self.model = pipeline("translation_en_to_fr", model="t5-small")

    @app.post("/")
    def translate(self, text: str) -> str:
        # Run inference
        model_output = self.model(text)
        # Post-process output to return only the translation text
        translation = model_output[0]["translation_text"]
        return translation


translator_app = Translator.bind()
Note that the code configures parameters for the deployment, such as num_replicas and ray_actor_options. These parameters control the number of copies of the deployment and the resource requirements for each copy. In this case, we set up 2 replicas of the model that use 0.2 CPUs and 0 GPUs each. For a complete guide to the configurable parameters of a deployment, see Configure a Serve deployment.
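If you want to change these parameters without editing the decorator, a deployment's options() method lets you override them when you bind the application. The following is a minimal sketch, not part of the original example, assuming the Serve script above is saved as model.py:

# A sketch: override deployment parameters at bind time instead of in the decorator.
# Assumes the Serve script above is saved as model.py.
from model import Translator

translator_app = Translator.options(
    num_replicas=4,
    ray_actor_options={"num_cpus": 0.5, "num_gpus": 0},
).bind()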
Test a Ray Serve application locally#
To test locally, run the script with the serve run CLI command. This command takes in an import path formatted as module:application. Run the command from a directory containing a local copy of the script saved as model.py, so it can import the application:
$ serve run model:translator_app
This command runs the translator_app application and then blocks, streaming logs to the console. You can kill it with Ctrl-C, which tears down the application.
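As an alternative to the CLI, you can also deploy and tear down the application from Python with serve.run and serve.shutdown. This is a small sketch; the file name is illustrative and it assumes the Serve script above is saved as model.py:

# File name: run_local.py -- a sketch of running the app from Python instead of `serve run`.
from ray import serve

from model import translator_app  # assumes the Serve script above is saved as model.py

serve.run(translator_app)  # deploys the application and returns once it's running
input("Serving at http://127.0.0.1:8000/ -- press Enter to shut down...")
serve.shutdown()  # tears down the application, like Ctrl-C with `serve run`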
Now test the model over HTTP. Reach it at the following default URL:
http://127.0.0.1:8000/
Send a POST request with the English text passed as a query parameter. This client script requests a translation for “Hello world!”:
# File name: model_client.py
import requests
response = requests.post("http://127.0.0.1:8000/", params={"text": "Hello world!"})
french_text = response.json()
print(french_text)
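For a slightly more defensive client, you can check the HTTP status and translate several texts in one run. This is an optional extension of the script above; the file name is hypothetical:

# File name: model_client_batch.py -- optional extension of the client above.
import requests

for text in ["Hello world!", "How are you?"]:
    response = requests.post("http://127.0.0.1:8000/", params={"text": text})
    response.raise_for_status()  # fail loudly on non-2xx responses
    print(f"{text} -> {response.json()}")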
While a Ray Serve application is running, use the serve status CLI command to check the status of the application and its deployments. For more details on the output format of serve status, see Inspect Serve in production.
$ serve status
proxies:
  a85af35da5fcea04e13375bdc7d2c83c7d3915e290f1b25643c55f3a: HEALTHY
applications:
  default:
    status: RUNNING
    message: ''
    last_deployed_time_s: 1693428451.894696
    deployments:
      Translator:
        status: HEALTHY
        replica_states:
          RUNNING: 2
        message: ''
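Recent Ray releases also expose this information in Python through serve.status(). The following is a hedged sketch, assuming serve.status() is available in your Ray version and that it runs in a process connected to the running Serve cluster (for example, the same script that called serve.run):

# A sketch: inspect application status from Python instead of the CLI.
# Assumes a process connected to the Serve cluster (e.g. after serve.run).
from ray import serve

status = serve.status()
for name, application in status.applications.items():
    print(name, application.status)  # expect RUNNING once the app is up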
Build Serve config files for production deployment#
To deploy Serve applications in production, you need to generate a Serve config YAML file. A Serve config file is the single source of truth for the cluster, allowing you to specify system-level configuration and your applications in one place. It also allows you to declaratively update your applications. The serve build CLI command takes the import path as input and writes the config to the output file specified with the -o flag. You can specify all deployment parameters in the Serve config files.
$ serve build model:translator_app -o config.yaml
The serve build command adds a default application name that can be modified. The resulting Serve config file is:
proxy_location: EveryNode

http_options:
  host: 0.0.0.0
  port: 8000

grpc_options:
  port: 9000
  grpc_servicer_functions: []

applications:
- name: app1
  route_prefix: /
  import_path: model:translator_app
  runtime_env: {}
  deployments:
  - name: Translator
    num_replicas: 2
    ray_actor_options:
      num_cpus: 0.2
      num_gpus: 0.0
You can also use the Serve config file with serve run for local testing. For example:
$ serve run config.yaml
$ serve status
proxies:
1894261b372d34854163ac5ec88405328302eb4e46ac3a2bdcaf8d18: HEALTHY
applications:
app1:
status: RUNNING
message: ''
last_deployed_time_s: 1693430474.873806
deployments:
Translator:
status: HEALTHY
replica_states:
RUNNING: 2
message: ''
For more details, see Serve Config Files.
Deploy Ray Serve in production#
Deploy the Ray Serve application in production on Kubernetes using the KubeRay operator. Copy the Serve config file generated in the previous step directly into the Kubernetes configuration. KubeRay supports zero-downtime upgrades, status reporting, and fault tolerance for your production application. See Deploying on Kubernetes for more information. For production usage, also consider the recommended practice of setting up head node fault tolerance.
Monitor Ray Serve#
Use the Ray Dashboard to get a high-level overview of your Ray Cluster and Ray Serve application’s states. The Ray Dashboard is available both during local testing and on a remote cluster in production. Ray Serve provides some built-in metrics and logging, as well as utilities for adding custom metrics and logs in your application. For production deployments, exporting logs and metrics to your observability platforms is recommended. See Monitoring for more details.
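For example, Ray Serve ships a metrics utility that you can use inside a deployment to emit custom counters, which show up alongside the built-in metrics. The snippet below is a minimal sketch; the deployment and metric names are illustrative and not part of the translation example:

# A sketch of a custom metric inside a deployment, using ray.serve.metrics.
from ray import serve
from ray.serve import metrics


@serve.deployment
class MyDeployment:
    def __init__(self):
        # The counter is exported with Serve's other metrics (e.g. to Prometheus).
        self.request_counter = metrics.Counter(
            "my_request_counter",
            description="Total requests handled by this deployment.",
            tag_keys=("route",),
        )

    def __call__(self, request) -> str:
        self.request_counter.inc(tags={"route": "/"})
        return "ok"


my_app = MyDeployment.bind()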