The recommended way to run Ray Serve in production is on Kubernetes using the KubeRay RayService custom resource. The RayService custom resource automatically handles important production requirements such as health checking, status reporting, failure recovery, and upgrades. If you’re not running on Kubernetes, you can also run Ray Serve on a Ray cluster directly using the Serve CLI.
This section will walk you through a quickstart of how to generate a Serve config file and deploy it using the Serve CLI. For more details, you can check out the other pages in the production guide:
Understand the Serve config file format.
Understand how to deploy on Kubernetes using KubeRay.
Understand how to monitor running Serve applications.
For deploying on VMs instead of Kubernetes, see Deploy on VM.
Working example: Text summarization and translation application#
Throughout the production guide, we will use the following Serve application as a working example. The application takes in a string of text in English, then summarizes and translates it into French (default), German, or Romanian.
from starlette.requests import Request from typing import Dict import ray from ray import serve from ray.serve.handle import RayServeHandle from transformers import pipeline @serve.deployment class Translator: def __init__(self): self.language = "french" self.model = pipeline("translation_en_to_fr", model="t5-small") def translate(self, text: str) -> str: model_output = self.model(text) translation = model_output["translation_text"] return translation def reconfigure(self, config: Dict): self.language = config.get("language", "french") if self.language.lower() == "french": self.model = pipeline("translation_en_to_fr", model="t5-small") elif self.language.lower() == "german": self.model = pipeline("translation_en_to_de", model="t5-small") elif self.language.lower() == "romanian": self.model = pipeline("translation_en_to_ro", model="t5-small") else: pass @serve.deployment class Summarizer: def __init__(self, translator: RayServeHandle): # Load model self.model = pipeline("summarization", model="t5-small") self.translator = translator self.min_length = 5 self.max_length = 15 def summarize(self, text: str) -> str: # Run inference model_output = self.model( text, min_length=self.min_length, max_length=self.max_length ) # Post-process output to return only the summary text summary = model_output["summary_text"] return summary async def __call__(self, http_request: Request) -> str: english_text: str = await http_request.json() summary = self.summarize(english_text) translation_ref = await self.translator.translate.remote(summary) translation = await translation_ref return translation def reconfigure(self, config: Dict): self.min_length = config.get("min_length", 5) self.max_length = config.get("max_length", 15) app = Summarizer.bind(Translator.bind())
Save this code locally in
In development, we would likely use the
serve run command to iteratively run, develop, and repeat (see the Development Workflow for more information).
When we’re ready to go to production, we will generate a structured config file that acts as the single source of truth for the application.
This config file can be generated using
$ serve build text_ml:app -o serve_config.yaml
The generated version of this file contains an
runtime_env, and configuration options for each deployment in the application.
The application needs the
transformers packages, so modify the
runtime_env field of the generated config to include these two pip packages. Save this config locally in
proxy_location: EveryNode http_options: host: 0.0.0.0 port: 8000 applications: - name: default route_prefix: / import_path: text_ml:app runtime_env: pip: - torch - transformers deployments: - name: Translator num_replicas: 1 user_config: language: french - name: Summarizer num_replicas: 1
You can use
serve deploy to deploy the application to a local Ray cluster and
serve status to get the status at runtime:
# Start a local Ray cluster. ray start --head # Deploy the Text ML application to the local Ray cluster. serve deploy serve_config.yaml 2022-08-16 12:51:22,043 SUCC scripts.py:180 -- Sent deploy request successfully! * Use `serve status` to check deployments' statuses. * Use `serve config` to see the running app's config. $ serve status proxies: cef533a072b0f03bf92a6b98cb4eb9153b7b7c7b7f15954feb2f38ec: HEALTHY applications: default: status: RUNNING message: '' last_deployed_time_s: 1694041157.2211847 deployments: Translator: status: HEALTHY replica_states: RUNNING: 1 message: '' Summarizer: status: HEALTHY replica_states: RUNNING: 1 message: ''
Test the application using Python
import requests english_text = ( "It was the best of times, it was the worst of times, it was the age " "of wisdom, it was the age of foolishness, it was the epoch of belief" ) response = requests.post("http://127.0.0.1:8000/", json=english_text) french_text = response.text print(french_text) # 'c'était le meilleur des temps, c'était le pire des temps .'
To update the application, modify the config file and use
serve deploy again.
For a deeper dive into how to deploy, update, and monitor Serve applications, see the following pages: