End-to-End Tutorial
By the end of this tutorial you will have learned how to deploy a machine learning model locally via Ray Serve.
First, install Ray Serve and all of its dependencies by running the following command in your terminal:
$ pip install "ray[serve]"
For this tutorial, we’ll use Hugging Face’s SummarizationPipeline to access a model that summarizes text.
Example Model
Let’s first take a look at how the model works, without using Ray Serve. This is the code for the model:
# File name: local_model.py
from transformers import pipeline


def summarize(text):
    # Load model
    summarizer = pipeline("summarization", model="t5-small")

    # Run inference
    summary_list = summarizer(text)

    # Post-process output to return only the summary text
    summary = summary_list[0]["summary_text"]

    return summary


article_text = (
    "HOUSTON -- Men have landed and walked on the moon. "
    "Two Americans, astronauts of Apollo 11, steered their fragile "
    "four-legged lunar module safely and smoothly to the historic landing "
    "yesterday at 4:17:40 P.M., Eastern daylight time. Neil A. Armstrong, the "
    "38-year-old commander, radioed to earth and the mission control room "
    'here: "Houston, Tranquility Base here. The Eagle has landed." The '
    "first men to reach the moon -- Armstrong and his co-pilot, Col. Edwin E. "
    "Aldrin Jr. of the Air Force -- brought their ship to rest on a level, "
    "rock-strewn plain near the southwestern shore of the arid Sea of "
    "Tranquility. About six and a half hours later, Armstrong opened the "
    "landing craft's hatch, stepped slowly down the ladder and declared as "
    "he planted the first human footprint on the lunar crust: \"That's one "
    'small step for man, one giant leap for mankind." His first step on the '
    "moon came at 10:56:20 P.M., as a television camera outside the craft "
    "transmitted his every move to an awed and excited audience of hundreds "
    "of millions of people on earth."
)

summary = summarize(article_text)
print(summary)
The Python file, called local_model.py, uses the summarize function to generate summaries of text.
The summarizer variable on line 7 inside summarize points to a function that uses the t5-small model to summarize text.
When summarizer is called on a Python string, it returns the summarized text inside a list of dictionaries formatted as [{"summary_text": "...", ...}, ...].
summarize then extracts the summarized text on line 13 by taking the first element of that list and reading its "summary_text" key.
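To make the output shape concrete, here is a minimal sketch that runs without downloading a model. The fake_summarizer function and its first-sentence rule are hypothetical stand-ins; only the [{"summary_text": ...}] return shape mirrors what is described above.

```python
def fake_summarizer(text):
    """Hypothetical stand-in for the pipeline object.

    It mimics only the return shape: a list holding one dictionary
    with a "summary_text" key.
    """
    first_sentence = text.split(". ")[0].rstrip(".") + "."
    return [{"summary_text": first_sentence}]


def summarize(text, summarizer=fake_summarizer):
    summary_list = summarizer(text)      # list of dictionaries
    first_result = summary_list[0]       # dictionary for the first input
    return first_result["summary_text"]  # plain string


summary = summarize("Men have landed on the moon. The Eagle has landed.")
print(summary)
```

Swapping fake_summarizer for the real pipeline object should leave the two indexing steps unchanged.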
The file can be run locally by executing the Python script, which uses the model to summarize an article about the Apollo 11 moon landing.
$ python local_model.py
"two astronauts steered their fragile lunar module safely and smoothly to the
historic landing . the first men to reach the moon -- Armstrong and his
co-pilot, col. Edwin E. Aldrin Jr. of the air force -- brought their ship to
rest on a level, rock-strewn plain ."
Keep in mind that the SummarizationPipeline is an example machine learning model for this tutorial. You can follow along using arbitrary models in any framework that has a Python API. Check out our tutorials on scikit-learn, PyTorch, and TensorFlow for more info and examples.
Converting to Ray Serve Deployment
This tutorial’s goal is to deploy this model using Ray Serve, so it can be scaled up and queried over HTTP. We’ll start by converting the above Python function into a Ray Serve deployment that can be launched locally on a laptop.
We start by opening a new Python file. First, we need to import ray and ray.serve to use features in Ray Serve such as deployments, which provide HTTP access to our model.
import ray
from ray import serve
After these imports, we can include our model code from above. We won’t call our summarize function just yet, though! We will soon add logic to handle HTTP requests, so the summarize function can operate on article text sent via HTTP request.
from transformers import pipeline


def summarize(text):
    summarizer = pipeline("summarization", model="t5-small")
    summary_list = summarizer(text)
    summary = summary_list[0]["summary_text"]
    return summary
Ray Serve needs to run on top of a Ray cluster, so we connect to a local one. See Deploying Ray Serve to learn more about starting a Ray Serve instance and deploying to a Ray cluster.
ray.init(address="auto", namespace="serve")
The address parameter in ray.init() connects your Serve script to a running local Ray cluster. Later, we’ll discuss how to start a local Ray cluster.
Note
ray.init() connects to or starts a single-node Ray cluster on your local machine, which allows you to use all your CPU cores to serve requests in parallel. To start a multi-node cluster, see Deploying Ray Serve.
Next, we start the Ray Serve runtime:
serve.start(detached=True)
Note
detached=True means Ray Serve will continue running even when the Python script exits. If you would rather stop Ray Serve after the script exits, use serve.start() instead (see Lifetime of a Ray Serve Instance for details).
Now that we have defined our summarize function, connected to a Ray cluster, and started the Ray Serve runtime, we can define a function that accepts HTTP requests and routes them to the summarize function. We define a function called router that takes in a Starlette request object:
@serve.deployment
def router(request):
    txt = request.query_params["txt"]
    return summarize(txt)
In line 1, we add the decorator @serve.deployment to the router function to turn the function into a Serve Deployment object.
In line 3, router uses the "txt" query parameter in the request to get the article text to summarize.
In line 4, it then passes this article text into the summarize function and returns the value.
Note
Lines 3 and 4 define our HTTP request schema. The HTTP requests sent to this endpoint must have a "txt" query parameter that contains a string. In general, you can accept HTTP data using query parameters or the request body. Additionally, you can add other Serve deployments with different names to create more endpoints that can accept different schemas. For more complex validation, you can also use FastAPI (see FastAPI HTTP Deployments for more info).
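As a small illustration of the schema described in this note, the sketch below checks for the "txt" parameter before using it. The extract_txt helper is hypothetical, and a plain dictionary stands in for request.query_params, which is assumed here to behave like a read-only mapping.

```python
def extract_txt(query_params):
    """Return the article text from a mapping of query parameters.

    A plain dict stands in for request.query_params in this sketch.
    """
    txt = query_params.get("txt", "")
    if not txt.strip():
        raise ValueError('missing required query parameter "txt"')
    return txt


print(extract_txt({"txt": "The Eagle has landed."}))

try:
    extract_txt({"other": "value"})
except ValueError as err:
    print("rejected:", err)
```

Inside router, a check like this could replace the direct request.query_params["txt"] lookup so that a missing parameter produces a clear message rather than a KeyError.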
Tip
This routing function’s name doesn’t have to be router. It can be any function name as long as the corresponding name is present in the HTTP request. If you want the function name to be different than the name in the HTTP request, you can add the name keyword parameter to the @serve.deployment decorator to specify the name sent in the HTTP request.
For example, if the decorator is @serve.deployment(name="responder") and the function signature is def request_manager(request), the HTTP request should use responder, not request_manager. If no name is passed into @serve.deployment, the request uses the function’s name by default. For example, if the decorator is @serve.deployment and the function’s signature is def manager(request), the HTTP request should use manager.
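The naming rule in this tip can be modeled with a toy decorator. This is not Ray Serve's implementation: the deployment decorator and the registry dictionary below are hypothetical and reproduce only the rule that an explicit name wins over the function's own name.

```python
registry = {}


def deployment(func=None, *, name=None):
    """Toy decorator that mimics only the naming rule described above."""

    def register(target):
        key = name if name is not None else target.__name__
        registry[key] = target
        return target

    # Support both @deployment and @deployment(name="...")
    return register(func) if func is not None else register


@deployment
def manager(request):
    return "handled by manager"


@deployment(name="responder")
def request_manager(request):
    return "handled by request_manager"


print(sorted(registry))  # ['manager', 'responder']
```

In this model a request addressed to responder reaches request_manager, while the name request_manager itself is never registered.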
Since @serve.deployment makes router a Deployment object, it can be deployed using router.deploy():
router.deploy()
Once we deploy router, we can query the model over HTTP. With that, we can run our model on Ray Serve!
Here’s the full Ray Serve deployment script that we built for our model:
# File name: model_on_ray_serve.py
import ray
from ray import serve
from transformers import pipeline


def summarize(text):
    summarizer = pipeline("summarization", model="t5-small")
    summary_list = summarizer(text)
    summary = summary_list[0]["summary_text"]
    return summary


ray.init(address="auto", namespace="serve")
serve.start(detached=True)


@serve.deployment
def router(request):
    txt = request.query_params["txt"]
    return summarize(txt)


router.deploy()
To deploy router, we first start a local Ray cluster:
$ ray start --head
The Ray cluster that this command launches is the same Ray cluster that the Python code connects to using ray.init(address="auto", namespace="serve"). It is also the same Ray cluster that keeps Ray Serve (and any deployments on it, such as router) alive even after the Python script exits, as long as detached=True is passed to serve.start().
Tip
To stop the Ray cluster, run the command ray stop.
After starting the Ray cluster, we can run the Python file to deploy router and begin accepting HTTP requests:
$ python model_on_ray_serve.py
Testing the Ray Serve Deployment
We can now test our model over HTTP. The structure of our HTTP query is:
http://127.0.0.1:8000/[Deployment Name]?[Parameter Name-1]=[Parameter Value-1]&[Parameter Name-2]=[Parameter Value-2]&...&[Parameter Name-n]=[Parameter Value-n]
Since the cluster is deployed locally in this tutorial, 127.0.0.1:8000 refers to localhost on port 8000. The [Deployment Name] refers to either the name of the function that we called .deploy() on (in our case, this is router), or the name keyword parameter’s value in @serve.deployment (see the Tip under the router function definition above for more info).
Each [Parameter Name] refers to a field’s name in the request’s query_params dictionary for our deployed function. In our example, the only parameter we need to pass in is txt. This parameter is referenced in the txt = request.query_params["txt"] line in the router function. Each [Parameter Name] has a corresponding [Parameter Value]. The [Parameter Value] for txt is a string containing the article text to summarize. We can chain together any number of name-value pairs using the & symbol in the request URL.
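The URL structure above can be assembled with Python's standard library. This is a minimal sketch: build_query_url is a hypothetical helper, and the lang parameter in the second call is invented purely to show how pairs are chained.

```python
from urllib.parse import urlencode


def build_query_url(deployment_name, params, host="127.0.0.1", port=8000):
    """Assemble http://host:port/<deployment name>?<name>=<value>&..."""
    # urlencode percent-encodes each value and joins the pairs with "&"
    query_string = urlencode(params)
    return f"http://{host}:{port}/{deployment_name}?{query_string}"


url = build_query_url("router", {"txt": "The Eagle has landed."})
print(url)  # http://127.0.0.1:8000/router?txt=The+Eagle+has+landed.

# Two name-value pairs; "lang" is a made-up parameter for illustration only
print(build_query_url("router", {"txt": "a b", "lang": "en"}))
```

Encoding the values this way keeps characters such as spaces, & and # in the article text from being mistaken for URL syntax.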
Now that the summarize function is deployed on Ray Serve, we can make HTTP requests to it. Here’s a client script that requests a summary from the same article as the original Python script:
# File name: router_client.py
import requests
article_text = (
"HOUSTON -- Men have landed and walked on the moon. "
"Two Americans, astronauts of Apollo 11, steered their fragile "
"four-legged lunar module safely and smoothly to the historic landing "
"yesterday at 4:17:40 P.M., Eastern daylight time. Neil A. Armstrong, the "
"38-year-old commander, radioed to earth and the mission control room "
'here: "Houston, Tranquility Base here. The Eagle has landed." The '
"first men to reach the moon -- Armstrong and his co-pilot, Col. Edwin E. "
"Aldrin Jr. of the Air Force -- brought their ship to rest on a level, "
"rock-strewn plain near the southwestern shore of the arid Sea of "
"Tranquility. About six and a half hours later, Armstrong opened the "
"landing craft's hatch, stepped slowly down the ladder and declared as "
"he planted the first human footprint on the lunar crust: \"That's one "
'small step for man, one giant leap for mankind." His first step on the '
"moon came at 10:56:20 P.M., as a television camera outside the craft "
"transmitted his every move to an awed and excited audience of hundreds "
"of millions of people on earth."
)
response = requests.get("http://127.0.0.1:8000/router?txt=" + article_text).text
print(response)
We can run this script while the model is deployed to get a response over HTTP:
$ python router_client.py
"two astronauts steered their fragile lunar module safely and smoothly to the
historic landing . the first men to reach the moon -- Armstrong and his
co-pilot, col. Edwin E. Aldrin Jr. of the air force -- brought their ship to
rest on a level, rock-strewn plain ."
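The client script builds its URL by string concatenation, which works for this article but can go wrong when the text contains characters that have meaning in a URL. The sketch below uses only the standard library to compare what a server-side query parser would recover in each case; the sample text is made up for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

text = "Armstrong & Aldrin landed"

# Concatenation leaves "&" unescaped, so it reads as a parameter separator
naive_url = "http://127.0.0.1:8000/router?txt=" + text
# urlencode escapes "&" as %26 before it goes into the query string
encoded_url = "http://127.0.0.1:8000/router?" + urlencode({"txt": text})

naive_txt = parse_qs(urlsplit(naive_url).query)["txt"][0]
encoded_txt = parse_qs(urlsplit(encoded_url).query)["txt"][0]

print(repr(naive_txt))    # text is cut off at the "&"
print(repr(encoded_txt))  # full text survives
```

With the requests library used in the client script, passing params={"txt": article_text} to requests.get applies the same kind of encoding.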
Using Classes in the Ray Serve Deployment
Our application is still a bit inefficient, though. In particular, the summarize function loads the model on each call when it sets the summarizer variable. However, the model never changes, so it would be more efficient to define summarizer only once and keep its value in memory instead of reloading it for each HTTP query.
We can achieve this by converting our summarize function into a class:
# File name: summarizer_on_ray_serve.py
import ray
from ray import serve
from transformers import pipeline

ray.init(address="auto", namespace="serve")
serve.start(detached=True)


@serve.deployment
class Summarizer:
    def __init__(self):
        self.summarize = pipeline("summarization", model="t5-small")

    def __call__(self, request):
        txt = request.query_params["txt"]
        summary_list = self.summarize(txt)
        summary = summary_list[0]["summary_text"]
        return summary


Summarizer.deploy()
In this configuration, we can query the Summarizer class directly. The Summarizer is initialized once (after calling Summarizer.deploy()). In line 13, its __init__ function loads and stores the model in self.summarize. HTTP queries for the Summarizer class are routed to its __call__ method by default, which takes in the Starlette request object. The Summarizer class can then take the request’s txt data and call the self.summarize function on it without loading the model on each query.
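The difference between the two designs can be seen without Ray or a real model. In this sketch load_model is a hypothetical stand-in for the pipeline(...) call that simply counts how often it runs, and the handlers take plain text rather than a request object.

```python
load_count = 0


def load_model():
    """Hypothetical stand-in for pipeline("summarization", ...)."""
    global load_count
    load_count += 1
    return lambda text: [{"summary_text": text[:20]}]


def summarize(text):
    summarizer = load_model()  # reloaded on every call
    return summarizer(text)[0]["summary_text"]


class Summarizer:
    def __init__(self):
        self.summarize = load_model()  # loaded once per instance

    def __call__(self, text):
        return self.summarize(text)[0]["summary_text"]


for _ in range(3):
    summarize("The Eagle has landed.")
function_loads = load_count

load_count = 0
model = Summarizer()
for _ in range(3):
    model("The Eagle has landed.")
class_loads = load_count

print(function_loads, class_loads)  # 3 1
```

Three calls to the function trigger three loads, while three calls to the class instance reuse the single load done in __init__.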
Tip
Instance variables can also store state. For example, to count the number of requests served, a @serve.deployment class can define a self.counter instance variable in its __init__ function and set it to 0. When the class is queried, it can increment the self.counter variable inside of the function responding to the query. The self.counter variable then holds a running total of the requests served.
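Here is a plain-Python sketch of the counter described in this tip. The CountingSummarizer class and its reply format are hypothetical; in a Serve script the class would carry the @serve.deployment decorator and __call__ would receive a request object instead of a string.

```python
class CountingSummarizer:
    """Plain-Python sketch of a deployment class that keeps state."""

    def __init__(self):
        self.counter = 0  # set once, when the instance is created

    def __call__(self, txt):
        self.counter += 1  # persists between calls on the same instance
        word_count = len(txt.split())
        return f"request #{self.counter}: {word_count} words received"


handler = CountingSummarizer()
handler("The Eagle has landed.")
reply = handler("Men have landed and walked on the moon.")
print(reply)  # request #2: 8 words received
```

The count lives on one instance, so if a deployment were served by several instances, each would presumably keep its own total.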
HTTP queries for the Ray Serve class deployments follow a similar format to Ray Serve function deployments. Here’s an example client script for the Summarizer class. Notice that the only difference from the router’s client script is that the URL uses the Summarizer path instead of router.
# File name: summarizer_client.py
import requests
article_text = (
"HOUSTON -- Men have landed and walked on the moon. "
"Two Americans, astronauts of Apollo 11, steered their fragile "
"four-legged lunar module safely and smoothly to the historic landing "
"yesterday at 4:17:40 P.M., Eastern daylight time. Neil A. Armstrong, the "
"38-year-old commander, radioed to earth and the mission control room "
'here: "Houston, Tranquility Base here. The Eagle has landed." The '
"first men to reach the moon -- Armstrong and his co-pilot, Col. Edwin E. "
"Aldrin Jr. of the Air Force -- brought their ship to rest on a level, "
"rock-strewn plain near the southwestern shore of the arid Sea of "
"Tranquility. About six and a half hours later, Armstrong opened the "
"landing craft's hatch, stepped slowly down the ladder and declared as "
"he planted the first human footprint on the lunar crust: \"That's one "
'small step for man, one giant leap for mankind." His first step on the '
"moon came at 10:56:20 P.M., as a television camera outside the craft "
"transmitted his every move to an awed and excited audience of hundreds "
"of millions of people on earth."
)
response = requests.get("http://127.0.0.1:8000/Summarizer?txt=" + article_text).text
print(response)
We can deploy the class-based model on Serve without stopping the Ray cluster. However, for the purposes of this tutorial, let’s restart the cluster, deploy the model, and query it over HTTP:
$ ray stop
$ ray start --head
$ python summarizer_on_ray_serve.py
$ python summarizer_client.py
"two astronauts steered their fragile lunar module safely and smoothly to the
historic landing . the first men to reach the moon -- Armstrong and his
co-pilot, col. Edwin E. Aldrin Jr. of the air force -- brought their ship to
rest on a level, rock-strewn plain ."
Adding Functionality with FastAPI
Now suppose we want to expose additional functionality in our model. In particular, the self.summarize function also accepts min_length and max_length parameters. Although we could expose these options as additional parameters in the URL, Ray Serve also allows us to add more route options to the URL itself and handle each route separately.
Because this logic can get complex, Serve integrates with FastAPI. This allows us to define a Serve deployment by adding the @serve.ingress decorator to a FastAPI app. For more info about FastAPI with Serve, please see FastAPI HTTP Deployments.
As an example of FastAPI, here’s a modified version of our Summarizer class with route options to request a minimum or maximum length of ten tokens in the summaries:
# File name: serve_with_fastapi.py
import ray
from ray import serve
from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()

ray.init(address="auto", namespace="serve")
serve.start(detached=True)


@serve.deployment
@serve.ingress(app)
class Summarizer:
    def __init__(self):
        self.summarize = pipeline("summarization", model="t5-small")

    @app.get("/")
    def get_summary(self, txt: str):
        summary_list = self.summarize(txt)
        summary = summary_list[0]["summary_text"]
        return summary

    @app.get("/min10")
    def get_summary_min10(self, txt: str):
        summary_list = self.summarize(txt, min_length=10)
        summary = summary_list[0]["summary_text"]
        return summary

    @app.get("/max10")
    def get_summary_max10(self, txt: str):
        summary_list = self.summarize(txt, max_length=10)
        summary = summary_list[0]["summary_text"]
        return summary


Summarizer.deploy()
The class now exposes three routes:
/Summarizer: As before, this route takes in article text and returns a summary.
/Summarizer/min10: This route takes in article text and returns a summary with at least 10 tokens.
/Summarizer/max10: This route takes in article text and returns a summary with at most 10 tokens.
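The three routes differ only in the keyword arguments they pass to the model. The sketch below shows that idea as a lookup table, which is an alternative layout rather than the approach used in the script above. ROUTE_OPTIONS, handle, and fake_pipeline are hypothetical, and the stand-in counts whitespace-separated words whereas the real pipeline measures length in model tokens.

```python
ROUTE_OPTIONS = {
    "/": {},
    "/min10": {"min_length": 10},
    "/max10": {"max_length": 10},
}


def fake_pipeline(text, min_length=0, max_length=None):
    """Hypothetical stand-in for the summarization pipeline.

    It honors max_length by truncating words; min_length is accepted
    but ignored, because a stand-in cannot generate extra text.
    """
    words = text.split()
    if max_length is not None:
        words = words[:max_length]
    return [{"summary_text": " ".join(words)}]


def handle(route, txt):
    options = ROUTE_OPTIONS[route]  # keyword arguments for this route
    return fake_pipeline(txt, **options)[0]["summary_text"]


sample = "one two three four five six seven eight nine ten eleven twelve"
print(handle("/max10", sample))
print(handle("/", sample))
```

Keeping one method per route, as the FastAPI version does, is more explicit; a table like this mainly helps when many routes share the same body.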
Notice that Summarizer’s methods no longer take in a Starlette request object. Instead, they take in the URL’s txt parameter directly with FastAPI’s query parameter feature.
Since we still deploy our model locally, the full URL still uses the localhost IP. This means each of our three routes comes after the http://127.0.0.1:8000 IP and port address. As an example, we can make requests to the max10 route using this client script:
# File name: fastapi_client.py
import requests
article_text = (
"HOUSTON -- Men have landed and walked on the moon. "
"Two Americans, astronauts of Apollo 11, steered their fragile "
"four-legged lunar module safely and smoothly to the historic landing "
"yesterday at 4:17:40 P.M., Eastern daylight time. Neil A. Armstrong, the "
"38-year-old commander, radioed to earth and the mission control room "
'here: "Houston, Tranquility Base here. The Eagle has landed." The '
"first men to reach the moon -- Armstrong and his co-pilot, Col. Edwin E. "
"Aldrin Jr. of the Air Force -- brought their ship to rest on a level, "
"rock-strewn plain near the southwestern shore of the arid Sea of "
"Tranquility. About six and a half hours later, Armstrong opened the "
"landing craft's hatch, stepped slowly down the ladder and declared as "
"he planted the first human footprint on the lunar crust: \"That's one "
'small step for man, one giant leap for mankind." His first step on the '
"moon came at 10:56:20 P.M., as a television camera outside the craft "
"transmitted his every move to an awed and excited audience of hundreds "
"of millions of people on earth."
)
response = requests.get(
"http://127.0.0.1:8000/Summarizer/max10?txt=" + article_text
).text
print(response)
$ ray stop
$ ray start --head
$ python serve_with_fastapi.py
$ python fastapi_client.py
"two astronauts steered their fragile lunar"
Congratulations! You just built and deployed a machine learning model on Ray Serve! You should now have enough context to dive into the Core API: Deployments to get a deeper understanding of Ray Serve.
To learn more about how to start a multi-node cluster for your Ray Serve deployments, see Deploying Ray Serve. For more interesting example applications, including integrations with popular machine learning frameworks and Python web servers, be sure to check out Advanced Tutorials.