Scaling your Gradio app with Ray Serve

In this guide, we will show you how to scale up your Gradio application using Ray Serve. Without changing the internal architecture of your Gradio app at all, we simply wrap the app in a Ray Serve deployment and scale that deployment to access more resources.


To follow this tutorial, you will need Ray Serve and Gradio. If you haven’t already, install them by running:

$ pip install "ray[serve]"
$ pip install gradio

For this tutorial, we will use Gradio apps that run text summarization and generation models and use HuggingFace’s Pipelines to access these models. Note that you can substitute this Gradio app for any Gradio app of your own!

First, let’s install the transformers module.

$ pip install transformers

Quickstart: Deploy your Gradio app with Ray Serve

This section shows you an easy way to deploy your app onto Ray Serve. First, create a new Python file named demo.py. Second, import GradioServer from Ray Serve (used below to deploy your Gradio app), along with gradio and transformers.pipeline to load the text summarization model.

from ray.serve.gradio_integrations import GradioServer

import gradio as gr

from transformers import pipeline

Then, we construct the Gradio app io. This application takes in text and uses the T5 Small text summarization model, loaded via HuggingFace’s Pipelines, to summarize that text.


Remember, you can substitute this app with your own Gradio app if you want to try scaling it up!

summarizer = pipeline("summarization", model="t5-small")

def model(text):
    summary_list = summarizer(text)
    summary = summary_list[0]["summary_text"]
    return summary

example = (
    "HOUSTON -- Men have landed and walked on the moon. "
    "Two Americans, astronauts of Apollo 11, steered their fragile "
    "four-legged lunar module safely and smoothly to the historic landing "
    "yesterday at 4:17:40 P.M., Eastern daylight time. Neil A. Armstrong, the "
    "38-year-old commander, radioed to earth and the mission control room "
    'here: "Houston, Tranquility Base here. The Eagle has landed." The '
    "first men to reach the moon -- Armstrong and his co-pilot, Col. Edwin E. "
    "Aldrin Jr. of the Air Force -- brought their ship to rest on a level, "
    "rock-strewn plain near the southwestern shore of the arid Sea of "
    "Tranquility. About six and a half hours later, Armstrong opened the "
    "landing craft's hatch, stepped slowly down the ladder and declared as "
    "he planted the first human footprint on the lunar crust: \"That's one "
    'small step for man, one giant leap for mankind." His first step on the '
    "moon came at 10:56:20 P.M., as a television camera outside the craft "
    "transmitted his every move to an awed and excited audience of hundreds "
    "of millions of people on earth."

io = gr.Interface(
    fn=model,
    inputs=[gr.inputs.Textbox(default=example, label="Input prompt")],
    outputs=[gr.outputs.Textbox(label="Model output")],
)

Deploying Gradio Server

In order to deploy your Gradio app onto Ray Serve, you need to wrap your Gradio app in a Serve deployment. GradioServer acts as that wrapper. It serves your Gradio app remotely on Ray Serve so that it can process and respond to HTTP requests.

Replicas in a deployment are copies of your program running on Ray Serve, where each replica runs in its own worker process on the Ray cluster. Adding replicas scales your deployment to serve more client requests. By wrapping your application in GradioServer, you can increase the number of replicas of your application or increase the number of CPUs and/or GPUs available to each replica.


GradioServer is simply GradioIngress but wrapped in a Serve deployment. You can use GradioServer for the simple wrap-and-deploy use case, but as you will see in the next section, you can use GradioIngress to define your own Gradio Server for more customized use cases.

Using either the example app io we created above or an existing Gradio app (of type Interface, Blocks, Parallel, etc.), wrap it in your Gradio Server.

app = GradioServer.options(num_replicas=2, ray_actor_options={"num_cpus": 4}).bind(io)
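
If your cluster has GPUs and your model can use them, you can request GPU resources per replica in the same way (a sketch; adjust num_replicas, num_cpus, and num_gpus to your cluster):

app = GradioServer.options(
    num_replicas=2, ray_actor_options={"num_cpus": 4, "num_gpus": 1}
).bind(io)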

Finally, deploy your Gradio Server! Run the following in your terminal:

$ serve run demo:app

Now you can access your Gradio app at http://localhost:8000! This is what it should look like:

[Image: Gradio Result]

See the Production Guide for more information on how to deploy your app in production.

Parallelizing models with Ray Serve

You can run multiple models in parallel with Ray Serve by utilizing its deployment graph.

Original Approach

Suppose you want to run the following program.

  1. Take two text generation models, gpt2 and EleutherAI/gpt-neo-125M.

  2. Run the two models on the same input text, such that the generated text has a minimum length of 20 and maximum length of 100.

  3. Display the outputs of both models using Gradio.

This is how you would do it normally:

generator1 = pipeline("text-generation", model="gpt2")
generator2 = pipeline("text-generation", model="EleutherAI/gpt-neo-125M")

def model1(text):
    generated_list = generator1(text, do_sample=True, min_length=20, max_length=100)
    generated = generated_list[0]["generated_text"]
    return generated

def model2(text):
    generated_list = generator2(text, do_sample=True, min_length=20, max_length=100)
    generated = generated_list[0]["generated_text"]
    return generated

demo = gr.Interface(
    lambda text: f"{model1(text)}\n------------\n{model2(text)}", "textbox", "textbox"
)

Parallelize using Ray Serve

With Ray Serve, we can parallelize the two text generation models by wrapping each model in a separate Ray Serve deployment. Deployments are defined by decorating a Python class or function with @serve.deployment, and they usually wrap the models that you want to deploy on Ray Serve to handle incoming requests.

Let’s walk through a few steps to achieve parallelism. First, let’s import our dependencies. Note that we need to import GradioIngress instead of GradioServer like before since we’re now building a customized MyGradioServer that can run models in parallel.

import ray
from ray import serve
from ray.serve.gradio_integrations import GradioIngress

import gradio as gr

from transformers import pipeline

Then, let’s wrap our gpt2 and EleutherAI/gpt-neo-125M models in a Serve deployment, named TextGenerationModel.

@serve.deployment
class TextGenerationModel:
    def __init__(self, model_name):
        self.generator = pipeline("text-generation", model=model_name)

    def __call__(self, text):
        generated_list = self.generator(
            text, do_sample=True, min_length=20, max_length=100
        )
        generated = generated_list[0]["generated_text"]
        return generated

app1 = TextGenerationModel.bind("gpt2")
app2 = TextGenerationModel.bind("EleutherAI/gpt-neo-125M")
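
Each model deployment can also be scaled independently with .options(), just like GradioServer. For example, to give every replica of each model its own GPU (a sketch; the resource numbers are assumptions to adjust for your hardware):

app1 = TextGenerationModel.options(ray_actor_options={"num_gpus": 1}).bind("gpt2")
app2 = TextGenerationModel.options(ray_actor_options={"num_gpus": 1}).bind("EleutherAI/gpt-neo-125M")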

Next, instead of simply wrapping our Gradio app in a GradioServer deployment, we can build our own MyGradioServer that reroutes the Gradio app so that it runs the TextGenerationModel deployments:

@serve.deployment
class MyGradioServer(GradioIngress):
    def __init__(self, downstream_model_1, downstream_model_2):
        self._d1 = downstream_model_1
        self._d2 = downstream_model_2

        io = gr.Interface(self.fanout, "textbox", "textbox")
        super().__init__(lambda: io)

    def fanout(self, text):
        [result1, result2] = ray.get([self._d1.remote(text), self._d2.remote(text)])
        return f"{result1}\n------------\n{result2}"

Lastly, we link everything together:

app = MyGradioServer.bind(app1, app2)


This will bind your two text generation models (wrapped in Serve deployments) to MyGradioServer._d1 and MyGradioServer._d2, forming a deployment graph. Thus, we have built our Gradio Interface io such that it calls MyGradioServer.fanout(), which simply sends requests to your two text generation models that are deployed on Ray Serve.

Now, you can run your scalable app, and the two text generation models will run in parallel on Ray Serve. Run your Gradio app with the following command:

$ serve run demo:app

Access your Gradio app at http://localhost:8000, and you should see the following interactive interface:

[Image: Gradio Result]

See the Production Guide for more information on how to deploy your app in production.