# GPT-J-6B Serving with Ray AIR#

In this example, we will showcase how to use the Ray AIR for GPT-J serving (online inference). GPT-J is a GPT-2-like causal language model trained on the Pile dataset. This particular model has 6 billion parameters. For more information on GPT-J, click here.

We will use Ray Serve for online inference and a pretrained model from Hugging Face hub. Note that you can easily adapt this example to use other similar models.

It is highly recommended to read Ray AIR Key Concepts and Ray Serve Key Concepts before starting this example.

If you are interested in batch prediction (offline inference), see GPT-J-6B Batch Prediction with Ray AIR.

Note

In order to run this example, make sure your Ray cluster has access to at least one GPU with 16 or more GBs of memory. The amount of memory needed will depend on the model.

model_id = "EleutherAI/gpt-j-6B"
revision = "float16"  # use float16 weights to fit in 16GB GPUs
prompt = (
"In a shocking finding, scientists discovered a herd of unicorns living in a remote, "
"previously unexplored valley, in the Andes Mountains. Even more surprising to the "
"researchers was the fact that the unicorns spoke perfect English."
)

import ray


We define a runtime environment to ensure that the Ray workers have access to all the necessary packages. You can omit the runtime_env argument if you have all of the packages already installed on each node in your cluster.

ray.init(
runtime_env={
"pip": [
"accelerate>=0.16.0",
"transformers>=4.26.0",
"numpy<1.24",  # remove when mlflow updates beyond 2.2
"torch",
]
}
)


Setting up basic serving with Ray Serve is very similar to batch inference with Ray Data. First, we define a callable class that will serve as the Serve deployment. At runtime, a deployment consists of a number of replicas, which are individual copies of the class or function that are started in separate Ray Actors (processes). The number of replicas can be scaled up or down (or even autoscaled) to match the incoming request load.

We make sure to set the deployment to use 1 GPU by setting "num_gpus" in ray_actor_options. We load the model in __init__, which will allow us to save time by initializing a model just once and then use it to handle multiple requests.

Tip

If you want to use inter-node model parallelism, you can also increase num_gpus. As we have created the model with device_map="auto", it will be automatically placed on correct devices. Note that this requires nodes with multiple GPUs.

import pandas as pd

from ray import serve
from starlette.requests import Request

@serve.deployment(ray_actor_options={"num_gpus": 1})
class PredictDeployment:
def __init__(self, model_id: str, revision: str = None):
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

self.model = AutoModelForCausalLM.from_pretrained(
model_id,
revision=revision,
torch_dtype=torch.float16,
low_cpu_mem_usage=True,
device_map="auto",  # automatically makes use of all GPUs available to the Actor
)
self.tokenizer = AutoTokenizer.from_pretrained(model_id)

def generate(self, text: str) -> pd.DataFrame:
input_ids = self.tokenizer(text, return_tensors="pt").input_ids.to(
self.model.device
)

gen_tokens = self.model.generate(
input_ids,
do_sample=True,
temperature=0.9,
max_length=100,
)
return pd.DataFrame(
self.tokenizer.batch_decode(gen_tokens), columns=["responses"]
)

async def __call__(self, http_request: Request) -> str:
json_request: str = await http_request.json()
prompts = []
for prompt in json_request:
text = prompt["text"]
if isinstance(text, list):
prompts.extend(text)
else:
prompts.append(text)
return self.generate(prompts)


We can now bind the deployment with our arguments, and use run() to start it.

Note

If you were running this script outside of a Jupyter notebook, the recommended way is to use the serve run CLI command. In this case, you would remove the serve.run(deployment) line, and instead start the deployment by calling serve run FILENAME:deployment.

deployment = PredictDeployment.bind(model_id=model_id, revision=revision)
serve.run(deployment)

RayServeSyncHandle(deployment='PredictDeployment')


Let’s try submitting a request to our deployment. We will use the same prompt as before, and send a POST request. The deployment will generate a response and return it.

import requests

prompt = (
"In a shocking finding, scientists discovered a herd of unicorns living in a remote, "
"previously unexplored valley, in the Andes Mountains. Even more surprising to the "
"researchers was the fact that the unicorns spoke perfect English."
)

sample_input = {"text": prompt}

output = requests.post("http://localhost:8000/", json=[sample_input]).json()
print(output)

(ServeReplica:PredictDeployment pid=651, ip=10.0.8.161) The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
(ServeReplica:PredictDeployment pid=651, ip=10.0.8.161) Setting pad_token_id to eos_token_id:50256 for open-end generation.

[{'responses': 'In a shocking finding, scientists discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.\n\nThe findings come from a recent expedition to the region of Cordillera del Divisor, in northern Peru. The region was previously known to have an unusually high number of native animals.\n\n"Our team was conducting a population census of the region’'}]


You may notice that we are not using an AIR Predictor here. This is because Predictors are mainly intended to be used with AIR Checkpoints, which we don’t for this example. See Using Predictors for Inference for more information and usage examples.