Host an object detection model as a service#

Ray Serve is a scalable model-serving framework that allows deploying machine learning models as microservices. This tutorial uses Ray Serve to deploy an object detection model using Faster R-CNN. The model detects whether a person is wearing a mask correctly, incorrectly, or not at all.

Anyscale-specific configuration

Note: This tutorial is optimized for the Anyscale platform. When running on open source Ray, additional configuration is required. For example, you need to manually:

  • Configure your Ray Cluster: Set up your multi-node environment, including head and worker nodes, and manage resource allocation like autoscaling and GPU/CPU assignments, without the Anyscale automation. See Ray Clusters for details.
  • Manage dependencies: Install and manage dependencies on each node because you won’t have Anyscale’s Docker-based dependency management. See Environment Dependencies for instructions on installing and updating Ray in your environment.
  • Set up storage: Configure your own distributed or shared storage system instead of relying on Anyscale’s integrated cluster storage. See Configuring Persistent Storage for suggestions on setting up shared storage solutions.

Why use Ray Serve and Anyscale#

Scalability and performance#

  • Automatic scaling: Ray Serve scales horizontally, which means your deployment can handle a growing number of requests by distributing the load across multiple machines and GPUs. This feature is particularly useful for production environments where traffic can be unpredictable.

  • Efficient resource utilization: With features like fractional GPU allocation and dynamic scheduling, Ray Serve uses resources efficiently, resulting in lower operational costs while maintaining high throughput for model inferences.

Framework-agnostic model serving#

  • Broad compatibility: Whether you’re using deep learning frameworks like PyTorch, TensorFlow, or Keras, or even traditional libraries such as Scikit-Learn, Ray Serve offers a unified platform to deploy these models.

  • Flexible API development: Beyond serving models, you can integrate any Python business logic. This capability makes composing multiple models and integrating additional services into a single inference pipeline easier.

Advanced features for modern applications#

  • Dynamic request batching: This feature allows Ray Serve to batch multiple small inference requests together, reducing per-request overhead and increasing overall efficiency; a minimal sketch follows this list.

  • Response streaming: For apps that need to return large outputs or stream data in real-time, response streaming can improve user experience and reduce latency.

  • Model composition: You can build complex, multi-step inference pipelines that integrate various models, allowing you to construct end-to-end services that combine machine learning and custom business logic.
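For illustration, the following is a minimal, hypothetical sketch of dynamic request batching using Ray Serve's @serve.batch decorator. The deployment name and placeholder logic are illustrative only and aren't part of this tutorial's object_detection.py:

from typing import List

from starlette.requests import Request
from ray import serve

@serve.deployment
class BatchedEcho:
    # Hypothetical deployment used only to illustrate @serve.batch.
    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.1)
    async def handle_batch(self, texts: List[str]) -> List[str]:
        # Ray Serve gathers up to 8 concurrent requests (or waits up to
        # 100 ms) and passes them here as one list, so a model could run
        # a single batched forward pass instead of 8 separate ones.
        return [t.upper() for t in texts]

    async def __call__(self, request: Request) -> str:
        # Each caller still sends one item; batching happens transparently.
        return await self.handle_batch(request.query_params["text"])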

Building on Ray Serve, Anyscale Services elevate this deployment by offering a fully managed platform that streamlines infrastructure management. Anyscale automatically scales resources, integrates with cloud services, and provides robust monitoring and security features. Together, Ray Serve and Anyscale Services enable you to deploy the mask detection model as a scalable, efficient, and reliable microservice in a production environment, abstracting away operational complexities while maintaining performance.

Inspect object_detection.py#

To start, inspect the file object_detection.py. This module implements a Ray Serve deployment for an object detection service using FastAPI.

The code initializes a FastAPI app and uses Ray Serve to deploy two classes, one for handling HTTP requests (APIIngress) and one for performing object detection (ObjectDetection). This separation of concerns—APIIngress for HTTP interfacing and ObjectDetection for image processing—allows for scalable, efficient handling of requests, with Ray Serve managing resource allocation and replicas.

The APIIngress class serves as the entry point for HTTP requests using FastAPI, exposing an endpoint (“/detect”) that accepts image URLs and returns processed images. When a request hits this endpoint, APIIngress asynchronously delegates the task to the ObjectDetection service by calling its detect method.

The following explains the decorators on the APIIngress class; a sketch of the class follows this list:

  • @serve.deployment(num_replicas=1): This decorator indicates that the ingress service, which primarily routes HTTP requests using FastAPI, runs as a single instance. For this example, it mainly acts as a lightweight router to forward requests to the actual detection service. A single replica is typically sufficient. To handle high traffic volume in production, increase this number.

  • @serve.ingress(app): This decorator integrates the FastAPI app with Ray Serve. It makes the API endpoints defined in the FastAPI app accessible through the deployment. Essentially, it enables serving HTTP traffic directly through this deployment.
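As a reference, the class likely looks similar to the following sketch; details such as the handler name and response handling may differ from the actual object_detection.py:

from io import BytesIO

from fastapi import FastAPI
from fastapi.responses import Response
from ray import serve

app = FastAPI()

@serve.deployment(num_replicas=1)
@serve.ingress(app)
class APIIngress:
    def __init__(self, object_detection_handle) -> None:
        # Handle to the downstream ObjectDetection deployment.
        self.handle = object_detection_handle

    @app.get("/detect", response_class=Response)
    async def detect(self, image_url: str) -> Response:
        # Delegate detection to ObjectDetection and await the annotated image.
        image = await self.handle.detect.remote(image_url)
        buffer = BytesIO()
        image.save(buffer, format="JPEG")
        return Response(content=buffer.getvalue(), media_type="image/jpeg")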

The ObjectDetection class handles the core functionality: it loads a pre-trained Faster R-CNN model, processes incoming images, runs object detection to identify mask-wearing statuses, and visually annotates the images with bounding boxes and labels.

The following explains the decorators on the ObjectDetection class; a sketch of the class follows this list:

  • ray_actor_options={"num_gpus": 1}: This configuration assigns one GPU to each replica of the ObjectDetection service. Given that the service loads a deep learning model (Faster R-CNN) for mask detection, having GPU resources is essential for accelerating inference. This parameter makes sense if your infrastructure has GPU resources available and you want each actor to leverage hardware acceleration.

  • autoscaling_config={"min_replicas": 1, "max_replicas": 10}: min_replicas: 1 ensures that at least one replica is always running, providing baseline availability. max_replicas: 10 limits the maximum number of replicas to 10, which helps control resource usage while accommodating potential spikes in traffic.
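A corresponding sketch of the ObjectDetection deployment, with hypothetical model-loading details; the actual file uses its own loading and annotation code:

import torch
from ray import serve

@serve.deployment(
    ray_actor_options={"num_gpus": 1},
    autoscaling_config={"min_replicas": 1, "max_replicas": 10},
)
class ObjectDetection:
    def __init__(self) -> None:
        # Hypothetical: load the fine-tuned Faster R-CNN weights from
        # cluster storage and move the model to the replica's GPU.
        self.model = torch.load("/mnt/cluster_storage/fasterrcnn_model_mask_detection.pth")
        self.model.to("cuda")
        self.model.eval()

    def detect(self, image_url: str):
        # Download the image, run inference, and draw bounding boxes and
        # mask-wearing labels on a copy of the image (implementation omitted).
        ...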

Then, bind the deployments, passing any optional constructor arguments, to define an app. Finally, deploy the resulting app using serve.run (or the equivalent serve run CLI command), as sketched below.
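In this app, the binding plausibly looks like the following, matching the entrypoint name that the serve run command below imports:

# Compose the two deployments: Ray Serve resolves the bound ObjectDetection
# deployment and passes a handle to it into APIIngress's constructor.
entrypoint = APIIngress.bind(ObjectDetection.bind())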

For more details, see: https://docs.ray.io/en/latest/serve/configure-serve-deployment.html

Run the object detection service with Ray Serve#

To launch the object detection service, open a terminal in an Anyscale workspace and run the following command:

!serve run object_detection:entrypoint --non-blocking
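Because --non-blocking returns immediately, you can verify that the application is up with the Ray Serve CLI:

!serve status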

Send a request to the service#

To test the deployed model, send an HTTP request to the service using Python. The following code fetches an image, sends it to the detection service, and displays the output:

import requests
from PIL import Image
from io import BytesIO
from IPython.display import display

image_url = "https://face-masks-data.s3.us-east-2.amazonaws.com/all/images/maksssksksss5.png"
# Pass the image URL through params so requests URL-encodes it correctly.
resp = requests.get("http://127.0.0.1:8000/detect", params={"image_url": image_url})

# Display the image
image = Image.open(BytesIO(resp.content))
display(image)

Shut down the service#

Use the following command to shut down the service:

!serve shutdown --yes

Production deployment#

For production deployment, use Anyscale Services to deploy the Ray Serve application to a dedicated cluster without modifying the code. Anyscale ensures scalability, fault tolerance, and load balancing, keeping the service resilient against node failures, high traffic, and rolling updates.

Deploy as an Anyscale Service#

Use the following to deploy the service in a single command:

!anyscale service deploy object_detection:entrypoint --name=face_mask_detection_service

Check the status of the service#

To check the status of the service, run the following:

!anyscale service status --name=face_mask_detection_service

Query the service#

When you deploy, Anyscale exposes the service at a publicly accessible endpoint that you can send requests to.

In the preceding cell’s output, copy the API_KEY and BASE_URL. As an example, the values look like the following:

  • API_KEY: xkRQv_4MENV7iq34gUprbQrX3NUqpk6Bv6UQpiq6Cbc

  • BASE_URL: https://face-mask-detection-service-bxauk.cld-kvedzwag2qa8i5bj.s.anyscaleuserdata.com

Fill in the BASE_URL and API_KEY placeholders in the following Python snippet:

import requests

API_KEY = "xkRQv_4MENV7iq34gUprbQrX3NUqpk6Bv6UQpiq6Cbc"  # PASTE HERE
BASE_URL = "https://face-mask-detection-service-bxauk.cld-kvedzwag2qa8i5bj.s.anyscaleuserdata.com"  # PASTE HERE; remove any trailing slash.

def detect_masks(image_url: str):
    response: requests.Response = requests.get(
        f"{BASE_URL}/detect",
        params={"image_url": image_url},
        headers={
            "Authorization": f"Bearer {API_KEY}",
        },
    )
    response.raise_for_status()
    return response  

Then you can call the service API and obtain the detection results:

from PIL import Image
from io import BytesIO
from IPython.display import display

image_url = "https://face-masks-data.s3.us-east-2.amazonaws.com/all/images/maksssksksss5.png"
resp = detect_masks(image_url)
# Display the image.
image = Image.open(BytesIO(resp.content))
display(image)

Advanced configurations#

For production environments, Anyscale recommends using a Serve config YAML file, which provides a centralized way to manage system-level settings and application-specific configurations. This approach enables seamless updates and scaling of your deployments by modifying the config file and applying changes without service interruptions. For a comprehensive guide on configuring Ray Serve deployments, see the official documentation: https://docs.ray.io/en/latest/serve/configure-serve-deployment.html
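If you don't already have a config file, the serve build CLI can generate a starting point from the same import path used above; the output filename here is arbitrary:

!serve build object_detection:entrypoint -o serve_config.yaml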

Terminate your service#

Remember to terminate your service after testing; otherwise, it keeps running:

!anyscale service terminate --name=face_mask_detection_service

Clean up the cluster storage#

List the files stored in cluster storage. You should see the file fasterrcnn_model_mask_detection.pth that you created for fast model loading and serving.

!ls -lah /mnt/cluster_storage/

Remember to clean up the cluster storage by removing the model file:

!rm -rf /mnt/cluster_storage/fasterrcnn_model_mask_detection.pth