Deploy MCP servers

This repository provides end-to-end examples for deploying and scaling Model Context Protocol (MCP) servers using Ray Serve and Anyscale services, covering both streamable HTTP and stdio transport types.
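
For orientation, the streamable HTTP pattern looks roughly like the sketch below. This isn't the notebook code: it assumes a recent `mcp` SDK (for `FastMCP` and `streamable_http_app()`) and a Ray Serve version whose `serve.ingress` accepts ASGI apps, and the tool is a placeholder.

```python
# Minimal sketch: serve a FastMCP server over streamable HTTP with Ray Serve.
from mcp.server.fastmcp import FastMCP
from ray import serve

# stateless_http=True lets any replica handle any request.
mcp = FastMCP("demo", stateless_http=True)

@mcp.tool()
def add(a: int, b: int) -> int:
    """Add two integers."""
    return a + b

@serve.deployment
@serve.ingress(mcp.streamable_http_app())  # mount the MCP ASGI app
class MCPServer:
    pass

app = MCPServer.bind()
# Local test: serve run my_module:app
```

Any streamable HTTP MCP client can then connect through the Serve HTTP endpoint (the MCP app is typically mounted under /mcp).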

Why Ray Serve for MCP

  • Autoscaling: Dynamically adjusts replica count to match traffic peaks and maintain responsiveness (see the sketch after this list)

  • Load balancing: Intelligently distributes incoming requests across all replicas for steady throughput

  • Observability: Exposes real‑time metrics on request rates, resource usage, and system health

  • Fault tolerance: Detects failures, restarts components, and reroutes traffic to healthy replicas for continuous availability

  • Composition: Chains deployments—pre‑process, infer, post‑process, and custom logic—into a single seamless pipeline
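
Autoscaling and per-replica load limits are ordinary deployment options. A minimal sketch with illustrative numbers (the field names follow recent Ray Serve releases; older versions use `target_num_ongoing_requests_per_replica`):

```python
from ray import serve

@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 10,
        # Add replicas when average in-flight requests per replica exceed this.
        "target_ongoing_requests": 5,
    },
    max_ongoing_requests=10,  # hard per-replica cap; Serve balances load beneath it
)
class AutoscalingMCPServer:
    ...
```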

Anyscale service benefits

  • Production ready: Enterprise‑grade infrastructure management and automated deployments for real‑world MCP traffic

  • High availability: Availability‑Zone‑aware scheduling and zero‑downtime rolling updates to maximize uptime

  • Logging and tracing: Comprehensive logs, distributed tracing, and real‑time dashboards for end‑to‑end observability

  • Head node fault tolerance: Managed head‑node redundancy to eliminate single points of failure in your Ray cluster coordination layer

Prerequisites

  • Ray Serve, which is included in the base Docker image

  • Podman, to deploy MCP tools with existing Docker images for notebooks 3 through 5 (see the stdio sketch after this list)

  • A Brave API key set in your environment (BRAVE_API_KEY) for notebooks 3 and 4

  • The MCP Python library
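
For the stdio transport, each Serve replica launches the containerized MCP server as a subprocess and bridges HTTP requests to it. A rough sketch of that shape, where the route, image name, and per-request session handling are illustrative rather than what the notebooks do:

```python
# Sketch: bridge HTTP requests to a stdio MCP server running under Podman.
import os

from fastapi import FastAPI
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client
from pydantic import BaseModel
from ray import serve

fastapi_app = FastAPI()

class ToolCall(BaseModel):
    name: str
    arguments: dict = {}

@serve.deployment(num_replicas=2)
@serve.ingress(fastapi_app)
class BraveSearchMCP:
    def __init__(self):
        # Command that runs the stdio MCP server; the image name is hypothetical.
        self._params = StdioServerParameters(
            command="podman",
            args=[
                "run", "-i", "--rm",
                "-e", f"BRAVE_API_KEY={os.environ['BRAVE_API_KEY']}",
                "docker.io/mcp/brave-search",
            ],
        )

    @fastapi_app.post("/call_tool")
    async def call_tool(self, req: ToolCall) -> dict:
        # One stdio session per request keeps the sketch simple; a real
        # deployment would reuse a long-lived session per replica.
        async with stdio_client(self._params) as (read, write):
            async with ClientSession(read, write) as session:
                await session.initialize()
                result = await session.call_tool(req.name, arguments=req.arguments)
                return result.model_dump()

app = BraveSearchMCP.bind()
```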

Development

You can run this example on your own Ray cluster or on Anyscale Workspaces, which enable development without worrying about infrastructure, much like working on a laptop. Workspaces come with:

  • Development tools: Spin up a remote session from your local IDE (Cursor, VS Code, etc.) and start coding, using the tools you’re familiar with combined with the power of Anyscale’s compute.

  • Dependencies: Continue to install dependencies using familiar tools like pip. Anyscale propagates all dependencies to your cluster.

  • Compute: Leverage reserved instance capacity and spot instances from the compute provider of your choice by deploying Anyscale into your account. Alternatively, use the Anyscale cloud for a fully serverless experience.

  • Debugging: Leverage a distributed debugger to get the same VS Code-like debugging experience you're used to locally.

Learn more about Anyscale Workspaces in the official documentation.

Note: Run the entire tutorial for free on Anyscale—all dependencies come pre-installed, and compute autoscales automatically. To run it elsewhere, install the dependencies from the provided Dockerfiles and provision the appropriate resources.

Production

Seamlessly integrate with your existing CI/CD pipelines by leveraging the Anyscale CLI or SDK to deploy highly available services and run reliable batch jobs. Developing in an environment nearly identical to production—a multi-node cluster—drastically accelerates the dev-to-prod transition. This tutorial also introduces proprietary RayTurbo features that optimize workloads for performance, fault tolerance, scale, and observability.
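
As one example of the CI/CD hookup, a deploy step can call the Anyscale Python SDK directly. A sketch assuming a recent `anyscale` SDK, with the service name and import path as placeholders:

```python
# Sketch of a CI/CD deploy step using the Anyscale Python SDK.
import anyscale
from anyscale.service.models import ServiceConfig

anyscale.service.deploy(
    ServiceConfig(
        name="mcp-server-service",  # placeholder service name
        applications=[{"import_path": "server:app"}],  # the Serve app to deploy
        working_dir=".",  # ship the current repo to the cluster
    )
)
```

The `anyscale service` CLI covers the same flow for pipelines that prefer shell steps, and redeploying to the same service name performs a zero-downtime rolling update.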

No infrastructure headaches

Abstract away infrastructure from your ML/AI developers so they can focus on core ML development. You can also better manage compute resources and costs with enterprise governance, observability, and admin capabilities: set resource quotas, prioritize workloads, and gain visibility into utilization across your entire compute fleet. If you're running on a Kubernetes cloud (EKS, GKE, etc.), you can still access the proprietary RayTurbo optimizations demonstrated in this tutorial by deploying the Anyscale Kubernetes operator.