Deploy custom Model Context Protocol (MCP) servers


This tutorial demonstrates how to build and deploy custom Model Context Protocol (MCP) servers using Ray Serve in both HTTP streaming and stdio modes. MCP standardizes how applications expose tools and context to models, and serving it with Ray Serve enables scalable, dynamic, and multi-tenant deployments by decoupling tool routing from application logic.
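To make the two modes concrete, here is a minimal sketch of a custom MCP server built with the FastMCP class from the mcp Python SDK. The server name, the add tool, and the "streamable-http" transport string are illustrative assumptions that require a recent SDK version; the servers built in the notebooks are more involved:

# minimal_mcp_server.py -- an illustrative sketch, not the tutorial's exact code.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-tools")  # hypothetical server name

@mcp.tool()
def add(a: int, b: int) -> int:
    """Add two integers (placeholder tool so the server exposes something)."""
    return a + b

if __name__ == "__main__":
    # Use transport="stdio" when a client launches this server as a subprocess;
    # use "streamable-http" to serve MCP over HTTP streaming.
    mcp.run(transport="streamable-http")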

Prerequisites

  • Ray Serve, which is included in the base Docker image

  • Podman, to deploy MCP tools with existing Docker images for notebooks 3 through 5

  • A Brave API key set in your environment (BRAVE_API_KEY)

  • The MCP Python library (mcp)

Setting the API key

Before running notebooks 3 and 4, you must set your Brave API key:

export BRAVE_API_KEY=your-api-key
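As a quick smoke test of the stdio mode, the sketch below launches the Brave search MCP server under Podman as a subprocess and lists its tools with the MCP Python client. The container image name (docker.io/mcp/brave-search) and the way the key is forwarded are assumptions; adjust them to the image you actually use:

# stdio_smoke_test.py -- hedged sketch; the container image is an assumption.
import asyncio
import os

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

params = StdioServerParameters(
    command="podman",
    # Forward the API key into the container; image name is an assumption.
    args=["run", "-i", "--rm",
          "-e", f"BRAVE_API_KEY={os.environ['BRAVE_API_KEY']}",
          "docker.io/mcp/brave-search"],
)

async def main():
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

asyncio.run(main())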

Development

You can run this example on your own Ray cluster or on Anyscale workspaces, which let you develop without worrying about infrastructure, just as if you were working on a laptop. Workspaces come with:

  • Development tools: Spin up a remote session from your local IDE (Cursor, VS Code, etc.) and start coding, using the tools you’re familiar with combined with the power of Anyscale’s compute.

  • Dependencies: Continue to install dependencies using familiar tools like pip. Anyscale propagates all dependencies to your cluster.

pip install -q "ray[serve]" "fastapi" "httpx" "uvicorn" "aiohttp" "tqdm"

  • Compute: Leverage reserved or spot instance capacity from any compute provider of your choice by deploying Anyscale into your account. Alternatively, use the Anyscale cloud for a fully serverless experience.

    • Under the hood, a cluster spins up and is efficiently managed by Anyscale.

  • Debugging: Leverage the distributed debugger to get a VS Code-like debugging experience on your remote cluster.

Learn more about Anyscale Workspaces in the official documentation.

Note: Run the entire tutorial for free on Anyscale, where all dependencies come pre-installed and compute autoscales automatically. To run it elsewhere, install the dependencies from the provided Dockerfiles and provision the appropriate resources.

Production

Seamlessly integrate with your existing CI/CD pipelines by leveraging the Anyscale CLI or SDK to deploy highly available services and run reliable batch jobs. Developing in an environment nearly identical to production—a multi-node cluster—drastically accelerates the dev-to-prod transition. This tutorial also introduces proprietary RayTurbo features that optimize workloads for performance, fault tolerance, scale, and observability.
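For example, a CI step might deploy the Serve application as an Anyscale service with a single CLI call. This is a hedged sketch: the import path and service name are placeholders, and flags can vary across CLI versions, so check anyscale service deploy --help for your installation:

# Deploy the Serve app as an Anyscale service (names and paths are placeholders).
anyscale service deploy my_mcp_app:mcp_app --name mcp-server-service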

No infrastructure headaches

Abstract infrastructure away from your ML/AI developers so they can focus on core ML development. Enterprise governance, observability, and admin capabilities also help you manage compute resources and costs: set resource quotas, prioritize different workloads, and track utilization across your entire compute fleet. If you're running on a Kubernetes cloud (EKS, GKE, and so on), you can still access the proprietary RayTurbo optimizations demonstrated in this tutorial by deploying the Anyscale Kubernetes operator.