Manual Cluster Setup


If you’re using AWS, Azure, or GCP, you should use the automated setup commands instead.

The instructions in this document work well for small clusters. For larger clusters, consider using the pssh package (install it with sudo apt-get install pssh) or the setup commands for private clusters.

Deploying Ray on a Cluster

This section assumes that you have a cluster running and that the nodes in the cluster can communicate with each other. It also assumes that Ray is installed on each machine. To install Ray, follow the installation instructions.

Starting Ray on each machine

On the head node (any node can be chosen as the head node), run the following. If the --port argument is omitted, Ray will choose port 6379, falling back to a random port if 6379 is unavailable.

ray start --head --port=6379

The command will print out the address of the Redis server that was started (and some other address information).

Then, on each of the other nodes, run the following. Make sure to replace <address> with the value printed by the command on the head node (it should be the head node's IP address followed by the port).

ray start --address=<address>

If you wish to specify that a machine has 10 CPUs and 1 GPU, you can do this with the flags --num-cpus=10 and --num-gpus=1. See the Configuration page for more information.
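When the same command has to be assembled for several machines, it can help to generate it rather than type it. The following is a minimal sketch, not part of Ray itself: ray_start_command is a hypothetical helper that only combines the flags mentioned above (--head, --port, --address, --num-cpus, --num-gpus) into a command string.

```python
def ray_start_command(*, address=None, port=None, num_cpus=None, num_gpus=None):
    """Build a `ray start` command line (hypothetical helper, not a Ray API).

    With address=None the command starts a head node; otherwise it joins
    the cluster whose head node printed `address`.
    """
    args = ["ray", "start"]
    if address is None:
        # Head node: --port is optional, Ray picks one if it is omitted.
        args.append("--head")
        if port is not None:
            args.append(f"--port={port}")
    else:
        if port is not None:
            raise ValueError("pass the port as part of address on worker nodes")
        args.append(f"--address={address}")
    # Resource flags are only added when the caller sets them explicitly.
    if num_cpus is not None:
        args.append(f"--num-cpus={num_cpus}")
    if num_gpus is not None:
        args.append(f"--num-gpus={num_gpus}")
    return " ".join(args)
```

For example, ray_start_command(port=6379) yields the head node command shown earlier, and passing address, num_cpus=10 and num_gpus=1 yields the worker command with both resource flags appended.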

Now we’ve started all of the Ray processes on each node. This includes:

  • Some worker processes on each machine.

  • An object store on each machine.

  • A raylet on each machine.

  • Multiple Redis servers (on the head node).

To run some commands, start up Python on one of the nodes in the cluster and connect to the running Ray instance as follows.

import ray
ray.init(address="auto")

Now you can define remote functions and execute tasks. For example, to verify that the correct number of nodes have joined the cluster, you can run the following.

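import time

@ray.remote
def f():
    # Sleep briefly so the tasks are spread over the nodes in the cluster.
    time.sleep(0.01)
    return ray.util.get_node_ip_address()

# Get a list of the IP addresses of the nodes that have joined the cluster.
set(ray.get([f.remote() for _ in range(1000)]))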
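Another way to check membership is to inspect the node table directly instead of running tasks. The sketch below assumes the list returned by ray.nodes() contains one dictionary per node with an "Alive" flag and a "NodeManagerAddress" entry; alive_node_addresses is a hypothetical helper name, and the key names should be checked against the Ray version in use.

```python
def alive_node_addresses(nodes):
    """Return the sorted, de-duplicated addresses of nodes marked alive.

    `nodes` is expected to look like the output of ray.nodes(): a list of
    dicts with (assumed) "Alive" and "NodeManagerAddress" keys.
    """
    return sorted(
        {node["NodeManagerAddress"] for node in nodes if node.get("Alive")}
    )
```

On a running cluster, this could be called as alive_node_addresses(ray.nodes()) after connecting with ray.init, and the length of the result compared with the number of machines that were started.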

Stopping Ray

When you want to stop the Ray processes, run ray stop on each node.
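Running the same command on every node by hand becomes tedious as the cluster grows, which is what tools like pssh address. As a rough Python equivalent, the sketch below fans one command out over ssh in parallel. It assumes password-less ssh access to each host; run_on_hosts and ssh_command are hypothetical helper names, and the runner parameter exists only so the function can be exercised without real machines.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ssh_command(host, remote_command):
    """Return the argv list that runs remote_command on host over ssh."""
    return ["ssh", host, remote_command]


def run_on_hosts(hosts, remote_command, runner=subprocess.run):
    """Run remote_command on every host in parallel.

    Returns a dict mapping each host to the exit code of its ssh process.
    """
    def run_one(host):
        result = runner(ssh_command(host, remote_command))
        return host, result.returncode

    # One thread per host; at least one worker so an empty list is allowed.
    with ThreadPoolExecutor(max_workers=max(1, len(hosts))) as pool:
        return dict(pool.map(run_one, hosts))
```

With a list of node hostnames, run_on_hosts(hosts, "ray stop") would stop Ray everywhere, and any non-zero value in the returned dict points at a node that needs attention.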