Launching Cloud Clusters

This section provides instructions for configuring the Ray Cluster Launcher to use with various cloud providers or on a private cluster of host machines.

See this blog post for a step by step guide to using the Ray Cluster Launcher.

To learn about deploying Ray on an existing Kubernetes cluster, refer to the guide here.

Ray with cloud providers

First, install boto (pip install boto3) and configure your AWS credentials in ~/.aws/credentials, as described in the boto docs.

Once boto is configured to manage resources on your AWS account, you should be ready to launch your cluster. The provided ray/python/ray/autoscaler/aws/example-full.yaml cluster config file will create a small cluster with an m5.large head node (on-demand) configured to autoscale up to two m5.large spot workers.

Test that it works by running the following commands from your local machine:

# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/aws/example-full.yaml

# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/aws/example-full.yaml
$ # Try running a Ray program with 'ray.init(address="auto")'.

# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/aws/example-full.yaml

See AWS Configurations for recipes on customizing AWS clusters.

Local On Premise Cluster (List of nodes)

You would use this mode if you want to run distributed Ray applications on some local nodes available on premise.

The most preferable way to run a Ray cluster on a private cluster of hosts is via the Ray Cluster Launcher.

There are two ways of running private clusters:

  • Manually managed, i.e., the user explicitly specifies the head and worker ips.

  • Automatically managed, i.e., the user only specifies a coordinator address to a coordinating server that automatically coordinates its head and worker ips.

Tip

To avoid getting the password prompt when running private clusters make sure to setup your ssh keys on the private cluster as follows:

$ ssh-keygen
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

You can get started by filling out the fields in the provided ray/python/ray/autoscaler/local/example-full.yaml. Be sure to specify the proper head_ip, list of worker_ips, and the ssh_user field.

Test that it works by running the following commands from your local machine:

# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to get a remote shell into the head node.
$ ray up ray/python/ray/autoscaler/local/example-full.yaml

# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/local/example-full.yaml
$ # Try running a Ray program with 'ray.init(address="auto")'.

# Tear down the cluster
$ ray down ray/python/ray/autoscaler/local/example-full.yaml

Manual Ray Cluster Setup

The most preferable way to run a Ray cluster is via the Ray Cluster Launcher. However, it is also possible to start a Ray cluster by hand.

This section assumes that you have a list of machines and that the nodes in the cluster can communicate with each other. It also assumes that Ray is installed on each machine. To install Ray, follow the installation instructions.

Starting Ray on each machine

On the head node (just choose some node to be the head node), run the following. If the --port argument is omitted, Ray will choose port 6379, falling back to a random port.

$ ray start --head --port=6379
...
Next steps
  To connect to this Ray runtime from another node, run
    ray start --address='<ip address>:6379' --redis-password='<password>'

If connection fails, check your firewall settings and network configuration.

The command will print out the address of the Redis server that was started (the local node IP address plus the port number you specified).

Then on each of the other nodes, run the following. Make sure to replace <address> with the value printed by the command on the head node (it should look something like 123.45.67.89:6379).

Note that if your compute nodes are on their own subnetwork with Network Address Translation, to connect from a regular machine outside that subnetwork, the command printed by the head node will not work. You need to find the address that will reach the head node from the second machine. If the head node has a domain address like compute04.berkeley.edu, you can simply use that in place of an IP address and rely on the DNS.

$ ray start --address=<address> --redis-password='<password>'
--------------------
Ray runtime started.
--------------------

To terminate the Ray runtime, run
  ray stop

If you wish to specify that a machine has 10 CPUs and 1 GPU, you can do this with the flags --num-cpus=10 and --num-gpus=1. See the Configuration page for more information.

If you see Unable to connect to Redis. If the Redis instance is on a different machine, check that your firewall is configured properly., this means the --port is inaccessible at the given IP address (because, for example, the head node is not actually running Ray, or you have the wrong IP address).

If you see Ray runtime started., then the node successfully connected to the IP address at the --port. You should now be able to connect to the cluster with ray.init(address='auto').

If ray.init(address='auto') keeps repeating redis_context.cc:303: Failed to connect to Redis, retrying., then the node is failing to connect to some other port(s) besides the main port.

If connection fails, check your firewall settings and network configuration.

If the connection fails, to check whether each port can be reached from a node, you can use a tool such as nmap or nc.

$ nmap -sV --reason -p $PORT $HEAD_ADDRESS
Nmap scan report for compute04.berkeley.edu (123.456.78.910)
Host is up, received echo-reply ttl 60 (0.00087s latency).
rDNS record for 123.456.78.910: compute04.berkeley.edu
PORT     STATE SERVICE REASON         VERSION
6379/tcp open  redis   syn-ack ttl 60 Redis key-value store
Service detection performed. Please report any incorrect results at https://nmap.org/submit/ .
$ nc -vv -z $HEAD_ADDRESS $PORT
Connection to compute04.berkeley.edu 6379 port [tcp/*] succeeded!

If the node cannot access that port at that IP address, you might see

$ nmap -sV --reason -p $PORT $HEAD_ADDRESS
Nmap scan report for compute04.berkeley.edu (123.456.78.910)
Host is up (0.0011s latency).
rDNS record for 123.456.78.910: compute04.berkeley.edu
PORT     STATE  SERVICE REASON       VERSION
6379/tcp closed redis   reset ttl 60
Service detection performed. Please report any incorrect results at https://nmap.org/submit/ .
$ nc -vv -z $HEAD_ADDRESS $PORT
nc: connect to compute04.berkeley.edu port 6379 (tcp) failed: Connection refused

Stopping Ray

When you want to stop the Ray processes, run ray stop on each node.

Additional Cloud Providers

To use Ray autoscaling on other Cloud providers or cluster management systems, you can implement the NodeProvider interface (100 LOC) and register it in node_provider.py. Contributions are welcome!

Security

On cloud providers, nodes will be launched into their own security group by default, with traffic allowed only between nodes in the same group. A new SSH key will also be created and saved to your local machine for access to the cluster.

Running a Ray program on the Ray cluster

To run a distributed Ray program, you’ll need to execute your program on the same machine as one of the nodes.

Within your program/script, you must call ray.init and add the address parameter to ray.init (like ray.init(address=...)). This causes Ray to connect to the existing cluster. For example:

ray.init(address="auto")

Note

A common mistake is setting the address to be a cluster node while running the script on your laptop. This will not work because the script needs to be started/executed on one of the Ray nodes.

To verify that the correct number of nodes have joined the cluster, you can run the following.

import time

@ray.remote
def f():
    time.sleep(0.01)
    return ray._private.services.get_node_ip_address()

# Get a list of the IP addresses of the nodes that have joined the cluster.
set(ray.get([f.remote() for _ in range(1000)]))

What’s Next?

Now that you have a working understanding of the cluster launcher, check out:

Questions or Issues?

You can post questions or issues or feedback through the following channels:

  1. Discussion Board: For questions about Ray usage or feature requests.

  2. GitHub Issues: For bug reports.

  3. Ray Slack: For getting in touch with Ray maintainers.

  4. StackOverflow: Use the [ray] tag questions about Ray.