Launching clusters on the cloud

This section provides instructions for configuring the Ray Cluster Launcher for use with AWS, Azure, or GCP, with an existing Kubernetes cluster, or with a private cluster of host machines.

See this blog post for a step-by-step guide to using the Ray Cluster Launcher.

AWS (EC2)

First, install boto3 (pip install boto3) and configure your AWS credentials in ~/.aws/credentials, as described in the boto docs.

Once boto is configured to manage resources on your AWS account, you should be ready to launch your cluster. The provided ray/python/ray/autoscaler/aws/example-full.yaml cluster config file will create a small cluster with an m5.large head node (on-demand) configured to autoscale up to two m5.large spot workers.
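
For reference, the relevant fields of that config look roughly like the following. This is an abridged, illustrative sketch with placeholder values; the shipped example-full.yaml is the authoritative version.

max_workers: 2
provider:
    type: aws
    region: us-west-2          # choose the region your credentials can use
head_node:
    InstanceType: m5.large     # on-demand head node
worker_nodes:
    InstanceType: m5.large
    InstanceMarketOptions:
        MarketType: spot       # workers are requested as spot instances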

Test that it works by running the following commands from your local machine:

# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/aws/example-full.yaml

# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/aws/example-full.yaml
$ source activate tensorflow_p36
$ # Try running a Ray program with 'ray.init(address="auto")'.

# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/aws/example-full.yaml

Tip

For the AWS node configuration, you can set "ImageId: latest_dlami" to automatically use the newest Deep Learning AMI for your region. For example, head_node: {InstanceType: c5.xlarge, ImageId: latest_dlami}.
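
Written out in block form (illustrative only), that setting looks like:

head_node:
    InstanceType: c5.xlarge
    ImageId: latest_dlami    # resolved to the newest Deep Learning AMI for your region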

Using Amazon EFS

To use Amazon EFS, install some utilities and mount the EFS in setup_commands. Note that these instructions only work if you are using the AWS Autoscaler.

Note

You need to replace {{FileSystemId}} with your own EFS ID before using the config. You may also need to set the correct SecurityGroupIds for the instances in the config file.

setup_commands:
    - sudo kill -9 `sudo lsof /var/lib/dpkg/lock-frontend | awk '{print $2}' | tail -n 1`;
        sudo pkill -9 apt-get;
        sudo pkill -9 dpkg;
        sudo dpkg --configure -a;
        sudo apt-get -y install binutils;
        cd $HOME;
        git clone https://github.com/aws/efs-utils;
        cd $HOME/efs-utils;
        ./build-deb.sh;
        sudo apt-get -y install ./build/amazon-efs-utils*deb;
        cd $HOME;
        mkdir efs;
        sudo mount -t efs {{FileSystemId}}:/ efs;
        sudo chmod 777 efs;

Azure

First, install the Azure CLI (pip install azure-cli azure-core), then log in with az login.

Set the subscription to use either from the command line (az account set -s <subscription_id>) or by modifying the provider section of the provided config, e.g. ray/python/ray/autoscaler/azure/example-full.yaml.

Once the Azure CLI is configured to manage resources on your Azure account, you should be ready to launch your cluster. The provided ray/python/ray/autoscaler/azure/example-full.yaml cluster config file will create a small cluster with a Standard DS2v3 head node (on-demand) configured to autoscale up to two Standard DS2v3 spot workers. Note that you’ll need to fill in your resource group and location in those templates.
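
For illustration, the provider section you would edit looks roughly like this (placeholder values; check the shipped example-full.yaml for the exact fields):

provider:
    type: azure
    location: westus2                   # fill in your location
    resource_group: ray-cluster         # fill in your resource group
    subscription_id: <subscription_id>  # optional if already set via `az account set`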

Test that it works by running the following commands from your local machine:

# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/azure/example-full.yaml

# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/azure/example-full.yaml
# test ray setup
# enable conda environment
$ exec bash -l
$ conda activate py37_tensorflow
$ python -c 'import ray; ray.init()'
$ exit
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/azure/example-full.yaml

Azure Portal

Alternatively, you can deploy a cluster directly from the Azure portal. Note that in this case autoscaling is handled by Azure VM Scale Sets rather than by the Ray autoscaler. The template deploys Azure Data Science VMs (DSVM) for both the head node and the auto-scalable cluster managed by Azure Virtual Machine Scale Sets. The head node conveniently exposes both SSH and JupyterLab.

Deploy to Azure

Once the template is successfully deployed, the deployment output page provides the SSH command to connect and the link to the JupyterHub on the head node (username/password as specified in the template inputs). Use the following code in a Jupyter notebook to connect to the Ray cluster.

import ray
ray.init(address='auto')

Note that on each node the azure-init.sh script is executed and performs the following actions:

  1. Activates one of the conda environments available on DSVM

  2. Installs Ray and any other user-specified dependencies

  3. Sets up a systemd task (/lib/systemd/system/ray.service) to start Ray in head or worker mode

GCP

First, install the Google API client (pip install google-api-python-client), set up your GCP credentials, and create a new GCP project.

Once the API client is configured to manage resources on your GCP account, you should be ready to launch your cluster. The provided ray/python/ray/autoscaler/gcp/example-full.yaml cluster config file will create a small cluster with an n1-standard-2 head node (on-demand) configured to autoscale up to two n1-standard-2 preemptible workers. Note that you’ll need to fill in your project id in those templates.
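
The project id goes in the provider section, which looks roughly like this (illustrative placeholder values; see the shipped example-full.yaml for the full config):

provider:
    type: gcp
    region: us-west1
    availability_zone: us-west1-a
    project_id: <your-project-id>   # fill in your GCP project id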

Test that it works by running the following commands from your local machine:

# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/gcp/example-full.yaml

# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/gcp/example-full.yaml
$ source activate tensorflow_p36
$ # Try running a Ray program with 'ray.init(address="auto")'.

# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/gcp/example-full.yaml

Kubernetes

The cluster launcher can also be used to start Ray clusters on an existing Kubernetes cluster. First, install the Kubernetes API client (pip install kubernetes), then make sure your Kubernetes credentials are set up properly to access the cluster (if a command like kubectl get pods succeeds, you should be good to go).

Once you have kubectl configured locally to access the remote cluster, you should be ready to launch your cluster. The provided ray/python/ray/autoscaler/kubernetes/example-full.yaml cluster config file will create a small cluster of one pod for the head node configured to autoscale up to two worker node pods, with all pods requiring 1 CPU and 0.5GiB of memory.
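
Those resource requirements correspond to a pod spec fragment along these lines (an illustrative excerpt, not the full pod template from example-full.yaml):

resources:
    requests:
        cpu: 1000m      # 1 CPU per pod
        memory: 512Mi   # 0.5GiB per pod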

Test that it works by running the following commands from your local machine:

# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to get a remote shell into the head node.
$ ray up ray/python/ray/autoscaler/kubernetes/example-full.yaml

# List the pods running in the cluster. You should only see one head node
# until you start running an application, at which point worker nodes
# should be started. Don't forget to include the Ray namespace in your
# 'kubectl' commands ('ray' by default).
$ kubectl -n ray get pods

# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/kubernetes/example-full.yaml
$ # Try running a Ray program with 'ray.init(address="auto")'.

# Tear down the cluster
$ ray down ray/python/ray/autoscaler/kubernetes/example-full.yaml

Tip

This section describes the easiest way to launch a Ray cluster on Kubernetes. See this document for advanced usage of Kubernetes with Ray.

Private Cluster (List of nodes)

The preferred way to run a Ray cluster on a private cluster of hosts is via the Ray Cluster Launcher.

You can get started by filling out the fields in the provided ray/python/ray/autoscaler/local/example-full.yaml. Be sure to specify the proper head_ip, the list of worker_ips, and the ssh_user field.
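
For example, a minimal local config might contain fields like these (hypothetical addresses and user; adapt them to your hosts):

provider:
    type: local
    head_ip: 192.168.1.10
    worker_ips: [192.168.1.11, 192.168.1.12]
auth:
    ssh_user: ubuntu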

Test that it works by running the following commands from your local machine:

# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to get a remote shell into the head node.
$ ray up ray/python/ray/autoscaler/local/example-full.yaml

# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/local/example-full.yaml
$ # Try running a Ray program with 'ray.init(address="auto")'.

# Tear down the cluster
$ ray down ray/python/ray/autoscaler/local/example-full.yaml

External Node Provider

Ray also supports external node providers (see the node_provider.py implementation). You can specify an external node provider in the YAML config:

provider:
    type: external
    module: mypackage.myclass

The module needs to be in the format package.provider_class or package.sub_package.provider_class.
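
For example, a provider class that lives in a sub-package is referenced with the same dotted path (hypothetical names):

provider:
    type: external
    module: mypackage.sub_package.MyNodeProvider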

Additional Cloud Providers

To use Ray autoscaling on other cloud providers or cluster management systems, you can implement the NodeProvider interface (~100 LOC) and register it in node_provider.py. Contributions are welcome!

Security

On cloud providers, nodes will be launched into their own security group by default, with traffic allowed only between nodes in the same group. A new SSH key will also be created and saved to your local machine for access to the cluster.