Cluster YAML Configuration Options

The cluster configuration is defined within a YAML file that will be used by the Cluster Launcher to launch the head node, and by the Autoscaler to launch worker nodes. Once the cluster configuration is defined, you will need to use the Ray CLI to perform any operations such as starting and stopping the cluster.

Custom types

Auth

Node types

The available_node_types object’s keys represent the names of the different node types.

Deleting a node type from available_node_types and updating with ray up will cause the autoscaler to scale down all nodes of that type. In particular, changing the key of a node type object will result in removal of nodes corresponding to the old key; nodes with the new key name will then be created according to cluster configuration and Ray resource demands.

<node_type_1_name>:
    node_config:
        Node config
    resources:
        Resources
    min_workers: int
    max_workers: int
    worker_setup_commands:
        - str
    docker:
        Node Docker
<node_type_2_name>:
    ...
...
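
As an illustration of the schema above, the following sketch defines two node types. The names cpu_worker and gpu_worker, the instance types, and the worker counts are placeholders chosen for this example, not defaults. Renaming either key and running ray up would, as described above, scale down the nodes launched under the old name.

```yaml
available_node_types:
    # Hypothetical node type names; any user-chosen key can be used.
    cpu_worker:
        node_config:
            InstanceType: m5.large
        resources: {"CPU": 2}
        min_workers: 1
        max_workers: 4
    gpu_worker:
        node_config:
            InstanceType: p3.2xlarge
        resources: {"CPU": 8, "GPU": 1}
        min_workers: 0
        max_workers: 2
```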

Node config

Cloud-specific configuration for nodes of a given node type.

Modifying the node_config and updating with ray up will cause the autoscaler to scale down all existing nodes of the node type; nodes with the newly applied node_config will then be created according to cluster configuration and Ray resource demands.

A YAML object which conforms to the EC2 create_instances API in the AWS docs.

Resources

CPU: int
GPU: int
object_store_memory: int
memory: int
<custom_resource1>: int
<custom_resource2>: int
...

File mounts

<path1_on_remote_machine>: str # Path 1 on local machine
<path2_on_remote_machine>: str # Path 2 on local machine
...
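
A minimal sketch of the mapping, using made-up paths: each key is the destination on the remote node and each value is the source on the local machine.

```yaml
file_mounts:
    # REMOTE_PATH: LOCAL_PATH (both sides are placeholder paths)
    "/home/ubuntu/project": "/path/to/local/project"
    "/home/ubuntu/data.csv": "/path/to/local/data.csv"
```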

Properties and Definitions

cluster_name

The name of the cluster. This is the namespace of the cluster.

  • Required: Yes

  • Importance: High

  • Type: String

  • Default: “default”

  • Pattern: [a-zA-Z0-9_]+

max_workers

The maximum number of workers the cluster will have at any given time.

  • Required: No

  • Importance: High

  • Type: Integer

  • Default: 2

  • Minimum: 0

  • Maximum: Unbounded

upscaling_speed

The number of nodes allowed to be pending as a multiple of the current number of nodes. For example, if set to 1.0, the cluster can grow in size by at most 100% at any time, so if the cluster currently has 20 nodes, at most 20 pending launches are allowed.

  • Required: No

  • Importance: Medium

  • Type: Float

  • Default: 1.0

  • Minimum: 0.0

  • Maximum: Unbounded
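
To make the arithmetic concrete, the sketch below applies the definition above to a cluster that currently has 20 nodes. The node count and the chosen value are only examples, and the autoscaler may apply limits beyond what is described here.

```yaml
# With 20 nodes currently in the cluster:
#   upscaling_speed: 0.5  -> at most 10 pending launches
#   upscaling_speed: 1.0  -> at most 20 pending launches
#   upscaling_speed: 2.0  -> at most 40 pending launches
upscaling_speed: 2.0
```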

idle_timeout_minutes

The number of minutes that need to pass before an idle worker node is removed by the Autoscaler.

  • Required: No

  • Importance: Medium

  • Type: Integer

  • Default: 5

  • Minimum: 0

  • Maximum: Unbounded

docker

Configure Ray to run in Docker containers.

  • Required: No

  • Importance: High

  • Type: Docker

  • Default: {}

In rare cases when Docker is not available on the system by default (e.g., bad AMI), add the following commands to initialization_commands to install it.

initialization_commands:
    - curl -fsSL https://get.docker.com -o get-docker.sh
    - sudo sh get-docker.sh
    - sudo usermod -aG docker $USER
    - sudo systemctl restart docker -f

provider

The cloud provider-specific configuration properties.

  • Required: Yes

  • Importance: High

  • Type: Provider

auth

Authentication credentials that Ray will use to launch nodes.

  • Required: Yes

  • Importance: High

  • Type: Auth

available_node_types

Tells the autoscaler the allowed node types and the resources they provide. Each node type is identified by a user-specified key.

  • Required: No

  • Importance: High

  • Type: Node types

  • Default:

available_node_types:
  ray.head.default:
      node_config:
        InstanceType: m5.large
        BlockDeviceMappings:
            - DeviceName: /dev/sda1
              Ebs:
                  VolumeSize: 100
      resources: {"CPU": 2}
      min_workers: 0
      max_workers: 0
  ray.worker.default:
      node_config:
        InstanceType: m5.large
        InstanceMarketOptions:
            MarketType: spot
      resources: {"CPU": 2}
      min_workers: 0

head_node_type

The key for one of the node types in available_node_types. This node type will be used to launch the head node.

If the field head_node_type is changed and an update is executed with ray up, the currently running head node will be considered outdated. The user will receive a prompt asking to confirm scale-down of the outdated head node, and the cluster will restart with a new head node. Changing the node_config of the node type whose key is given by head_node_type will also result in a cluster restart after a user prompt.

  • Required: Yes

  • Importance: High

  • Type: String

  • Pattern: [a-zA-Z0-9_]+

file_mounts

The files or directories to copy to the head and worker nodes.

  • Required: No

  • Importance: High

  • Type: File mounts

  • Default: []

cluster_synced_files

A list of paths to the files or directories to copy from the head node to the worker nodes. The same path on the head node will be copied to the worker node. This behavior is a subset of the file_mounts behavior, so in the vast majority of cases one should just use file_mounts.

  • Required: No

  • Importance: Low

  • Type: List of String

  • Default: []

rsync_exclude

A list of patterns for files to exclude when running rsync up or rsync down. The filter is applied on the source directory only.

Example for a pattern in the list: **/.git/**.

  • Required: No

  • Importance: Low

  • Type: List of String

  • Default: []

rsync_filter

A list of pattern files used to exclude files when running rsync up or rsync down. Each pattern file is searched for in the source directory and recursively through all subdirectories, and the patterns it contains are applied as filters.

Example for a pattern in the list: .gitignore.

  • Required: No

  • Importance: Low

  • Type: List of String

  • Default: []
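
As a sketch of how rsync_exclude and rsync_filter can be used together, the fragment below skips Git metadata by pattern and additionally honors any .gitignore files found in the source tree. The values are examples, not defaults.

```yaml
# Patterns matched directly against paths in the source directory.
rsync_exclude:
    - "**/.git"
    - "**/.git/**"
# Pattern files looked up in the source directory and its subdirectories.
rsync_filter:
    - ".gitignore"
```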

initialization_commands

A list of commands that will be run before the setup commands. If Docker is enabled, these commands will run outside the container and before Docker is set up.

  • Required: No

  • Importance: Medium

  • Type: List of String

  • Default: []

setup_commands

A list of commands to run to set up nodes. These commands will always run on the head and worker nodes and will be merged with head setup commands for head and with worker setup commands for workers.

  • Required: No

  • Importance: Medium

  • Type: List of String

  • Default:

# Default setup_commands:
setup_commands:
  - echo 'export PATH="$HOME/anaconda3/envs/tensorflow_p36/bin:$PATH"' >> ~/.bashrc
  - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp36-cp36m-manylinux2014_x86_64.whl

  • Setup commands should ideally be idempotent (i.e., can be run multiple times without changing the result); this allows Ray to safely update nodes after they have been created. You can usually make commands idempotent with small modifications, e.g. git clone foo can be rewritten as test -e foo || git clone foo which checks if the repo is already cloned first.

  • Setup commands are run sequentially but separately. For example, if you are using Anaconda, you need to run conda activate env && pip install -U ray as a single command, because splitting it into two setup commands will not work.

  • Ideally, you should avoid using setup_commands by creating a docker image with all the dependencies preinstalled to minimize startup time.

  • Tip: if you also want to run apt-get commands during setup, add the following list of commands:

    setup_commands:
      - sudo pkill -9 apt-get || true
      - sudo pkill -9 dpkg || true
      - sudo dpkg --configure -a
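
The points about idempotency and separate execution can be sketched as follows. The repository URL, the directory name foo, and the conda environment name env are placeholders for illustration, not real defaults.

```yaml
setup_commands:
    # Idempotent: clone only if the directory does not exist yet.
    - test -e foo || git clone https://example.com/foo.git foo
    # Each list entry runs separately, so environment activation and
    # installation have to share a single command.
    - conda activate env && pip install -U ray
```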
    

head_setup_commands

A list of commands to run to set up the head node. These commands will be merged with the general setup commands.

  • Required: No

  • Importance: Low

  • Type: List of String

  • Default: []

worker_setup_commands

A list of commands to run to set up the worker nodes. These commands will be merged with the general setup commands.

  • Required: No

  • Importance: Low

  • Type: List of String

  • Default: []
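
As a sketch of how the three lists combine, consider the fragment below (the package names are arbitrary examples). Following the descriptions above, the head node would run the general command plus the head-only command, while worker nodes would run the general command plus the worker-only command.

```yaml
setup_commands:
    - pip install -U numpy          # runs on the head and on workers
head_setup_commands:
    - pip install -U jupyterlab     # head node only
worker_setup_commands:
    - pip install -U psutil         # worker nodes only
```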

head_start_ray_commands

Commands to start Ray on the head node. You don’t need to change this.

  • Required: No

  • Importance: Low

  • Type: List of String

  • Default:

head_start_ray_commands:
  - ray stop
  - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands

Commands to start Ray on worker nodes. You don’t need to change this.

  • Required: No

  • Importance: Low

  • Type: List of String

  • Default:

worker_start_ray_commands:
  - ray stop
  - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

docker.image

The default Docker image to pull in the head and worker nodes. This can be overridden by the head_image and worker_image fields. If neither image nor (head_image and worker_image) are specified, Ray will not use Docker.

  • Required: Yes (If Docker is in use.)

  • Importance: High

  • Type: String

The Ray project provides Docker images on DockerHub. The repository includes the following images:

  • rayproject/ray-ml:latest-gpu: CUDA support, includes ML dependencies.

  • rayproject/ray:latest-gpu: CUDA support, no ML dependencies.

  • rayproject/ray-ml:latest: No CUDA support, includes ML dependencies.

  • rayproject/ray:latest: No CUDA support, no ML dependencies.

docker.head_image

Docker image for the head node to override the default docker image.

  • Required: No

  • Importance: Low

  • Type: String

docker.worker_image

Docker image for the worker nodes to override the default docker image.

  • Required: No

  • Importance: Low

  • Type: String
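
A hedged example of the image override fields: the fragment below assumes a GPU head node with CPU-only workers, so the head overrides the default image while workers fall back to docker.image. The image tags are taken from the list above; the pairing is only an illustration.

```yaml
docker:
    # Default image, used by any node without a more specific override.
    image: "rayproject/ray-ml:latest"
    # Overrides the default image on the head node only.
    head_image: "rayproject/ray-ml:latest-gpu"
    container_name: "ray_container"
```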

docker.container_name

The name to use when starting the Docker container.

  • Required: Yes (If Docker is in use.)

  • Importance: Low

  • Type: String

  • Default: ray_container

docker.pull_before_run

If enabled, the latest version of the image will be pulled when starting Docker. If disabled, docker run will only pull the image if no cached version is present.

  • Required: No

  • Importance: Medium

  • Type: Boolean

  • Default: True

docker.run_options

The extra options to pass to docker run.

  • Required: No

  • Importance: Medium

  • Type: List of String

  • Default: []

docker.head_run_options

The extra options to pass to docker run for head node only.

  • Required: No

  • Importance: Low

  • Type: List of String

  • Default: []

docker.worker_run_options

The extra options to pass to docker run for worker nodes only.

  • Required: No

  • Importance: Low

  • Type: List of String

  • Default: []

docker.disable_automatic_runtime_detection

If enabled, Ray will not try to use the NVIDIA Container Runtime if GPUs are present.

  • Required: No

  • Importance: Low

  • Type: Boolean

  • Default: False

docker.disable_shm_size_detection

If enabled, Ray will not automatically specify the size of /dev/shm for the started container, and the runtime’s default value (64MiB for Docker) will be used. If --shm-size=<> is manually added to run_options, this option is automatically set to True, meaning that Ray will defer to the user-provided value.

  • Required: No

  • Importance: Low

  • Type: Boolean

  • Default: False
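
For illustration, the sketch below passes an explicit shared-memory size through run_options. The 2g value is an arbitrary example; per the description above, supplying --shm-size this way means Ray defers to the user-provided value.

```yaml
docker:
    image: "rayproject/ray:latest"
    container_name: "ray_container"
    run_options:
        # User-provided /dev/shm size (example value).
        - "--shm-size=2g"
```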

auth.ssh_user

The user that Ray will authenticate with when launching new nodes.

  • Required: Yes

  • Importance: High

  • Type: String

auth.ssh_private_key

The path to an existing private key for Ray to use. If not configured, Ray will create a new private keypair (default behavior). If configured, the key must be added to the project-wide metadata and KeyName has to be defined in the node configuration.

  • Required: No

  • Importance: Low

  • Type: String
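
A minimal sketch of using an existing key on AWS. The key path and the KeyName value my-key-name are placeholders; KeyName is assumed to name the EC2 key pair that matches the private key file.

```yaml
auth:
    ssh_user: ubuntu
    ssh_private_key: /path/to/your/key.pem

available_node_types:
    ray.head.default:
        node_config:
            InstanceType: m5.large
            # Placeholder: name of the EC2 key pair matching the key above.
            KeyName: my-key-name
```

The same KeyName would also need to be set in the node_config of each worker node type.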

auth.ssh_public_key

Not available.

provider.type

The cloud service provider. For AWS, this must be set to aws.

  • Required: Yes

  • Importance: High

  • Type: String

provider.region

The region to use for deployment of the Ray cluster.

  • Required: Yes

  • Importance: High

  • Type: String

  • Default: us-west-2

provider.availability_zone

A string specifying a comma-separated list of availability zone(s) that nodes may be launched in.

  • Required: No

  • Importance: Low

  • Type: String

  • Default: us-west-2a,us-west-2b

provider.location

Not available.

provider.resource_group

Not available.

provider.subscription_id

Not available.

provider.project_id

Not available.

provider.cache_stopped_nodes

If enabled, nodes will be stopped when the cluster scales down. If disabled, nodes will be terminated instead. Stopped nodes launch faster than terminated nodes.

  • Required: No

  • Importance: Low

  • Type: Boolean

  • Default: True
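
Putting the provider fields together, a sketch for AWS might look like the fragment below. The region and zones mirror the defaults listed above; disabling cache_stopped_nodes is shown only to illustrate the option.

```yaml
provider:
    type: aws
    region: us-west-2
    # Comma-separated list of zones that nodes may be launched in.
    availability_zone: us-west-2a,us-west-2b
    # Terminate nodes on scale-down instead of stopping them.
    cache_stopped_nodes: False
```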

available_node_types.<node_type_name>.node_type.node_config

The configuration to be used to launch the nodes on the cloud service provider. Among other things, this will specify the instance type to be launched.

available_node_types.<node_type_name>.node_type.resources

The resources that a node type provides, which enables the autoscaler to automatically select the right type of nodes to launch given the resource demands of the application. The resources specified will be automatically passed to the ray start command for the node via an environment variable. If not provided, the Autoscaler can automatically detect them only for AWS/Kubernetes cloud providers. For more information, see also the resource demand scheduler.

  • Required: Yes (except for AWS/K8s)

  • Importance: High

  • Type: Resources

  • Default: {}

In some cases, adding special nodes without any resources may be desirable. Such nodes can be used as a driver which connects to the cluster to launch jobs. In order to manually add a node to an autoscaled cluster, the ray-cluster-name tag should be set and ray-node-type tag should be set to unmanaged. Unmanaged nodes can be created by setting the resources to {} and the maximum workers to 0. The Autoscaler will not attempt to start, stop, or update unmanaged nodes. The user is responsible for properly setting up and cleaning up unmanaged nodes.

available_node_types.<node_type_name>.node_type.min_workers

The minimum number of workers to maintain for this node type regardless of utilization.

  • Required: No

  • Importance: High

  • Type: Integer

  • Default: 0

  • Minimum: 0

  • Maximum: Unbounded

available_node_types.<node_type_name>.node_type.max_workers

The maximum number of workers to have in the cluster for this node type regardless of utilization. This takes precedence over minimum workers. By default, the number of workers of a node type is unbounded, constrained only by the cluster-wide max_workers. (Prior to Ray 1.3.0, the default value for this field was 0.)

  • Required: No

  • Importance: High

  • Type: Integer

  • Default: cluster-wide max_workers

  • Minimum: 0

  • Maximum: cluster-wide max_workers
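
The interaction between per-type limits and the cluster-wide limit can be sketched as follows. The node type names and the numbers are hypothetical.

```yaml
max_workers: 10            # cluster-wide cap across all worker types

available_node_types:
    cpu_worker:            # hypothetical node type name
        node_config:
            InstanceType: m5.large
        resources: {"CPU": 2}
        min_workers: 2     # keep at least two of these running
        max_workers: 8     # never more than eight of this type
    spot_worker:           # hypothetical node type name
        node_config:
            InstanceType: m5.large
            InstanceMarketOptions:
                MarketType: spot
        resources: {"CPU": 2}
        # max_workers omitted: bounded only by the cluster-wide value
```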

available_node_types.<node_type_name>.node_type.worker_setup_commands

A list of commands to run to set up worker nodes of this type. These commands will replace the general worker setup commands for the node.

  • Required: No

  • Importance: Low

  • Type: List of String

  • Default: []
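
As a sketch of the replace behavior described above, the fragment below gives one node type its own setup list. The node type name and the packages are hypothetical examples.

```yaml
worker_setup_commands:
    - pip install -U psutil            # general worker setup

available_node_types:
    gpu_worker:                        # hypothetical node type name
        node_config:
            InstanceType: p3.2xlarge
        resources: {"CPU": 8, "GPU": 1}
        # Replaces (does not extend) the general worker_setup_commands
        # for nodes of this type.
        worker_setup_commands:
            - pip install -U torch
```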

available_node_types.<node_type_name>.node_type.resources.CPU

The number of CPUs made available by this node. If not configured, Autoscaler can automatically detect them only for AWS/Kubernetes cloud providers.

  • Required: Yes (except for AWS/K8s)

  • Importance: High

  • Type: Integer

available_node_types.<node_type_name>.node_type.resources.GPU

The number of GPUs made available by this node. If not configured, Autoscaler can automatically detect them only for AWS/Kubernetes cloud providers.

  • Required: No

  • Importance: Low

  • Type: Integer

available_node_types.<node_type_name>.node_type.resources.memory

The memory in bytes allocated for Python worker heap memory on the node. If not configured, the Autoscaler will automatically detect the amount of RAM on the node for AWS/Kubernetes and allocate 70% of it for the heap.

  • Required: No

  • Importance: Low

  • Type: Integer

available_node_types.<node_type_name>.node_type.resources.object_store_memory

The memory in bytes allocated for the object store on the node. If not configured, Autoscaler will automatically detect the amount of RAM on the node for AWS/Kubernetes and allocate 30% of it for the object store.

  • Required: No

  • Importance: Low

  • Type: Integer
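
A hedged example of setting these resources explicitly instead of relying on auto-detection. The node type name and the sizes are illustrative, and the key spelling object_store_memory follows the Resources syntax listing earlier in this document.

```yaml
available_node_types:
    cpu_worker:                              # hypothetical node type name
        node_config:
            InstanceType: m5.large
        resources:
            CPU: 2
            memory: 4294967296               # 4 GiB, in bytes
            object_store_memory: 2147483648  # 2 GiB, in bytes
```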

available_node_types.<node_type_name>.docker

A set of overrides to the top-level Docker configuration.

  • Required: No

  • Importance: Low

  • Type: Docker

  • Default: {}
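
A sketch of a per-node-type override, assuming that the worker_image field from the top-level docker section is also accepted inside a node type's docker block. The node type name is hypothetical.

```yaml
docker:
    image: "rayproject/ray-ml:latest"
    container_name: "ray_container"

available_node_types:
    gpu_worker:                        # hypothetical node type name
        node_config:
            InstanceType: p3.2xlarge
        resources: {"CPU": 8, "GPU": 1}
        # Overrides the top-level Docker settings for this node type only.
        docker:
            worker_image: "rayproject/ray-ml:latest-gpu"
```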

Examples

Minimal configuration

# A unique identifier for the head node and workers of this cluster.
cluster_name: minimal

# The maximum number of worker nodes to launch in addition to the head
# node. min_workers defaults to 0.
max_workers: 1

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-west-2
    availability_zone: us-west-2a

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu

Full configuration

# A unique identifier for the head node and workers of this cluster.
cluster_name: default

# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 2

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker:
    image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
    # image: rayproject/ray:latest-gpu   # use this one if you don't need ML dependencies, it's faster to pull
    container_name: "ray_container"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: True
    run_options: []  # Extra options to pass into "docker run"

    # Example of running a GPU head with CPU workers
    # head_image: "rayproject/ray-ml:latest-gpu"
    # Allow Ray to automatically detect GPUs

    # worker_image: "rayproject/ray-ml:latest-cpu"
    # worker_run_options: []

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-west-2
    # Availability zone(s), comma-separated, that nodes may be launched in.
    # Nodes are currently spread between zones by a round-robin approach,
    # however this implementation detail should not be relied upon.
    availability_zone: us-west-2a,us-west-2b
    # Whether to allow node reuse. If set to False, nodes will be terminated
    # instead of stopped.
    cache_stopped_nodes: True # If not present, the default is True.

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below.
#    ssh_private_key: /path/to/your/key.pem

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray.head.default:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 0
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 0
        # The node type's CPU and GPU resources are auto-detected based on AWS instance type.
        # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
        # You can also set custom resources.
        # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
        resources: {}
        # Provider-specific config for this node type, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as SubnetId and KeyName.
        # For more documentation on available fields, see:
        # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
        node_config:
            InstanceType: m5.large
            ImageId: ami-0a2363a9cff180a64 # Deep Learning AMI (Ubuntu) Version 30
            # You can provision additional disk space with a conf as follows
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 100
            # Additional options in the boto docs.
    ray.worker.default:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 0
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 2
        # The node type's CPU and GPU resources are auto-detected based on AWS instance type.
        # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
        # You can also set custom resources.
        # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
        resources: {}
        # Provider-specific config for this node type, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as SubnetId and KeyName.
        # For more documentation on available fields, see:
        # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
        node_config:
            InstanceType: m5.large
            ImageId: ami-0a2363a9cff180a64 # Deep Learning AMI (Ubuntu) Version 30
            # Run workers on spot by default. Comment this out to use on-demand.
            InstanceMarketOptions:
                MarketType: spot
                # Additional options can be found in the boto docs, e.g.
                #   SpotOptions:
                #       MaxPrice: MAX_HOURLY_PRICE
            # Additional options in the boto docs.

# Specify the node type of the head node (as configured above).
head_node_type: ray.head.default

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is set up.
initialization_commands: []

# List of shell commands to run to set up nodes.
setup_commands: []
    # Note: if you're developing Ray, you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
    # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

head_node: {}
worker_nodes: {}