RaySGD: Deep Learning on Ray

Tip

Get in touch with us if you’re using or considering using RaySGD!

RaySGD is a lightweight library for distributed deep learning, allowing you to scale up and speed up training for your deep learning models.

The main features are:

  • Ease of use: Scale your single-process training code to a cluster in just a couple of lines of code.

  • Composability: RaySGD interoperates with Ray Tune to tune your distributed model and with Ray Datasets to train on large amounts of data.

  • Interactivity: RaySGD fits into your workflow and can be run from any environment, including seamless Jupyter notebook support.

Note

This API is in its Alpha release (as of Ray 1.7) and may be revised in future Ray releases. If you encounter any bugs, please file an issue on GitHub. If you are looking for the previous API documentation, see RaySGD: Distributed Training Wrappers.

Intro to RaySGD

RaySGD is a library that aims to simplify distributed deep learning.

Frameworks: RaySGD is built to abstract away the coordination and configuration setup of distributed deep learning frameworks such as PyTorch Distributed and TensorFlow Distributed, allowing users to focus only on implementing training logic.

  • For PyTorch, RaySGD automatically handles the construction of the distributed process group (see the sketch after this list).

  • For TensorFlow, RaySGD automatically handles the coordination of the TF_CONFIG environment variable. The current implementation assumes that the user will use a MultiWorkerMirroredStrategy, but this will change in the near future.

  • For Horovod, RaySGD automatically handles the construction of the Horovod runtime and Rendezvous server.
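
For example, because the process group is constructed for you, a PyTorch training function can query its rank and world size directly with standard torch.distributed calls. This is a minimal sketch that uses only PyTorch APIs; running a function like this through a Trainer is shown in the Quick Start below.

import torch.distributed as dist

def train_func():
    # RaySGD sets up the process group before this function runs, so the
    # usual torch.distributed queries work without calling
    # init_process_group in user code.
    rank = dist.get_rank()              # this worker's index
    world_size = dist.get_world_size()  # total number of workers
    print(f"Worker {rank} of {world_size} is ready.")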

Built for data scientists/ML practitioners: RaySGD has support for standard ML tools and features that practitioners love:

  • Callbacks for early stopping

  • Checkpointing (see the sketch after this list)

  • Integration with TensorBoard, Weights & Biases, and MLflow

  • Jupyter notebooks
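
As an illustration, metrics and checkpoints are typically produced from inside the training function and picked up by logging callbacks passed to the Trainer (the Trainer itself is introduced in the Quick Start below). The sketch below assumes the Alpha API exposes report and save_checkpoint helpers plus a JsonLoggerCallback; treat these exact names and locations as assumptions and check the API reference for your Ray version.

from ray.util.sgd import v2 as sgd
from ray.util.sgd.v2 import Trainer
from ray.util.sgd.v2.callbacks import JsonLoggerCallback  # assumed module path

def train_func():
    for epoch in range(3):
        loss = 1.0 / (epoch + 1)  # stand-in for a real training epoch
        # Assumed helpers: report metrics to the attached callbacks and
        # persist a small checkpoint dict.
        sgd.report(epoch=epoch, loss=loss)
        sgd.save_checkpoint(epoch=epoch, loss=loss)

trainer = Trainer(backend="torch", num_workers=2)
trainer.start()
trainer.run(train_func, callbacks=[JsonLoggerCallback()])
trainer.shutdown()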

Integration with the Ray Ecosystem: Distributed deep learning often comes with a lot of complexity, and RaySGD integrates with the rest of the Ray ecosystem to manage it:

  • Use Ray Datasets with RaySGD to handle and train on large amounts of data.

  • Use Ray Tune with RaySGD to leverage cutting-edge hyperparameter tuning techniques and distribute both your training and tuning (see the sketch after this list).

  • Use the Ray cluster launcher to launch autoscaling or spot instance clusters and train your model at scale on any cloud.
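
For example, a Trainer can be converted into a Tune trainable so that Tune schedules distributed training runs across a hyperparameter sweep. The sketch below assumes a Trainer.to_tune_trainable helper and a training function that receives the sampled hyperparameters through a config dict; verify both against the API reference for your Ray version.

from ray import tune
from ray.util.sgd.v2 import Trainer

def train_func(config):
    # Assumed plumbing: the Tune-sampled hyperparameters arrive here
    # through the `config` dict on every worker.
    lr = config["lr"]
    return lr  # stand-in for a real training loop

trainer = Trainer(backend="torch", num_workers=2)
trainable = trainer.to_tune_trainable(train_func)  # assumed helper
analysis = tune.run(trainable, config={"lr": tune.loguniform(1e-4, 1e-1)})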

Quick Start

RaySGD abstracts away the complexity of setting up a distributed training system. Let’s take the following simple example:

This example shows how you can use RaySGD with PyTorch.

First, set up your dataset and model.

import torch
import torch.nn as nn

num_samples = 20
input_size = 10
layer_size = 15
output_size = 5

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.layer1 = nn.Linear(input_size, layer_size)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(layer_size, output_size)

    def forward(self, input):
        return self.layer2(self.relu(self.layer1(input)))

# In this example we use a randomly generated dataset.
input = torch.randn(num_samples, input_size)
labels = torch.randn(num_samples, output_size)

Now define your single-worker PyTorch training function.

import torch.optim as optim

def train_func():
    num_epochs = 3
    model = NeuralNetwork()
    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(model.parameters(), lr=0.1)

    for epoch in range(num_epochs):
        output = model(input)
        loss = loss_fn(output, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print(f"epoch: {epoch}, loss: {loss.item()}")

This training function can be executed with:

train_func()

Now let’s convert this to a distributed multi-worker training function!

First, update the training function code to use PyTorch’s DistributedDataParallel. With RaySGD, you just pass in your distributed data parallel code as you would normally run it with torch.distributed.launch.

from torch.nn.parallel import DistributedDataParallel

def train_func_distributed():
    num_epochs = 3
    model = NeuralNetwork()
    model = DistributedDataParallel(model)
    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(model.parameters(), lr=0.1)

    for epoch in range(num_epochs):
        output = model(input)
        loss = loss_fn(output, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print(f"epoch: {epoch}, loss: {loss.item()}")

Then, instantiate a Trainer that uses a "torch" backend with 4 workers, and use it to run the new training function!

from ray.util.sgd.v2 import Trainer

trainer = Trainer(backend="torch", num_workers=4)
trainer.start()
results = trainer.run(train_func_distributed)
trainer.shutdown()
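
If the training function returns a value, trainer.run is expected to gather one result per worker, which is what the results variable above would hold; this behavior, along with the use_gpu flag shown below, is an assumption to confirm against the Trainer API reference for your Ray version.

from ray.util.sgd.v2 import Trainer

def train_func_returning():
    # Trivial stand-in for train_func_distributed; its return value is
    # assumed to be collected once per worker by trainer.run().
    return 42

trainer = Trainer(backend="torch", num_workers=4, use_gpu=True)  # use_gpu: assumed flag
trainer.start()
results = trainer.run(train_func_returning)
trainer.shutdown()
print(results)  # expected: [42, 42, 42, 42], one entry per worker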

See Porting code to RaySGD for a more comprehensive example.

Next steps: Check out the User Guide!