Distributed Scikit-learn / Joblib#

Ray supports running distributed scikit-learn programs by implementing a Ray backend for joblib using Ray Actors instead of local processes. This makes it easy to scale existing applications that use scikit-learn from a single node to a cluster.


This API is new and may be revised in future Ray releases. If you encounter any bugs, please file an issue on GitHub.


To get started, first install Ray, then use from ray.util.joblib import register_ray and run register_ray(). This will register Ray as a joblib backend for scikit-learn to use. Then run your original scikit-learn code inside with joblib.parallel_backend('ray'). This will start a local Ray cluster. See the Run on a Cluster section below for instructions to run on a multi-node Ray cluster instead.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC
digits = load_digits()
param_space = {
    'C': np.logspace(-6, 6, 30),
    'gamma': np.logspace(-8, 8, 30),
    'tol': np.logspace(-4, -1, 30),
    'class_weight': [None, 'balanced'],
model = SVC(kernel='rbf')
search = RandomizedSearchCV(model, param_space, cv=5, n_iter=300, verbose=10)

import joblib
from ray.util.joblib import register_ray
with joblib.parallel_backend('ray'):
    search.fit(digits.data, digits.target)

You can also set the ray_remote_args argument in parallel_backend to configure the Ray Actors making up the Pool. This can be used to eg. assign resources to Actors, such as GPUs.

# Allows to use GPU-enabled estimators, such as cuML
with joblib.parallel_backend('ray', ray_remote_args=dict(num_gpus=1)):
    search.fit(digits.data, digits.target)

Run on a Cluster#

This section assumes that you have a running Ray cluster. To start a Ray cluster, please refer to the cluster setup instructions.

To connect a scikit-learn to a running Ray cluster, you have to specify the address of the head node by setting the RAY_ADDRESS environment variable.

You can also start Ray manually by calling ray.init() (with any of its supported configuration options) before calling with joblib.parallel_backend('ray').


If you do not set the RAY_ADDRESS environment variable and do not provide address in ray.init(address=<address>) then scikit-learn will run on a SINGLE node!