Distributed Scikit-learn / Joblib#
Ray supports running distributed scikit-learn programs by implementing a Ray backend for joblib using Ray Actors instead of local processes. This makes it easy to scale existing applications that use scikit-learn from a single node to a cluster.
Note
This API is new and may be revised in future Ray releases. If you encounter any bugs, please file an issue on GitHub.
Quickstart#
To get started, first install Ray, then use from ray.util.joblib import register_ray and run register_ray(). This will register Ray as a joblib backend for scikit-learn to use. Then run your original scikit-learn code inside with joblib.parallel_backend('ray'). This will start a local Ray cluster. See the Run on a Cluster section below for instructions on running on a multi-node Ray cluster instead.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

digits = load_digits()
param_space = {
    'C': np.logspace(-6, 6, 30),
    'gamma': np.logspace(-8, 8, 30),
    'tol': np.logspace(-4, -1, 30),
    'class_weight': [None, 'balanced'],
}

model = SVC(kernel='rbf')
search = RandomizedSearchCV(model, param_space, cv=5, n_iter=300, verbose=10)

import joblib
from ray.util.joblib import register_ray

# Register Ray as a joblib backend
register_ray()

# All joblib-parallelized work inside this block runs on Ray Actors
with joblib.parallel_backend('ray'):
    search.fit(digits.data, digits.target)
You can also set the ray_remote_args argument in parallel_backend to configure the Ray Actors making up the Pool. This can be used, for example, to assign resources to Actors, such as GPUs.
# Allows using GPU-enabled estimators, such as cuML
with joblib.parallel_backend('ray', ray_remote_args=dict(num_gpus=1)):
    search.fit(digits.data, digits.target)
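Since ray_remote_args configures the Actors making up the Pool, other standard Ray resource options can be passed the same way. As a sketch, reserving CPUs per Actor (the number here is illustrative, not a recommendation):

# Reserve 2 CPUs for each Actor in the Pool
with joblib.parallel_backend('ray', ray_remote_args=dict(num_cpus=2)):
    search.fit(digits.data, digits.target)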
Run on a Cluster#
This section assumes that you have a running Ray cluster. To start a Ray cluster, please refer to the cluster setup instructions.
To connect a scikit-learn program to a running Ray cluster, you have to specify the address of the head node by setting the RAY_ADDRESS environment variable. You can also start Ray manually by calling ray.init() (with any of its supported configuration options) before calling with joblib.parallel_backend('ray').
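As a minimal sketch of connecting explicitly before entering the joblib backend (reusing the search object from the Quickstart; "auto" attaches to an already-running cluster, and the commented address is a placeholder for your own head node):

import joblib
import ray
from ray.util.joblib import register_ray

# Connect to the running cluster; replace "auto" with your head node's
# address if needed, e.g. "ray://<head_node_host>:10001"
ray.init(address="auto")
register_ray()

with joblib.parallel_backend('ray'):
    search.fit(digits.data, digits.target)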
Warning
If you do not set the RAY_ADDRESS environment variable and do not provide the address in ray.init(address=<address>), then scikit-learn will run on a SINGLE node!
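To sanity-check the connection, you can inspect the cluster state after calling ray.init(); ray.nodes() and ray.cluster_resources() are standard Ray APIs (this sketch assumes a cluster is already running):

import ray

ray.init(address="auto")  # connect to the running cluster

# One entry per node; more than one confirms a multi-node cluster
print(len(ray.nodes()))

# Aggregate CPU/GPU resources across the whole cluster
print(ray.cluster_resources())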