ray.train.sklearn.SklearnTrainer
- class ray.train.sklearn.SklearnTrainer(*args, **kwargs)
Bases: ray.train.base_trainer.BaseTrainer
A Trainer for scikit-learn estimator training.
This Trainer runs the fit method of the given estimator in a non-distributed manner on a single Ray Actor.

By default, the n_jobs (or thread_count) estimator parameters will be set to match the number of CPUs assigned to the Ray Actor. This behavior can be disabled by setting set_estimator_cpus=False.

If you wish to use GPU-enabled estimators (e.g. cuML), make sure to set "GPU": 1 in scaling_config.trainer_resources.

The results are reported all at once and not in an iterative fashion. No checkpointing is done during training. This may be changed in the future.
Example:
import ray
from ray.train.sklearn import SklearnTrainer
from sklearn.ensemble import RandomForestRegressor

train_dataset = ray.data.from_items(
    [{"x": x, "y": x + 1} for x in range(32)])
trainer = SklearnTrainer(
    estimator=RandomForestRegressor(),
    label_column="y",
    scaling_config=ray.air.config.ScalingConfig(
        trainer_resources={"CPU": 4}
    ),
    datasets={"train": train_dataset}
)
result = trainer.fit()
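For GPU-enabled estimators, the same pattern applies with a GPU added to trainer_resources and CPU pinning disabled. A minimal sketch, assuming cuML is installed and that cuml.ensemble.RandomForestRegressor is used as the GPU-enabled, scikit-learn compatible estimator:

import ray
from ray.train.sklearn import SklearnTrainer
from cuml.ensemble import RandomForestRegressor  # assumed GPU estimator

train_dataset = ray.data.from_items(
    [{"x": x, "y": x + 1} for x in range(32)])
trainer = SklearnTrainer(
    estimator=RandomForestRegressor(),
    label_column="y",
    # Assign a GPU to the single Ray Actor that runs fit().
    scaling_config=ray.air.config.ScalingConfig(
        trainer_resources={"CPU": 4, "GPU": 1}
    ),
    # cuML estimators expose no n_jobs/thread_count, so skip CPU pinning.
    set_estimator_cpus=False,
    datasets={"train": train_dataset},
)
result = trainer.fit()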
- Parameters
estimator – A scikit-learn compatible estimator to use.
datasets – Ray Datasets to use for training and validation. Must include a "train" key denoting the training dataset. If a preprocessor is provided and has not already been fit, it will be fit on the training dataset. All datasets will be transformed by the preprocessor if one is provided. All non-training datasets will be used as separate validation sets, each reporting separate metrics.
label_column – Name of the label column. A column with this name must be present in the training dataset. If None, no validation will be performed.
params – Optional dict of params to be set on the estimator before fitting. Useful for hyperparameter tuning.
scoring – Strategy to evaluate the performance of the model on the validation sets and for cross-validation. Same as in sklearn.model_selection.cross_validate. If scoring represents a single score, one can use:
a single string;
a callable that returns a single value.
If scoring represents multiple scores, one can use:
a list or tuple of unique strings;
a callable returning a dictionary where the keys are the metric names and the values are the metric scores;
a dictionary with metric names as keys and callables as values.
cv – Determines the cross-validation splitting strategy. If specified, cross-validation will be run on the train dataset, in addition to computing metrics for validation datasets (see the sketch after this parameter list). Same as in sklearn.model_selection.cross_validate, with the exception of None. Possible inputs for cv are:
None, to skip cross-validation;
int, to specify the number of folds in a (Stratified)KFold;
a CV splitter;
an iterable yielding (train, test) splits as arrays of indices.
For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with shuffle=False so the splits will be the same across calls.
If you provide a "cv_groups" column in the train dataset, it will be used as group labels for the samples used while splitting the dataset into train/test set. Only used in conjunction with a "Group" cv instance (e.g., GroupKFold). This corresponds to the groups argument in sklearn.model_selection.cross_validate.
return_train_score_cv – Whether to also return train scores during cross-validation. Ignored if cv is None.
parallelize_cv – If set to True, will parallelize cross-validation instead of the estimator. If set to None, will detect if the estimator has any parallelism-related params (n_jobs or thread_count) and parallelize cross-validation if there are none. If False, will not parallelize cross-validation. Cannot be set to True if there are any GPUs assigned to the trainer. Ignored if cv is None.
set_estimator_cpus – If set to True, will automatically set the values of all n_jobs and thread_count parameters in the estimator (including in nested objects) to match the number of available CPUs.
scaling_config – Configuration for how to scale training. Only the trainer_resources key can be provided, as the training is not distributed.
run_config – Configuration for the execution of the training run.
preprocessor – A ray.data.Preprocessor to preprocess the provided datasets.
**fit_params – Additional kwargs passed to the estimator.fit() method.
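To make the cross-validation options above concrete, here is a minimal sketch; the fold count, metric names, and dataset are illustrative choices, not defaults:

import ray
from ray.train.sklearn import SklearnTrainer
from sklearn.ensemble import RandomForestClassifier

# Binary labels, so the int cv input selects StratifiedKFold.
train_dataset = ray.data.from_items(
    [{"x": x, "y": x % 2} for x in range(32)])
trainer = SklearnTrainer(
    estimator=RandomForestClassifier(),
    label_column="y",
    scoring=["accuracy", "f1"],  # multiple scores as a list of strings
    cv=5,  # 5-fold cross-validation on the "train" dataset
    return_train_score_cv=True,  # also report train fold scores
    scaling_config=ray.air.config.ScalingConfig(
        trainer_resources={"CPU": 4}
    ),
    datasets={"train": train_dataset},
)
result = trainer.fit()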
PublicAPI (alpha): This API is in alpha and may change before becoming stable.
- training_loop() → None
Loop called by fit() to run training and report results to Tune.
Note
This method runs on a remote process.
self.datasets have already been preprocessed by self.preprocessor.

You can use the Tune Function API functions (session.report() and session.get_checkpoint()) inside this training loop.

Example:
from ray.air import session
from ray.train.trainer import BaseTrainer


class MyTrainer(BaseTrainer):
    def training_loop(self):
        for epoch_idx in range(5):
            ...
            session.report({"epoch": epoch_idx})
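The note above also mentions session.get_checkpoint(). Below is a hedged sketch of how a training loop might resume from a checkpoint; the MyResumableTrainer name and the "epoch" key are illustrative conventions for this sketch, not part of the API:

from ray.air import session
from ray.train.trainer import BaseTrainer


class MyResumableTrainer(BaseTrainer):
    def training_loop(self):
        start_epoch = 0
        checkpoint = session.get_checkpoint()
        if checkpoint:
            # Resume from the last reported epoch, if one was stored.
            start_epoch = checkpoint.to_dict().get("epoch", 0)
        for epoch_idx in range(start_epoch, 5):
            ...
            session.report({"epoch": epoch_idx})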