ray.train.sklearn.SklearnTrainer

class ray.train.sklearn.SklearnTrainer(*args, **kwargs)

Bases: ray.train.base_trainer.BaseTrainer

A Trainer for scikit-learn estimator training.

This Trainer runs the fit method of the given estimator in a non-distributed manner on a single Ray Actor.

By default, the n_jobs (or thread_count) estimator parameters will be set to match the number of CPUs assigned to the Ray Actor. This behavior can be disabled by setting set_estimator_cpus=False.

If you wish to use GPU-enabled estimators (e.g. cuML), make sure to set "GPU": 1 in scaling_config.trainer_resources.
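
As a rough sketch, the GPU reservation looks like this (pair it with a GPU-enabled estimator such as one from cuML, which is not shown here):

import ray

# Reserve one GPU (plus CPUs) for the single training Actor.
scaling_config = ray.air.config.ScalingConfig(
    trainer_resources={"CPU": 4, "GPU": 1}
)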

The results are reported all at once and not in an iterative fashion. No checkpointing is done during training. This may be changed in the future.

Example:

import ray
from ray.train.sklearn import SklearnTrainer
from sklearn.ensemble import RandomForestRegressor

# A simple regression dataset: y = x + 1.
train_dataset = ray.data.from_items(
    [{"x": x, "y": x + 1} for x in range(32)])
trainer = SklearnTrainer(
    estimator=RandomForestRegressor(),
    label_column="y",
    # Training is not distributed; these resources go to the single Actor.
    scaling_config=ray.air.config.ScalingConfig(
        trainer_resources={"CPU": 4}
    ),
    datasets={"train": train_dataset},
)
result = trainer.fit()
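
The returned Result object carries whatever was reported; a quick way to inspect the run above (result.metrics and result.checkpoint are standard Result attributes):

print(result.metrics)     # e.g. fit time and any validation scores
print(result.checkpoint)  # checkpoint holding the fitted estimator
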
Parameters
  • estimator – A scikit-learn compatible estimator to use.

  • datasets – Ray Datasets to use for training and validation. Must include a "train" key denoting the training dataset. If a preprocessor is provided and has not already been fit, it will be fit on the training dataset. All datasets will be transformed by the preprocessor if one is provided. All non-training datasets will be used as separate validation sets, each reporting separate metrics.

  • label_column – Name of the label column. A column with this name must be present in the training dataset. If None, no validation will be performed.

  • params – Optional dict of params to be set on the estimator before fitting. Useful for hyperparameter tuning.

  • scoring –

    Strategy to evaluate the performance of the model on the validation sets and for cross-validation. Same as in sklearn.model_selection.cross_validate. If scoring represents a single score, one can use:

    • a single string;

    • a callable that returns a single value.

    If scoring represents multiple scores, one can use:

    • a list or tuple of unique strings;

    • a callable returning a dictionary where the keys are the metric names and the values are the metric scores;

    • a dictionary with metric names as keys and callables as values.

  • cv –

    Determines the cross-validation splitting strategy. If specified, cross-validation will be run on the train dataset, in addition to computing metrics for validation datasets. Same as in sklearn.model_selection.cross_validate, with the exception of None. Possible inputs for cv are:

    • None, to skip cross-validation.

    • int, to specify the number of folds in a (Stratified)KFold.

    • A CV splitter.

    • An iterable yielding (train, test) splits as arrays of indices.

    For int inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with shuffle=False, so the splits will be the same across calls.

    If you provide a "cv_groups" column in the train dataset, it will be used as group labels for the samples while splitting the dataset into train/test sets. Only used in conjunction with a "Group" cv instance (e.g., GroupKFold). This corresponds to the groups argument in sklearn.model_selection.cross_validate. A combined params/scoring/cv sketch is shown after this parameter list.

  • return_train_score_cv – Whether to also return train scores during cross-validation. Ignored if cv is None.

  • parallelize_cv – If set to True, will parallelize cross-validation instead of the estimator. If set to None, will detect if the estimator has any parallelism-related params (n_jobs or thread_count) and parallelize cross-validation if there are none. If False, will not parallelize cross-validation. Cannot be set to True if there are any GPUs assigned to the trainer. Ignored if cv is None.

  • set_estimator_cpus – If set to True, will automatically set the values of all n_jobs and thread_count parameters in the estimator (including in nested objects) to match the number of available CPUs.

  • scaling_config – Configuration for how to scale training. Only the trainer_resources key can be provided, as the training is not distributed.

  • run_config – Configuration for the execution of the training run.

  • preprocessor – A ray.data.Preprocessor to preprocess the provided datasets.

  • **fit_params – Additional kwargs passed to estimator.fit() method.
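
As a combined sketch of how params, scoring, and cv fit together (the parameter values here are arbitrary and the scorer names are standard scikit-learn scorer strings):

import ray
from ray.train.sklearn import SklearnTrainer
from sklearn.ensemble import RandomForestRegressor

train_dataset = ray.data.from_items(
    [{"x": x, "y": x + 1} for x in range(32)])
trainer = SklearnTrainer(
    estimator=RandomForestRegressor(),
    label_column="y",
    # Hyperparameters set on the estimator before fitting.
    params={"n_estimators": 20, "max_depth": 4},
    # Two named metrics, computed for each validation set and CV fold.
    scoring=["neg_mean_absolute_error", "neg_mean_squared_error"],
    # 5-fold cross-validation on the "train" dataset.
    cv=5,
    scaling_config=ray.air.config.ScalingConfig(
        trainer_resources={"CPU": 4}
    ),
    datasets={"train": train_dataset},
)
result = trainer.fit()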

PublicAPI (alpha): This API is in alpha and may change before becoming stable.

training_loop() -> None

Loop called by fit() to run training and report results to Tune.

Note

This method runs on a remote process.

self.datasets have already been preprocessed by self.preprocessor.

You can use the Tune Function API (session.report() and session.get_checkpoint()) inside this training loop.

Example:

from ray.air import session
from ray.train.trainer import BaseTrainer

class MyTrainer(BaseTrainer):
    def training_loop(self):
        for epoch_idx in range(5):
            ...
            # Report intermediate results back to Tune.
            session.report({"epoch": epoch_idx})