External library integrations (tune.integration)

Docker (tune.integration.docker)

ray.tune.integration.docker.DockerSyncer(local_dir: str, remote_dir: str, sync_client: Optional[ray.tune.sync_client.SyncClient] = None)[source]

DockerSyncer used for synchronization between Docker containers. This syncer extends the node syncer, but is usually instantiated without a custom sync client. The sync client defaults to DockerSyncClient instead.

Set the env var TUNE_SYNCER_VERBOSITY to increase verbosity of syncing operations (0, 1, 2, 3). Defaults to 0.

Note

This syncer only works with the Ray cluster launcher. If you use your own Docker setup, make sure the nodes can connect to each other via SSH, and try the regular SSH-based syncer instead.

Example:

from ray.tune.integration.docker import DockerSyncer
tune.run(train,
         sync_config=tune.SyncConfig(
             sync_to_driver=DockerSyncer))

Keras (tune.integration.keras)

class ray.tune.integration.keras.TuneReportCallback(metrics: Union[None, str, List[str], Dict[str, str]] = None, on: Union[str, List[str]] = 'epoch_end')[source]

Keras to Ray Tune reporting callback

Reports metrics to Ray Tune.

Parameters
  • metrics (str|list|dict) – Metrics to report to Tune. If this is a list, each item describes the metric key reported to Keras, and it will reported under the same name to Tune. If this is a dict, each key will be the name reported to Tune and the respective value will be the metric key reported to Keras. If this is None, all Keras logs will be reported.

  • on (str|list) – When to trigger checkpoint creations. Must be one of the Keras event hooks (less the on_), e.g. “train_start”, or “predict_end”. Defaults to “epoch_end”.

Example:

from ray.tune.integration.keras import TuneReportCallback

# Report accuracy to Tune after each epoch:
model.fit(
    x_train,
    y_train,
    batch_size=batch_size,
    epochs=epochs,
    verbose=0,
    validation_data=(x_test, y_test),
    callbacks=[TuneReportCallback(
        {"mean_accuracy": "accuracy"}, on="epoch_end")])
class ray.tune.integration.keras.TuneReportCheckpointCallback(metrics: Union[None, str, List[str], Dict[str, str]] = None, filename: str = 'checkpoint', frequency: Union[int, List[int]] = 1, on: Union[str, List[str]] = 'epoch_end')[source]

Keras report and checkpoint callback

Saves checkpoints after each validation step. Also reports metrics to Tune, which is needed for checkpoint registration.

Use this callback to register saved checkpoints with Ray Tune. This means that checkpoints will be manages by the CheckpointManager and can be used for advanced scheduling and search algorithms, like Population Based Training.

The tf.keras.callbacks.ModelCheckpoint callback also saves checkpoints, but doesn’t register them with Ray Tune.

Parameters
  • metrics (str|list|dict) – Metrics to report to Tune. If this is a list, each item describes the metric key reported to Keras, and it will reported under the same name to Tune. If this is a dict, each key will be the name reported to Tune and the respective value will be the metric key reported to Keras. If this is None, all Keras logs will be reported.

  • filename (str) – Filename of the checkpoint within the checkpoint directory. Defaults to “checkpoint”.

  • frequency (int|list) – Checkpoint frequency. If this is an integer n, checkpoints are saved every n times each hook was called. If this is a list, it specifies the checkpoint frequencies for each hook individually.

  • on (str|list) – When to trigger checkpoint creations. Must be one of the Keras event hooks (less the on_), e.g. “train_start”, or “predict_end”. Defaults to “epoch_end”.

Example:

from ray.tune.integration.keras import TuneReportCheckpointCallback

# Save checkpoint and report accuracy to Tune after each epoch:
model.fit(
    x_train,
    y_train,
    batch_size=batch_size,
    epochs=epochs,
    verbose=0,
    validation_data=(x_test, y_test),
    callbacks=[TuneReportCheckpointCallback(
        metrics={"mean_accuracy": "accuracy"},
        filename="model",
        on="epoch_end")])

Kubernetes (tune.integration.kubernetes)

ray.tune.integration.kubernetes.NamespacedKubernetesSyncer(namespace)[source]

Wrapper to return a KubernetesSyncer for a Kubernetes namespace.

Parameters

namespace (str) – Kubernetes namespace.

Returns: A KubernetesSyncer class to be passed to tune.run().

Example:

from ray.tune.integration.kubernetes import NamespacedKubernetesSyncer
tune.run(train,
         sync_config=tune.SyncConfig(
             sync_to_driver=NamespacedKubernetesSyncer("ray")))

MXNet (tune.integration.mxnet)

class ray.tune.integration.mxnet.TuneReportCallback(metrics: Union[None, str, List[str], Dict[str, str]] = None)[source]

MXNet to Ray Tune reporting callback

Reports metrics to Ray Tune.

This has to be passed to MXNet as the eval_end_callback.

Parameters

metrics (str|list|dict) – Metrics to report to Tune. If this is a list, each item describes the metric key reported to MXNet, and it will reported under the same name to Tune. If this is a dict, each key will be the name reported to Tune and the respective value will be the metric key reported to MXNet.

Example:

from ray.tune.integration.mxnet import TuneReportCallback

# mlp_model is a MXNet model
mlp_model.fit(
    train_iter,
    # ...
    eval_metric="acc",
    eval_end_callback=TuneReportCallback({
        "mean_accuracy": "accuracy"
    }))
class ray.tune.integration.mxnet.TuneCheckpointCallback(filename: str = 'checkpoint', frequency: int = 1)[source]

MXNet checkpoint callback

Saves checkpoints after each epoch.

This has to be passed to the epoch_end_callback of the MXNet model.

Checkpoint are currently not registered if no tune.report() call is made afterwards. You have to use this in conjunction with the TuneReportCallback to work!

Parameters
  • filename (str) – Filename of the checkpoint within the checkpoint directory. Defaults to “checkpoint”.

  • frequency (int) – Integer indicating how often checkpoints should be saved.

Example:

from ray.tune.integration.mxnet import TuneReportCallback,             TuneCheckpointCallback

# mlp_model is a MXNet model
mlp_model.fit(
    train_iter,
    # ...
    eval_metric="acc",
    eval_end_callback=TuneReportCallback({
        "mean_accuracy": "accuracy"
    }),
    epoch_end_callback=TuneCheckpointCallback(
        filename="mxnet_cp",
        frequency=3
    ))

PyTorch Lightning (tune.integration.pytorch_lightning)

class ray.tune.integration.pytorch_lightning.TuneReportCallback(metrics: Union[None, str, List[str], Dict[str, str]] = None, on: Union[str, List[str]] = 'validation_end')[source]

PyTorch Lightning to Ray Tune reporting callback

Reports metrics to Ray Tune.

Parameters
  • metrics (str|list|dict) – Metrics to report to Tune. If this is a list, each item describes the metric key reported to PyTorch Lightning, and it will reported under the same name to Tune. If this is a dict, each key will be the name reported to Tune and the respective value will be the metric key reported to PyTorch Lightning.

  • on (str|list) – When to trigger checkpoint creations. Must be one of the PyTorch Lightning event hooks (less the on_), e.g. “batch_start”, or “train_end”. Defaults to “validation_end”.

Example:

import pytorch_lightning as pl
from ray.tune.integration.pytorch_lightning import TuneReportCallback

# Report loss and accuracy to Tune after each validation epoch:
trainer = pl.Trainer(callbacks=[TuneReportCallback(
        ["val_loss", "val_acc"], on="validation_end")])

# Same as above, but report as `loss` and `mean_accuracy`:
trainer = pl.Trainer(callbacks=[TuneReportCallback(
        {"loss": "val_loss", "mean_accuracy": "val_acc"},
        on="validation_end")])
class ray.tune.integration.pytorch_lightning.TuneReportCheckpointCallback(metrics: Union[None, str, List[str], Dict[str, str]] = None, filename: str = 'checkpoint', on: Union[str, List[str]] = 'validation_end')[source]

PyTorch Lightning report and checkpoint callback

Saves checkpoints after each validation step. Also reports metrics to Tune, which is needed for checkpoint registration.

Parameters
  • metrics (str|list|dict) – Metrics to report to Tune. If this is a list, each item describes the metric key reported to PyTorch Lightning, and it will reported under the same name to Tune. If this is a dict, each key will be the name reported to Tune and the respective value will be the metric key reported to PyTorch Lightning.

  • filename (str) – Filename of the checkpoint within the checkpoint directory. Defaults to “checkpoint”.

  • on (str|list) – When to trigger checkpoint creations. Must be one of the PyTorch Lightning event hooks (less the on_), e.g. “batch_start”, or “train_end”. Defaults to “validation_end”.

Example:

import pytorch_lightning as pl
from ray.tune.integration.pytorch_lightning import (
    TuneReportCheckpointCallback)

# Save checkpoint after each training batch and after each
# validation epoch.
trainer = pl.Trainer(callbacks=[TuneReportCheckpointCallback(
    metrics={"loss": "val_loss", "mean_accuracy": "val_acc"},
    filename="trainer.ckpt", on="validation_end")])

Torch (tune.integration.torch)

ray.tune.integration.torch.DistributedTrainableCreator(func: Callable, num_workers: int = 1, num_cpus_per_worker: int = 1, num_gpus_per_worker: int = 0, num_workers_per_host: Optional[int] = None, backend: str = 'gloo', timeout_s: int = 10, use_gpu=None) → Type[ray.tune.integration.torch._TorchTrainable][source]

Creates a class that executes distributed training.

Similar to running torch.distributed.launch.

Note that you typically should not instantiate the object created.

Parameters
  • func (callable) – This function is a Tune trainable function. This function must have 2 args in the signature, and the latter arg must contain checkpoint_dir. For example: func(config, checkpoint_dir=None).

  • num_workers (int) – Number of training workers to include in world.

  • num_cpus_per_worker (int) – Number of CPU resources to reserve per training worker.

  • num_gpus_per_worker (int) – Number of GPU resources to reserve per training worker.

  • num_workers_per_host – Optional[int]: Number of workers to colocate per host.

  • backend (str) – One of “gloo”, “nccl”.

  • timeout_s (float) – Seconds before the torch process group times out. Useful when machines are unreliable. Defaults to 60 seconds. This value is also reused for triggering placement timeouts if forcing colocation.

Returns

A trainable class object that can be passed to Tune. Resources are automatically set within the object, so users do not need to set resources_per_trainable.

Return type

type(Trainable)

Example:

trainable_cls = DistributedTrainableCreator(
    train_func, num_workers=2)
analysis = tune.run(trainable_cls)
ray.tune.integration.torch.distributed_checkpoint_dir(step: int, disable: bool = False) → Generator[str, None, None][source]

ContextManager for creating a distributed checkpoint.

Only checkpoints a file on the “main” training actor, avoiding redundant work.

Parameters
  • step (int) – Used to label the checkpoint

  • disable (bool) – Disable for prototyping.

Yields

str – A path to a directory. This path will be used again when invoking the training_function.

Example:

def train_func(config, checkpoint_dir):
    if checkpoint_dir:
        path = os.path.join(checkpoint_dir, "checkpoint")
        model_state_dict = torch.load(path)

    if epoch % 3 == 0:
        with distributed_checkpoint_dir(step=epoch) as checkpoint_dir:
            path = os.path.join(checkpoint_dir, "checkpoint")
            torch.save(model.state_dict(), path)
ray.tune.integration.torch.is_distributed_trainable()[source]

Returns True if executing within a DistributedTrainable.

Horovod (tune.integration.horovod)

ray.tune.integration.horovod.DistributedTrainableCreator(func: Callable, use_gpu: bool = False, num_hosts: int = 1, num_slots: int = 1, num_cpus_per_slot: int = 1, timeout_s: int = 30, replicate_pem: bool = False) → Type[ray.tune.integration.horovod._HorovodTrainable][source]

Converts Horovod functions to be executable by Tune.

Requires horovod > 0.19 to work.

This function wraps and sets the resources for a given Horovod function to be used with Tune. It generates a Horovod Trainable (trial) which can itself be a distributed training job. One basic assumption of this implementation is that all sub-workers of a trial will be placed evenly across different machines.

It is recommended that if num_hosts per trial > 1, you set num_slots == the size (or number of GPUs) of a single host. If num_hosts == 1, then you can set num_slots to be <= the size (number of GPUs) of a single host.

This above assumption can be relaxed - please file a feature request on Github to inform the maintainers.

Another assumption is that this API requires gloo as the underlying communication primitive. You will need to install Horovod with HOROVOD_WITH_GLOO enabled.

Fault Tolerance: The trial workers themselves are not fault tolerant. When a host of a trial fails, all workers of a trial are expected to die, and the trial is expected to restart. This currently does not support function checkpointing.

Parameters
  • func (Callable[[dict], None]) – A training function that takes in a config dict for hyperparameters and should initialize horovod via horovod.init.

  • use_gpu (bool) –

  • num_cpus_per_slot (int) – Number of CPUs to request from Ray per worker.

  • num_hosts (int) – Number of hosts that each trial is expected to use.

  • num_slots (int) – Number of slots (workers) to start on each host.

  • timeout_s (int) – Seconds for Horovod rendezvous to timeout.

  • replicate_pem (bool) – THIS MAY BE INSECURE. If true, this will replicate the underlying Ray cluster ssh key across all hosts. This may be useful if using the Ray Autoscaler.

Returns

Trainable class that can be passed into tune.run.

Example:

def train(config):
    horovod.init()
    horovod.allreduce()

from ray.tune.integration.horovod import DistributedTrainableCreator
trainable_cls = DistributedTrainableCreator(
    train, num_hosts=1, num_slots=2, use_gpu=True)

tune.run(trainable_cls)

New in version 1.0.0.

Weights and Biases (tune.integration.wandb)

See also here.

class ray.tune.integration.wandb.WandbLoggerCallback(project: str, group: Optional[str] = None, api_key_file: Optional[str] = None, api_key: Optional[str] = None, excludes: Optional[List[str]] = None, log_config: bool = False, **kwargs)[source]

Weights and biases (https://www.wandb.com/) is a tool for experiment tracking, model optimization, and dataset versioning. This Ray Tune LoggerCallback sends metrics to Wandb for automatic tracking and visualization.

Parameters
  • project (str) – Name of the Wandb project. Mandatory.

  • group (str) – Name of the Wandb group. Defaults to the trainable name.

  • api_key_file (str) – Path to file containing the Wandb API KEY. This file only needs to be present on the node running the Tune script if using the WandbLogger.

  • api_key (str) – Wandb API Key. Alternative to setting api_key_file.

  • excludes (list) – List of metrics that should be excluded from the log.

  • log_config (bool) – Boolean indicating if the config parameter of the results dict should be logged. This makes sense if parameters will change during training, e.g. with PopulationBasedTraining. Defaults to False.

  • **kwargs – The keyword arguments will be pased to wandb.init().

Wandb’s group, run_id and run_name are automatically selected by Tune, but can be overwritten by filling out the respective configuration values.

Please see here for all other valid configuration settings: https://docs.wandb.com/library/init

Example:

from ray.tune.logger import DEFAULT_LOGGERS
from ray.tune.integration.wandb import WandbLoggerCallback
tune.run(
    train_fn,
    config={
        # define search space here
        "parameter_1": tune.choice([1, 2, 3]),
        "parameter_2": tune.choice([4, 5, 6]),
    },
    callbacks=[WandbLoggerCallback(
        project="Optimization_Project",
        api_key_file="/path/to/file",
        log_config=True)])
ray.tune.integration.wandb.wandb_mixin(func: Callable)[source]

Weights and biases (https://www.wandb.com/) is a tool for experiment tracking, model optimization, and dataset versioning. This Ray Tune Trainable mixin helps initializing the Wandb API for use with the Trainable class or with @wandb_mixin for the function API.

For basic usage, just prepend your training function with the @wandb_mixin decorator:

from ray.tune.integration.wandb import wandb_mixin

@wandb_mixin
def train_fn(config):
    wandb.log()

Wandb configuration is done by passing a wandb key to the config parameter of tune.run() (see example below).

The content of the wandb config entry is passed to wandb.init() as keyword arguments. The exception are the following settings, which are used to configure the WandbTrainableMixin itself:

Parameters
  • api_key_file (str) – Path to file containing the Wandb API KEY. This file must be on all nodes if using the wandb_mixin.

  • api_key (str) – Wandb API Key. Alternative to setting api_key_file.

Wandb’s group, run_id and run_name are automatically selected by Tune, but can be overwritten by filling out the respective configuration values.

Please see here for all other valid configuration settings: https://docs.wandb.com/library/init

Example:

from ray import tune
from ray.tune.integration.wandb import wandb_mixin

@wandb_mixin
def train_fn(config):
    for i in range(10):
        loss = self.config["a"] + self.config["b"]
        wandb.log({"loss": loss})
    tune.report(loss=loss, done=True)

tune.run(
    train_fn,
    config={
        # define search space here
        "a": tune.choice([1, 2, 3]),
        "b": tune.choice([4, 5, 6]),
        # wandb configuration
        "wandb": {
            "project": "Optimization_Project",
            "api_key_file": "/path/to/file"
        }
    })

XGBoost (tune.integration.xgboost)

class ray.tune.integration.xgboost.TuneReportCallback(metrics: Union[None, str, List[str], Dict[str, str]] = None)[source]

XGBoost to Ray Tune reporting callback

Reports metrics to Ray Tune.

Parameters

metrics (str|list|dict) – Metrics to report to Tune. If this is a list, each item describes the metric key reported to XGBoost, and it will reported under the same name to Tune. If this is a dict, each key will be the name reported to Tune and the respective value will be the metric key reported to XGBoost. If this is None, all metrics will be reported to Tune under their default names as obtained from XGBoost.

Example:

import xgboost
from ray.tune.integration.xgboost import TuneReportCallback

config = {
    # ...
    "eval_metric": ["auc", "logloss"]
}

# Report only log loss to Tune after each validation epoch:
bst = xgb.train(
    config,
    train_set,
    evals=[(test_set, "eval")],
    verbose_eval=False,
    callbacks=[TuneReportCallback({"loss": "eval-logloss"})])
class ray.tune.integration.xgboost.TuneReportCheckpointCallback(metrics: Union[None, str, List[str], Dict[str, str]] = None, filename: str = 'checkpoint')[source]

XGBoost report and checkpoint callback

Saves checkpoints after each validation step. Also reports metrics to Tune, which is needed for checkpoint registration.

Parameters
  • metrics (str|list|dict) – Metrics to report to Tune. If this is a list, each item describes the metric key reported to XGBoost, and it will reported under the same name to Tune. If this is a dict, each key will be the name reported to Tune and the respective value will be the metric key reported to XGBoost.

  • filename (str) – Filename of the checkpoint within the checkpoint directory. Defaults to “checkpoint”. If this is None, all metrics will be reported to Tune under their default names as obtained from XGBoost.

Example:

import xgboost
from ray.tune.integration.xgboost import TuneReportCheckpointCallback

config = {
    # ...
    "eval_metric": ["auc", "logloss"]
}

# Report only log loss to Tune after each validation epoch.
# Save model as `xgboost.mdl`.
bst = xgb.train(
    config,
    train_set,
    evals=[(test_set, "eval")],
    verbose_eval=False,
    callbacks=[TuneReportCheckpointCallback(
        {"loss": "eval-logloss"}, "xgboost.mdl)])