Inspecting Training Results#
The return value of trainer.fit() is a Result object.
The Result object contains, among other information:
- The last reported checkpoint (to load the model) and its attached metrics
- Error messages, if any errors occurred
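A Result is returned by any Ray Train trainer. A minimal sketch, assuming a TorchTrainer with a hypothetical train_func and two workers:
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func():
    ...  # training loop that reports metrics and checkpoints

trainer = TorchTrainer(train_func, scaling_config=ScalingConfig(num_workers=2))
result = trainer.fit()  # result is a ray.train.Result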
Viewing metrics#
You can retrieve reported metrics that were attached to a checkpoint from the Result object.
Common metrics include the training or validation loss, or prediction accuracies.
The metrics retrieved from the Result object correspond to those you passed to train.report as an argument in your training function.
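For reference, this is roughly what the reporting side looks like inside the training function. A minimal sketch, assuming a hypothetical training loop that computes a loss each epoch and saves its model state to a temporary directory before reporting:
import tempfile
import ray.train

def train_func():
    for epoch in range(2):
        loss = ...  # compute the training loss for this epoch
        with tempfile.TemporaryDirectory() as tmpdir:
            ...  # save the model state into tmpdir
            ray.train.report(
                metrics={"loss": loss, "epoch": epoch},
                checkpoint=ray.train.Checkpoint.from_directory(tmpdir),
            )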
Note
Persisting free-floating metrics reported via ray.train.report(metrics, checkpoint=None) is deprecated.
This also means that retrieving these metrics from the Result object is deprecated.
Only metrics attached to checkpoints are persisted. See (Deprecated) Reporting free-floating metrics for more details.
Last reported metrics#
Use Result.metrics to retrieve the metrics attached to the last reported checkpoint.
result = trainer.fit()
print("Observed metrics:", result.metrics)
Dataframe of all reported metrics#
Use Result.metrics_dataframe to retrieve a pandas DataFrame of all metrics reported alongside checkpoints.
df = result.metrics_dataframe
print("Minimum loss:", min(df["loss"]))
Retrieving checkpoints#
You can retrieve checkpoints reported to Ray Train from the Result object.
Checkpoints contain all the information that is needed to restore the training state. This usually includes the trained model.
You can use checkpoints for common downstream tasks such as offline batch inference with Ray Data or online model serving with Ray Serve.
The checkpoints retrieved from the Result object correspond to those you passed to train.report as an argument in your training function.
Last saved checkpoint#
Use Result.checkpoint to retrieve the last checkpoint.
print("Last checkpoint:", result.checkpoint)
with result.checkpoint.as_directory() as tmpdir:
# Load model from directory
...
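What "load model from directory" looks like depends on how your training function saved the checkpoint. A minimal sketch, assuming the training function stored a PyTorch state dict under a hypothetical model.pt filename and that MyModel is a placeholder for your own model class:
import os
import torch

with result.checkpoint.as_directory() as tmpdir:
    # "model.pt" is a hypothetical filename; use whatever your training function saved
    state_dict = torch.load(os.path.join(tmpdir, "model.pt"))

model = MyModel()  # placeholder for your own model class
model.load_state_dict(state_dict)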
Other checkpoints#
Sometimes you want to access an earlier checkpoint. For instance, if your loss increased after more training due to overfitting, you may want to retrieve the checkpoint with the lowest loss.
You can retrieve a list of all available checkpoints and their metrics with Result.best_checkpoints.
# Print available checkpoints
for checkpoint, metrics in result.best_checkpoints:
    print("Loss", metrics["loss"], "checkpoint", checkpoint)

# Get checkpoint with minimal loss
best_checkpoint = min(
    result.best_checkpoints, key=lambda item: item[1]["loss"]
)[0]

with best_checkpoint.as_directory() as tmpdir:
    # Load model from directory
    ...
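Which checkpoints are available here, and how many are kept, depends on the checkpoint configuration of the run. A minimal sketch, assuming you want to keep only the two checkpoints with the lowest reported loss:
from ray.train import CheckpointConfig, RunConfig

run_config = RunConfig(
    checkpoint_config=CheckpointConfig(
        num_to_keep=2,
        checkpoint_score_attribute="loss",
        checkpoint_score_order="min",
    ),
)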
See also
See Saving and Loading Checkpoints for more information on checkpointing.
Accessing storage location#
If you need to retrieve the results later, you can get the storage location of the training run with Result.path.
This path will correspond to the storage_path you configured in the RunConfig. It will be a (nested) subdirectory within that path, usually of the form TrainerName_date-string/TrainerName_id_00000_0_....
The result also contains a pyarrow.fs.FileSystem that can be used to access the storage location, which is useful if the path is on cloud storage.
import pyarrow.fs

result_path: str = result.path
result_filesystem: pyarrow.fs.FileSystem = result.filesystem
print(f"Results location (fs, path) = ({result_filesystem}, {result_path})")
Viewing Errors#
If an error occurred during training, Result.error will be set and contain the exception that was raised.
if result.error:
    assert isinstance(result.error, Exception)
    print("Got exception:", result.error)
Finding results on persistent storage#
All training results, including reported metrics, checkpoints, and error files, are stored on the configured persistent storage.
See the persistent storage guide to configure this location for your training run.