Object Fault Tolerance

A Ray object has both data (the value returned when calling ray.get) and metadata (e.g., the location of the value). Data is stored in the Ray object store, while the metadata is stored at the object’s owner. The owner of an object is the worker process that creates the original ObjectRef, e.g., by calling f.remote() or ray.put(). This worker is usually a different process from the one that creates the object’s value; the exception is ray.put, where the calling worker does both.

import ray
import numpy as np

@ray.remote
def large_array():
    return np.zeros(int(1e5))

x = ray.put(1)  # The driver owns x and also creates the value of x.

y = large_array.remote()
# The driver is the owner of y, even though the value may be stored somewhere else.
# If the node that stores the value of y dies, Ray will automatically recover
# it by re-executing the large_array task.
# If the driver dies, anyone still using y will receive an OwnerDiedError.

Ray can automatically recover from data loss but not owner failure.

Recovering from data loss

When an object value is lost from the object store, such as during node failures, Ray first tries to recover it by looking for copies of the same object on other nodes. If none are found, Ray falls back to lineage reconstruction: it re-executes the task that previously created the value. Arguments to the task are recursively reconstructed through the same mechanism.

Lineage reconstruction currently has the following limitations:

  • The object, and any of its transitive dependencies, must have been generated by a task (actor or non-actor). This means that objects created by ray.put are not recoverable.

  • Tasks are assumed to be deterministic and idempotent. Actor tasks can read and modify actor state, so Ray does not assume this of them, and by default objects created by actor tasks are not reconstructable. To allow reconstruction of actor task results, set the max_task_retries parameter to a non-zero value (see actor fault tolerance for more details).

  • Tasks will only be re-executed up to their maximum number of retries. By default, a non-actor task can be retried up to 3 times and an actor task cannot be retried. This can be overridden with the max_retries parameter for remote functions and the max_task_retries parameter for actors.

  • The owner of the object must still be alive (see below).

Lineage reconstruction can cause higher than usual driver memory usage because the driver keeps the descriptions of any tasks that may be re-executed in case of failure. To limit the amount of memory used by lineage, set the environment variable RAY_max_lineage_bytes (default 1GB). Ray evicts lineage once this threshold is exceeded, and objects whose lineage has been evicted can no longer be reconstructed.

To disable lineage reconstruction entirely, set the environment variable RAY_TASK_MAX_RETRIES=0 during ray start or ray.init. With this setting, if there are no copies of an object left, an ObjectLostError will be raised.

Recovering from owner failure

The owner of an object can die because of node or worker process failure. Currently, Ray does not support recovery from owner failure. In this case, Ray will clean up any remaining copies of the object’s value to prevent a memory leak. Any workers that subsequently try to get the object’s value will receive an OwnerDiedError exception, which can be handled manually.

Understanding ObjectLostErrors

Ray throws an ObjectLostError to the application when an object cannot be retrieved due to application or system error. This can occur during a ray.get() call or when fetching a task’s arguments, and can happen for a number of reasons. Here is a guide to understanding the root cause for different error types:

  • OwnerDiedError: The owner of an object, i.e., the Python worker that first created the ObjectRef via .remote() or ray.put(), has died. The owner stores critical object metadata and an object cannot be retrieved if this process is lost.

  • ObjectReconstructionFailedError: This error is thrown if an object, or another object that this object depends on, cannot be reconstructed due to one of the limitations described above.

  • ReferenceCountingAssertionError: The object has already been deleted, so it cannot be retrieved. Ray implements automatic memory management through distributed reference counting, so this error should not happen in general. However, there is a known edge case that can produce this error.

  • ObjectFetchTimedOutError: A node timed out while trying to retrieve a copy of the object from a remote node. This error usually indicates a system-level bug. The timeout period can be configured using the RAY_fetch_fail_timeout_milliseconds environment variable (default 10 minutes).

  • ObjectLostError: The object was successfully created, but no copy is reachable. This is a generic error thrown when lineage reconstruction is disabled and all copies of the object are lost from the cluster.