Object Fault Tolerance
Object Fault Tolerance#
A Ray object has both data (the value returned when calling
metadata (e.g., the location of the value). Data is stored in the Ray object
store while the metadata is stored at the object’s owner. The owner of an
object is the worker process that creates the original
ObjectRef, e.g., by
ray.put(). Note that this worker is usually a
distinct process from the worker that creates the value of the object,
except in cases of
import ray import numpy as np @ray.remote def large_array(): return np.zeros(int(1e5)) x = ray.put(1) # The driver owns x and also creates the value of x. y = large_array.remote() # The driver is the owner of y, even though the value may be stored somewhere else. # If the node that stores the value of y dies, Ray will automatically recover # it by re-executing the large_array task. # If the driver dies, anyone still using y will receive an OwnerDiedError.
Ray can automatically recover from data loss but not owner failure.
Recovering from data loss#
When an object value is lost from the object store, such as during node failures, Ray will use lineage reconstruction to recover the object. Ray will first automatically attempt to recover the value by looking for copies of the same object on other nodes. If none are found, then Ray will automatically recover the value by re-executing the task that previously created the value. Arguments to the task are recursively reconstructed through the same mechanism.
Lineage reconstruction currently has the following limitations:
The object, and any of its transitive dependencies, must have been generated by a task (actor or non-actor). This means that objects created by ray.put are not recoverable.
Tasks are assumed to be deterministic and idempotent. Thus, by default, objects created by actor tasks are not reconstructable. To allow reconstruction of actor task results, set the
max_task_retriesparameter to a non-zero value (see actor fault tolerance for more details).
Tasks will only be re-executed up to their maximum number of retries. By default, a non-actor task can be retried up to 3 times and an actor task cannot be retried. This can be overridden with the
max_retriesparameter for remote functions and the
max_task_retriesparameter for actors.
The owner of the object must still be alive (see below).
Lineage reconstruction can cause higher than usual driver memory
usage because the driver keeps the descriptions of any tasks that may be
re-executed in case of failure. To limit the amount of memory used by
lineage, set the environment variable
RAY_max_lineage_bytes (default 1GB)
to evict lineage if the threshold is exceeded.
To disable lineage reconstruction entirely, set the environment variable
ray start or
ray.init. With this
setting, if there are no copies of an object left, an
Recovering from owner failure#
The owner of an object can die because of node or worker process failure.
Currently, Ray does not support recovery from owner failure. In this case, Ray
will clean up any remaining copies of the object’s value to prevent a memory
leak. Any workers that subsequently try to get the object’s value will receive
OwnerDiedError exception, which can be handled manually.
Ray throws an
ObjectLostError to the application when an object cannot be
retrieved due to application or system error. This can occur during a
ray.get() call or when fetching a task’s arguments, and can happen for a
number of reasons. Here is a guide to understanding the root cause for
different error types:
OwnerDiedError: The owner of an object, i.e., the Python worker that first created the
ray.put(), has died. The owner stores critical object metadata and an object cannot be retrieved if this process is lost.
ObjectReconstructionFailedError: This error is thrown if an object, or another object that this object depends on, cannot be reconstructed due to one of the limitations described above.
ReferenceCountingAssertionError: The object has already been deleted, so it cannot be retrieved. Ray implements automatic memory management through distributed reference counting, so this error should not happen in general. However, there is a known edge case that can produce this error.
ObjectFetchTimedOutError: A node timed out while trying to retrieve a copy of the object from a remote node. This error usually indicates a system-level bug. The timeout period can be configured using the
RAY_fetch_fail_timeout_millisecondsenvironment variable (default 10 minutes).
ObjectLostError: The object was successfully created, but no copy is reachable. This is a generic error thrown when lineage reconstruction is disabled and all copies of the object are lost from the cluster.