RPC Fault Tolerance#
All RPCs added to Ray Core should be fault tolerant and use the retryable gRPC client. Ideally, they should be idempotent, or at the very least, the lack of idempotency should be documented and the client must be able to take retries into account. If you aren’t familiar with what idempotency is, consider a function that writes “hello” to a file. On retry, it writes “hello” again, resulting in “hellohello”. This isn’t idempotent. To make it idempotent, you could check the file contents before writing “hello” again, ensuring that the observable state after multiple identical function calls is the same as after a single call.
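As a minimal sketch of that file-writing example in plain Python (the function names are illustrative, not part of any Ray API):

```python
def append_hello(path):
    # Not idempotent: every call (including a retry) appends another "hello",
    # so two calls leave "hellohello" behind.
    with open(path, "a") as f:
        f.write("hello")


def write_hello_idempotent(path):
    # Idempotent: check the observable state first, so one call or many calls
    # leave the file in the same state ("hello").
    try:
        with open(path) as f:
            if "hello" in f.read():
                return
    except FileNotFoundError:
        pass
    with open(path, "a") as f:
        f.write("hello")
```

Checking observable state before acting is the same property the RPC handlers discussed below need.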
This guide walks you through a case study of an RPC that wasn't fault tolerant or idempotent, and how it was fixed. By the end of this guide, you should understand what to look for when adding new RPCs and which testing methods to use to verify fault tolerance.
Case study: RequestWorkerLease#
Problem#
Prior to the fix described here, RequestWorkerLease could not be made retryable because
its handler in the Raylet was not idempotent.
This was because, once a lease was granted, it was considered occupied until ReturnWorker
was called. Until that RPC was called, the worker and its resources were never returned to the pool
of available workers and resources. The raylet treated the original RPC and its retry as
two fresh lease requests and couldn't deduplicate them.
For example, consider the following sequence of operations:
1. Request a new worker lease through RequestWorkerLease (Owner → Raylet).
2. The response is lost (Raylet → Owner).
3. Retry RequestWorkerLease (Owner → Raylet).
4. Two sets of resources and workers are now granted: one for the original request and one for the retry.
On the retry, the raylet should detect that the lease request is a retry and forward the already leased worker address to the owner so a second lease isn’t granted.
Solution#
To implement idempotency, a unique identifier called LeaseID was added in
PR #55469, which allowed for the deduplication
of incoming lease requests. Once a lease is granted, it's tracked in a leased_workers map
that maps lease IDs to workers. If an incoming lease request's LeaseID is already present in the
leased_workers map, the raylet knows the request is a retry and responds with the
already leased worker's address.
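The real handler lives in C++ in the raylet; the following Python sketch only illustrates the deduplication idea, with made-up names (LeaseManagerSketch, handle_request_worker_lease) and a toy worker pool:

```python
class LeaseManagerSketch:
    """Illustrative only: shows how a unique lease ID makes the handler idempotent."""

    def __init__(self):
        self.leased_workers = {}  # lease_id -> address of the worker granted for it.
        self.available_workers = ["worker-1:10001", "worker-2:10002"]

    def handle_request_worker_lease(self, lease_id):
        # Retry case: the lease was already granted, so reply with the same
        # worker address instead of granting a second worker.
        if lease_id in self.leased_workers:
            return self.leased_workers[lease_id]
        # First time this lease ID is seen: grant a worker and remember it.
        worker_address = self.available_workers.pop()
        self.leased_workers[lease_id] = worker_address
        return worker_address


manager = LeaseManagerSketch()
first = manager.handle_request_worker_lease("lease-abc")
retry = manager.handle_request_worker_lease("lease-abc")  # Simulated retry after a lost response.
assert first == retry and len(manager.leased_workers) == 1  # Only one worker is leased.
```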
Retryable gRPC client#
The retryable gRPC client was updated during the RPC fault tolerance project. This section describes how it works and some gotchas to watch out for.
For a basic introduction, read the retryable_grpc_client.h comment.
How it works#
The retryable gRPC client works as follows:
1. RPCs are sent using the retryable gRPC client.
2. If the client encounters a transient gRPC network error, it pushes the callback into a queue.
3. Several checks are done on a periodic basis:
   - Cheap gRPC channel state check: This checks the state of the gRPC channel to see whether the system can start sending messages again. This check happens every second by default, but is configurable through check_channel_status_interval_milliseconds.
   - Potentially expensive GCS node status check: If the exponential backoff period has passed and the channel is still down, the system calls server_unavailable_timeout_callback_. This callback is set in the client pool classes (raylet_client_pool, core_worker_client_pool). It checks if the client is subscribed for node status updates, and then checks the local subscriber cache to see whether a node death notification from the GCS has been received. If the client isn't subscribed or if there's no status for the node in the cache, it makes an RPC to the GCS. Note that for the GCS client, the server_unavailable_timeout_callback_ kills the process once called. This happens after gcs_rpc_server_reconnect_timeout_s seconds (60 by default).
   - Per-RPC timeout check: There's a timeout check that's customizable per RPC, but it's functionally disabled because it's always set to -1 (infinity) for each RPC.
4. With each additional failed RPC, the exponential backoff period is increased, agnostic of the type of RPC that fails. The backoff period caps out at a max that you can customize for the core worker and raylet clients using either the core_worker_rpc_server_reconnect_timeout_max_s or raylet_rpc_server_reconnect_timeout_max_s config options. The GCS client doesn't have a max backoff period, as noted above.
5. Once the channel check succeeds, the exponential backoff period is reset and all RPCs in the queue are retried.
6. If the system successfully receives a node death notification (either through subscription or by querying the GCS directly), it destroys the RPC client, which posts each callback to the I/O context with a gRPC Disconnected error.
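The real implementation is the C++ RetryableGrpcClient; the Python sketch below only models the queueing, backoff, and node-death behavior described above, with simplified, made-up names and numbers:

```python
class RetryableClientSketch:
    """Illustrative model of the per-client retry queue and exponential backoff."""

    def __init__(self, backoff_base_s=1.0, backoff_max_s=60.0):
        self.channel_up = True            # Stand-in for the gRPC channel state check.
        self.pending_callbacks = []       # Callbacks queued on transient network errors.
        self.backoff_base_s = backoff_base_s
        self.backoff_max_s = backoff_max_s
        self.backoff_s = backoff_base_s   # Current exponential backoff period.

    def call(self, callback):
        if self.channel_up:
            callback("OK")
            return
        # Transient network error: queue the callback and grow the backoff,
        # regardless of which RPC failed, capped at the configured max.
        self.pending_callbacks.append(callback)
        self.backoff_s = min(self.backoff_s * 2, self.backoff_max_s)

    def on_periodic_check(self):
        # Cheap channel state check; once it succeeds, reset the backoff and
        # retry every queued RPC in order.
        if self.channel_up and self.pending_callbacks:
            self.backoff_s = self.backoff_base_s
            queued, self.pending_callbacks = self.pending_callbacks, []
            for callback in queued:
                self.call(callback)

    def on_node_death(self):
        # A node death notification destroys the client: every queued callback
        # is posted with a Disconnected error.
        for callback in self.pending_callbacks:
            callback("Disconnected")
        self.pending_callbacks.clear()


client = RetryableClientSketch()
client.channel_up = False
client.call(lambda status: print("RPC A:", status))  # Queued.
client.call(lambda status: print("RPC B:", status))  # Queued behind RPC A.
client.channel_up = True
client.on_periodic_check()  # Retries A then B; prints "RPC A: OK", "RPC B: OK".
```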
Important considerations#
A few important points to keep in mind:
- Per-client queuing: Each retryable gRPC client is unique to the client (WorkerID for core worker clients, NodeID for raylet clients), not to the type of RPC. If you first submit RPC A that fails due to a transient network error, then RPC B to the same client that also fails due to a transient network error, the queue has two items: RPC A then RPC B. There isn't a separate queue on an RPC basis, only on a client basis.
- Client-level timeouts: Each timeout needs to wait for the previous timeout to complete. If both RPC A and RPC B are submitted in short succession, then RPC A waits 1 second in total, and RPC B waits 1 + 2 = 3 seconds in total. The type of RPC doesn't matter; all RPCs are treated the same. The reasoning is that transient network errors aren't RPC specific: if RPC A sees a network failure, you can assume that RPC B, if sent to the same client, will experience the same failure. Hence, the time that an RPC waits is the sum of the timeouts of all the previous RPCs in the queue plus its own timeout.
- Destructor behavior: In the destructor for RetryableGrpcClient, the system fails all pending RPCs by posting their callbacks to the I/O context. These callbacks should ideally never modify state held by the client classes such as RayletClient. If that's absolutely necessary, they must check that the client is still alive somehow, such as by using a weak pointer (see the sketch after this list); an example is in PR #58744. The application code should also take the Disconnected error into account.
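The C++ fix uses a weak pointer; as a rough Python analogue (illustrative names only), a callback can hold a weak reference to the client and bail out if the client has already been destroyed:

```python
import weakref


class ClientSketch:
    """Stands in for a client class like RayletClient (illustrative only)."""

    def __init__(self):
        self.inflight_count = 0


def make_reply_callback(client):
    # Hold only a weak reference so the callback neither keeps the client alive
    # nor touches it after it has been destroyed.
    client_ref = weakref.ref(client)

    def on_reply(status):
        live_client = client_ref()
        if live_client is None:
            # The client was already destroyed (e.g., during RetryableGrpcClient
            # teardown); don't modify its state, just handle the error here.
            return
        live_client.inflight_count -= 1

    return on_reply


client = ClientSketch()
client.inflight_count += 1
callback = make_reply_callback(client)
del client                 # Simulate the client being destroyed before the reply arrives.
callback("Disconnected")   # Safe: the weak reference is dead, so no state is touched.
```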
Testing RPC fault tolerance#
Ray Core has three layers of testing for RPC fault tolerance and idempotency.
C++ unit tests#
For each RPC, there should be some form of C++ idempotency test that calls the HandleX server
function twice and checks that the same result is returned each time. Tests should also account for
different state changes between the two HandleX calls. For example, for
RequestWorkerLease, a C++ unit test was written to model the situation where the retry arrives
while the initial lease request is stuck in the args-pulling stage.
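The actual tests are C++ (the HandleX functions are C++ handler methods); the following Python-flavored sketch only illustrates the pattern of calling a toy handler twice with a state change in between, and it returns a "pending" status instead of holding the request open the way the real long-polling handler does:

```python
class PendingLeaseHandlerSketch:
    """Toy handler: a lease can be pending (e.g., pulling args) before a worker is granted."""

    def __init__(self):
        self.pending = set()   # Lease IDs still waiting on args.
        self.granted = {}      # lease_id -> worker address.
        self.workers = ["worker-1:10001", "worker-2:10002"]

    def handle_request_worker_lease(self, lease_id):
        if lease_id in self.granted:
            return ("granted", self.granted[lease_id])
        if lease_id in self.pending:
            # A retry arrived while the original request is still pending:
            # don't start a second lease.
            return ("pending", None)
        self.pending.add(lease_id)
        return ("pending", None)

    def finish_args_pull(self, lease_id):
        self.pending.discard(lease_id)
        self.granted[lease_id] = self.workers.pop()


def test_retry_while_args_are_pulling():
    handler = PendingLeaseHandlerSketch()
    assert handler.handle_request_worker_lease("lease-abc") == ("pending", None)  # Original.
    assert handler.handle_request_worker_lease("lease-abc") == ("pending", None)  # Retry.
    handler.finish_args_pull("lease-abc")
    assert handler.handle_request_worker_lease("lease-abc") == ("granted", "worker-2:10002")
    assert len(handler.workers) == 1  # Only one worker was consumed for both requests.


test_retry_while_args_are_pulling()
```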
Python integration tests#
For each RPC, there should ideally also be a Python integration test if one is straightforward to write. Some RPCs are hard to test fully deterministically through Python APIs, so sufficient C++ unit testing can act as a good proxy. Integration tests are therefore more of a nice-to-have, but they're valuable because they also act as examples of how a user could run into idempotency issues.
The main testing mechanism uses the RAY_testing_rpc_failure config option, which allows
you to:
- Trigger the RPC callback immediately with a gRPC error without sending the RPC (simulating a request failure).
- Trigger the RPC callback with a gRPC error once the response arrives from the server (simulating a response failure).
- Trigger the RPC callback immediately with a gRPC error but send the RPC to the server as well (simulating an in-flight failure, where the retry should ideally hit the server while it's executing the server code for long-polling RPCs).
For more details, see the comment in the Ray config file at ray_config_def.h.
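As a hedged sketch of what such an integration test can look like (pytest-style; the RPC method name and the value format of RAY_testing_rpc_failure used here are assumptions, so check the comment in ray_config_def.h for the real syntax):

```python
import ray

# ASSUMPTION: illustrative RPC name and value format ("<method>=<max failures>:<request
# failure %>:<response failure %>"); the authoritative syntax is documented next to
# RAY_testing_rpc_failure in ray_config_def.h.
RPC_FAILURE_SPEC = "NodeManagerService.grpc_client.RequestWorkerLease=1:50:50"


def test_task_submission_with_injected_rpc_failures(monkeypatch):
    # The env var must be set before the Ray processes start so they pick it up.
    monkeypatch.setenv("RAY_testing_rpc_failure", RPC_FAILURE_SPEC)
    ray.init(num_cpus=1)
    try:

        @ray.remote
        def f():
            return 1

        # Scheduling a task forces a worker lease; with failure injection enabled,
        # the lease RPC can be retried and must not leak a second lease or hang.
        assert ray.get(f.remote()) == 1
    finally:
        ray.shutdown()
```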
Chaos network release tests#
IP table blackout#
The IP table blackout approach involves SSH-ing into each node and using IP table rules to black out network traffic for a small amount of time (5 seconds) to simulate transient network errors. The IP table script runs in the background, periodically (every 60 seconds) causing network blackouts while the test script executes.
For core release tests, we’ve added the IP table blackout approach to all existing chaos release tests in PR #58868.
Note
Initially, Amazon FIS was considered. However, it has a 60-second minimum that, combined with configuration settings, caused node deaths that were hard to debug, so the IP table approach was simpler and more flexible to use.