Implementing a custom tensor transport (Advanced)#
Ray Direct Transport (RDT) allows you to register custom tensor transports at runtime.
This page explains how to implement a custom tensor transport by implementing the TensorTransportManager abstract interface.
Overview#
To create a custom tensor transport:
Implement the abstract interface
ray.experimental.TensorTransportManager.Define custom metadata classes by extending
TensorTransportMetadataandCommunicatorMetadata.Register your transport using
ray.experimental.register_tensor_transport.
When Ray needs to transfer a tensor between actors using your transport, it calls specific methods on your TensorTransportManager implementation at different stages of the transfer lifecycle.
Implementing TensorTransportManager#
The TensorTransportManager abstract class defines the interface for custom tensor transports. You must implement all abstract methods.
The following diagram shows when each method is called during a tensor transfer:
Source Actor Owner Process Destination Actor
============ ============= =================
| | |
1. Task returns tensor | |
``extract_tensor_transport_metadata`` |
| | |
| ---- transport_metadata ----> | |
| | |
| 2. Prepare communicator |
| ``get_communicator_metadata`` |
| | |
| <---- comm metadata --------- | ---- comm metadata --------> |
| | |
3. ``send_multiple_tensors`` | 3. ``recv_multiple_tensors``
| |
| ------------ tensors ---------------------------------------> |
| | |
| (transfer complete) |
| | |
| 5. Ref goes out of scope |
| <---------------------------- | |
5. Clean up resources | |
``garbage_collect`` | |
Note that Ray will not call send_multiple_tensors for one-sided transports.
The following diagram shows where each method is called in the ray.put / ray.get case supported by one-sided transports.
Source Actor Destination Actor
============ =================
| |
1. User ``ray.put``'s tensor |
``extract_tensor_transport_metadata`` |
| |
| |
2. User passes ref to another actor |
| ---- transport_metadata ----------------------------------> |
| |
| |
| 3. User ``ray.get``'s on object ref
``get_communicator_metadata``
| ``recv_multiple_tensors``
| ------------ tensors --------- -----------------------------> |
| |
| (transfer complete) |
| |
4. Clean up resources |
``garbage_collect`` |
(when ref goes out of scope) |
The API reference page for TensorTransportManager has more details on what each method does and how to implement them.
See implementations of Ray’s default transports (NCCL, NIXL, etc.) in the python/ray/experimental/rdt/ directory.
The following is an walk-through for implementing and using a custom tensor transport.
Registering your transport#
After implementing your transport, the driver process must register it with ray.experimental.register_tensor_transport before creating any actors that use it:
register_tensor_transport(
"shared_memory", # Transport name
["cpu"], # Supported device types
SharedMemoryTransport, # TensorTransportManager class
numpy.ndarray, # Data type for this transport
)
@ray.remote
class MyActor:
@ray.method(tensor_transport="shared_memory")
def echo(self, data):
return data
def sum(self, data):
return data.sum().item()
actors = [MyActor.remote() for _ in range(2)]
ref = actors[0].echo.remote(numpy.array([1, 2, 3]))
result = actors[1].sum.remote(ref)
print(ray.get(result))
# 6
Limitations#
Custom tensor transports have the following limitations:
Actor restarts aren’t supported. Your actor doesn’t have access to the custom transport after a restart.
Register transports before actor creation. If you register a transport after creating an actor, that actor can’t use the new transport.
Out-of-order actors If you have an out-of-order actor (such as an async actor) and the process where you submit the actor task is different from where you created the actor, Ray can’t guarantee it has registered your custom transport on the actor at task execution time.
Actor creation and task submission from different processes If the process where you submit an actor task is different from where you created the actor, Ray can’t guarantee it has registered your custom transport on the actor at task execution time.
For general RDT limitations, see limitations.
Also feel free to reach out through GitHub issues or the Ray Slack to ask any questions.