Note
Ray 2.10.0 introduces the alpha stage of RLlib’s “new API stack”. The team is currently transitioning algorithms, example scripts, and documentation to the new code base throughout the subsequent minor releases leading up to Ray 3.0.
See here for more details on how to activate and use the new API stack.
Hierarchical Environments#
You can implement hierarchical training as a special case of multi-agent RL. For example, consider a two-level hierarchy of policies, where a top-level policy issues high-level tasks that one or more low-level policies execute at a finer timescale. The following timeline shows one step of the top-level policy, which corresponds to four low-level actions:
top-level: action_0 -------------------------------------> action_1 ->
low-level: action_0 -> action_1 -> action_2 -> action_3 -> action_4 ->
Alternatively, you could implement an environment in which the two agent types don't act at the same time. Instead, the low-level agents wait for the high-level agent to issue an action, then act n times before handing control back to the high-level agent:
top-level: action_0 -----------------------------------> action_1 ->
low-level: ---------> action_0 -> action_1 -> action_2 ------------>
You can implement any of these hierarchical action patterns as a multi-agent environment with different agent types, for example a high-level agent and a low-level agent. With the correct agent-to-module mapping function in place, the problem becomes, from RLlib's perspective, a simple independent multi-agent problem with different types of policies.
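For example, the turn-taking pattern above could be sketched as a custom MultiAgentEnv along the following lines. The class name HierarchicalEnv, the agent IDs "top_level" and "low_level_0", and the underscore-prefixed helper methods are placeholders for this sketch, and space definitions as well as the actual simulation are omitted. The key mechanism is that only the agents present in the returned observation dict are asked for actions on the next step:

from ray.rllib.env.multi_agent_env import MultiAgentEnv


class HierarchicalEnv(MultiAgentEnv):
    """Sketch of the turn-taking pattern: the two agent types never act at the same time."""

    def __init__(self, config=None):
        super().__init__()
        # Per-agent observation/action space definitions and the underlying
        # simulation are omitted in this sketch.
        self.num_low_level_steps = 4  # The low-level agent acts 4 times per goal.

    def reset(self, *, seed=None, options=None):
        self.low_level_steps_taken = 0
        self.current_goal = None
        # Only the high-level agent receives an observation, so RLlib queries
        # only its policy for the first action.
        return {"top_level": self._top_level_obs()}, {}

    def step(self, action_dict):
        obs, rewards = {}, {}
        if "top_level" in action_dict:
            # The high-level action becomes the current goal; hand control to
            # the low-level agent for the next few steps.
            self.current_goal = action_dict["top_level"]
            self.low_level_steps_taken = 0
            obs["low_level_0"] = self._low_level_obs()
        else:
            # A low-level action arrived: advance the underlying simulation.
            self._apply_low_level_action(action_dict["low_level_0"])
            self.low_level_steps_taken += 1
            rewards["low_level_0"] = self._low_level_reward()
            if self.low_level_steps_taken < self.num_low_level_steps:
                obs["low_level_0"] = self._low_level_obs()
            else:
                # Hand control back to the high-level agent.
                obs["top_level"] = self._top_level_obs()
                rewards["top_level"] = self._top_level_reward()
        terminateds = {"__all__": self._episode_done()}
        truncateds = {"__all__": False}
        return obs, rewards, terminateds, truncateds, {}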
Your configuration might look something like the following:
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .multi_agent(
        # One policy per hierarchy level.
        policies={"top_level", "low_level"},
        # Map every agent whose ID starts with "low_level" to the low-level
        # policy; map all other agents to the top-level policy.
        policy_mapping_fn=(
            lambda aid, eps, **kw: (
                "low_level" if aid.startswith("low_level") else "top_level"
            )
        ),
        # In this example, only the top-level policy is trained.
        policies_to_train=["top_level"],
    )
)
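Assuming the hypothetical HierarchicalEnv sketch from above, you could then attach the environment to this config and train as usual:

config = config.environment(env=HierarchicalEnv)

algo = config.build()
for _ in range(3):
    print(algo.train())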
In this setup, your multi-agent env implementation must provide the appropriate rewards at every hierarchy level. The environment class is also responsible for routing information between agents, for example conveying goals from higher-level agents to lower-level agents as part of the lower-level agents' observations.
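For instance, the underscore-prefixed helpers from the HierarchicalEnv sketch above could be implemented along these lines. The goal concatenation and the distance-based reward shaping are illustrative assumptions, not something RLlib prescribes:

import numpy as np


def low_level_obs(sim_obs, current_goal):
    # The low-level agent observes the raw simulation state plus the goal
    # that the high-level agent issued most recently.
    return np.concatenate(
        [np.asarray(sim_obs), np.asarray(current_goal)]
    ).astype(np.float32)


def low_level_reward(sim_obs, current_goal):
    # Example shaping: reward the low-level agent for closing the distance
    # to its assigned goal.
    return -float(np.linalg.norm(np.asarray(sim_obs) - np.asarray(current_goal)))


def top_level_reward(task_rewards_since_last_goal):
    # The high-level agent receives the task reward accumulated over the
    # low-level steps it delegated.
    return float(sum(task_rewards_since_last_goal))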