The Policy class contains functionality to compute actions for decision making in an environment, compute loss(es) and gradients, update a neural network model, and postprocess a collected environment trajectory. One or more Policy objects sit inside a RolloutWorker’s PolicyMap. If there is more than one, the policy used for a given agent is selected by a multi-agent policy_mapping_fn, which maps each agent ID to a policy ID.
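To make the agent-to-policy routing concrete, here is a minimal, framework-free sketch. It assumes a plain dict as the policy map and a toy policy class. `ToyPolicy`, `policy_map`, `policy_mapping_fn`, and `compute_actions` below are hypothetical illustrations, not RLlib's API. In RLlib, the real policy_mapping_fn may receive additional arguments depending on the Ray version.

```python
from typing import Dict, List


class ToyPolicy:
    """Hypothetical stand-in for a Policy; only action computation is modeled."""

    def __init__(self, name: str, fixed_action: int):
        self.name = name
        self.fixed_action = fixed_action

    def compute_action(self, observation: List[float]) -> int:
        # A real Policy would run its neural network model on the observation.
        return self.fixed_action


# Hypothetical policy map: policy ID -> policy object.
policy_map: Dict[str, ToyPolicy] = {
    "attacker": ToyPolicy("attacker", 1),
    "defender": ToyPolicy("defender", 0),
}


def policy_mapping_fn(agent_id: str) -> str:
    # Hypothetical rule: agents whose ID starts with "red" share one policy,
    # all other agents share the other.
    return "attacker" if agent_id.startswith("red") else "defender"


def compute_actions(observations: Dict[str, List[float]]) -> Dict[str, int]:
    # Route each agent's observation to the policy its ID maps to.
    actions = {}
    for agent_id, obs in observations.items():
        policy_id = policy_mapping_fn(agent_id)
        actions[agent_id] = policy_map[policy_id].compute_action(obs)
    return actions


print(compute_actions({"red_0": [0.1], "blue_0": [0.2], "red_1": [0.3]}))
# {'red_0': 1, 'blue_0': 0, 'red_1': 1}
```

Several agents can map to the same policy ID, so they share that policy's model; this is why the mapping goes from agent IDs to policy IDs rather than one-to-one.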


RLlib’s Policy class hierarchy: Policies are specific to a deep-learning framework because they hold functionality to handle a computation graph (e.g. a TensorFlow 1.x graph in a session). You can define custom policy behavior by subclassing any of the available built-in classes, depending on your needs.
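The subclassing pattern can be sketched in plain Python. `BasePolicy` and `GreedyPolicy` below are hypothetical. Their method names are loosely modeled on Policy's responsibilities (computing actions, postprocessing a trajectory), but they do not match the signatures of RLlib's framework-specific base classes. The sketch only shows how a subclass overrides selected hooks and inherits the rest.

```python
from typing import List


class BasePolicy:
    """Hypothetical built-in base class standing in for a framework-specific Policy."""

    def compute_actions(self, obs_batch: List[List[float]]) -> List[int]:
        raise NotImplementedError

    def postprocess_trajectory(self, rewards: List[float]) -> List[float]:
        # Default behavior: return the trajectory's rewards unchanged.
        return rewards


class GreedyPolicy(BasePolicy):
    """Custom behavior: override only the hooks you need."""

    def compute_actions(self, obs_batch: List[List[float]]) -> List[int]:
        # Pick the index of the largest observation entry as the action.
        return [max(range(len(obs)), key=obs.__getitem__) for obs in obs_batch]

    def postprocess_trajectory(self, rewards: List[float], gamma: float = 0.9) -> List[float]:
        # Replace raw rewards with discounted returns, scanning backwards.
        returns, running = [], 0.0
        for r in reversed(rewards):
            running = r + gamma * running
            returns.append(running)
        return returns[::-1]


policy = GreedyPolicy()
print(policy.compute_actions([[0.1, 0.9, 0.3], [2.0, 1.0]]))  # [1, 0]
print(policy.postprocess_trajectory([1.0, 1.0], gamma=0.5))  # [1.5, 1.0]
```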