ray.rllib.utils.policy.local_policy_inference

ray.rllib.utils.policy.local_policy_inference(policy: Policy, env_id: str, agent_id: str, obs: numpy.array | jnp.ndarray | tf.Tensor | torch.Tensor | dict | tuple, reward: float | None = None, terminated: bool | None = None, truncated: bool | None = None, info: Dict | None = None, explore: bool = None, timestep: int | None = None) → numpy.array | jnp.ndarray | tf.Tensor | torch.Tensor | dict | tuple

Run a connector-enabled policy on an environment observation.

local_policy_inference runs the policy together with its agent and action connectors, so the user does not have to handle RNN state buffering or extra fetch dictionaries. Connectors are intentionally run separately from compute_actions_from_input_dict(), which leaves the option of running per-user connectors on the client side in a server-client deployment.
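The division of labor described above can be modeled in plain Python. The sketch below is a hypothetical toy, not RLlib code: `ToyAgentConnector`, `ToyActionConnector`, `toy_compute_actions`, and `toy_policy_inference` are invented stand-ins. It only illustrates why running connectors outside the forward pass lets per-trajectory state (here, a fake RNN state) be buffered per `(env_id, agent_id)` key without the caller tracking it.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical trajectory key: one trajectory per (env_id, agent_id) pair.
TrajKey = Tuple[str, str]


class ToyAgentConnector:
    """Hypothetical agent connector that buffers a fake RNN state per trajectory."""

    def __init__(self, state_size: int = 2) -> None:
        self.state_size = state_size
        self.states: Dict[TrajKey, List[float]] = {}

    def __call__(self, key: TrajKey, obs: float) -> Dict[str, Any]:
        # Attach the buffered state (zeros for a new trajectory) to the input.
        state = self.states.get(key, [0.0] * self.state_size)
        return {"obs": obs, "state_in": state}

    def on_policy_output(self, key: TrajKey, state_out: List[float]) -> None:
        # Remember the new state so the next call for this key picks it up.
        self.states[key] = state_out


class ToyActionConnector:
    """Hypothetical action connector that clips actions into [-1, 1]."""

    def __call__(self, action: float) -> float:
        return max(-1.0, min(1.0, action))


def toy_compute_actions(input_dict: Dict[str, Any]) -> Tuple[float, List[float]]:
    # Hypothetical forward pass: the action adds the state to the observation,
    # and every state entry accumulates the observation.
    obs = input_dict["obs"]
    state = input_dict["state_in"]
    action = obs + sum(state)
    state_out = [s + obs for s in state]
    return action, state_out


def toy_policy_inference(
    agent_conn: ToyAgentConnector,
    action_conn: ToyActionConnector,
    env_id: str,
    agent_id: str,
    obs: float,
) -> float:
    key = (env_id, agent_id)
    input_dict = agent_conn(key, obs)                        # 1. agent connectors
    raw_action, state_out = toy_compute_actions(input_dict)  # 2. forward pass
    action = action_conn(raw_action)                         # 3. action connectors
    agent_conn.on_policy_output(key, state_out)              # 4. buffer the new state
    return action
```

With obs=0.25, two calls for ("env_0", "agent_0") return 0.25 and then 0.75, because the second call sees the buffered state; a call for ("env_1", "agent_0") starts again from a zero state and returns 0.25.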

Parameters:
  • policy – Policy object used in inference.

  • env_id – Environment ID. RLlib builds trajectories internally with connectors, keyed on this ID together with agent_id: one trajectory per (env_id, agent_id) tuple.

  • agent_id – Agent ID. RLlib builds trajectories internally with connectors, keyed on this ID together with env_id: one trajectory per (env_id, agent_id) tuple.

  • obs – Environment observation to base the action on.

  • reward – Reward that is potentially used during inference. If not required, may be left None. Some policies have ViewRequirements that require it. At the first inference step (for example, right after calling gym.Env.reset), this can be set to zero.

  • terminated – Terminated flag that is potentially used during inference. If not required, may be left None. Some policies have ViewRequirements that require this extra information.

  • truncated – Truncated flag that is potentially used during inference. If not required, may be left None. Some policies have ViewRequirements that require this extra information.

  • info – Info dict that is potentially used during inference. If not required, may be left None. Some policies have ViewRequirements that require it.

  • explore – Whether to pick an exploration action (True) or an exploitation action (False). Defaults to None, which uses the policy's config["explore"] setting.

  • timestep – The current (sampling) time step.
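The parameter descriptions above reduce to a simple rule: optional per-step values matter only when a policy's ViewRequirements ask for them, and explore=None defers to the policy config. The helper below is a hypothetical illustration of that rule, not RLlib's internal code; its name `build_step_inputs`, the plain-string dictionary keys, and the fallback to True when the config has no "explore" entry are all assumptions.

```python
from typing import Any, Dict, Optional


def build_step_inputs(
    obs: Any,
    reward: Optional[float] = None,
    terminated: Optional[bool] = None,
    truncated: Optional[bool] = None,
    info: Optional[Dict] = None,
    explore: Optional[bool] = None,
    policy_config: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: collect only the per-step inputs that were provided."""
    inputs: Dict[str, Any] = {"obs": obs}
    # Only None means "not provided"; a zero reward on the first step is kept.
    optional_values = (
        ("reward", reward),
        ("terminated", terminated),
        ("truncated", truncated),
        ("info", info),
    )
    for name, value in optional_values:
        if value is not None:
            inputs[name] = value
    # explore=None falls back to the config's "explore" entry
    # (assumed default True if the entry is missing).
    if explore is None:
        explore = (policy_config or {}).get("explore", True)
    inputs["explore"] = explore
    return inputs
```

For example, build_step_inputs([0.0], reward=0.0, policy_config={"explore": False}) keeps the zero reward and resolves explore to False.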

Returns:

A list of outputs from the policy forward pass.
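Because the return value is a list rather than a single action, the caller has to pick out the entry it needs. The sketch below shows one way to drive an episode and consume that list. It uses a hypothetical `ToyEnv` with a gymnasium-style `reset()`/`step()` interface and a hypothetical `stub_policy_inference` that only mimics the shape of the call (it always returns a one-element list); neither is part of RLlib. The loop passes the previous step's reward and flags alongside each new observation, with a zero reward on the first step.

```python
from typing import Any, Dict, List, Optional, Tuple


class ToyEnv:
    """Hypothetical env with a gymnasium-style reset()/step() interface."""

    def __init__(self, horizon: int = 3) -> None:
        self.horizon = horizon
        self.t = 0

    def reset(self) -> Tuple[int, Dict]:
        self.t = 0
        return self.t, {}

    def step(self, action: int) -> Tuple[int, float, bool, bool, Dict]:
        self.t += 1
        reward = float(action)
        terminated = False
        truncated = self.t >= self.horizon
        return self.t, reward, terminated, truncated, {}


def stub_policy_inference(
    env_id: str,
    agent_id: str,
    obs: int,
    reward: Optional[float] = None,
    terminated: Optional[bool] = None,
    truncated: Optional[bool] = None,
    info: Optional[Dict] = None,
) -> List[Any]:
    # Hypothetical stand-in for local_policy_inference: returns a list
    # with one output for the single (env_id, agent_id) trajectory.
    return [obs % 2]


def run_episode(env: ToyEnv) -> float:
    obs, info = env.reset()
    reward = 0.0  # no reward yet on the first step after reset()
    terminated = truncated = False
    total = 0.0
    while not (terminated or truncated):
        outputs = stub_policy_inference(
            "env_0", "agent_0", obs,
            reward=reward, terminated=terminated, truncated=truncated, info=info,
        )
        action = outputs[0]  # single trajectory, so take the only entry
        obs, reward, terminated, truncated, info = env.step(action)
        total += reward
    return total
```

With the default horizon of 3, run_episode(ToyEnv()) returns 1.0.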

PublicAPI (alpha): This API is in alpha and may change before becoming stable.