ray.rllib.algorithms.algorithm_config.AlgorithmConfig.evaluation
- AlgorithmConfig.evaluation(*, evaluation_interval: int | None = <ray.rllib.utils.from_config._NotProvided object>, evaluation_duration: int | str | None = <ray.rllib.utils.from_config._NotProvided object>, evaluation_duration_unit: str | None = <ray.rllib.utils.from_config._NotProvided object>, evaluation_sample_timeout_s: float | None = <ray.rllib.utils.from_config._NotProvided object>, evaluation_parallel_to_training: bool | None = <ray.rllib.utils.from_config._NotProvided object>, evaluation_force_reset_envs_before_iteration: bool | None = <ray.rllib.utils.from_config._NotProvided object>, evaluation_config: ~ray.rllib.algorithms.algorithm_config.AlgorithmConfig | dict | None = <ray.rllib.utils.from_config._NotProvided object>, off_policy_estimation_methods: ~typing.Dict | None = <ray.rllib.utils.from_config._NotProvided object>, ope_split_batch_by_episode: bool | None = <ray.rllib.utils.from_config._NotProvided object>, evaluation_num_env_runners: int | None = <ray.rllib.utils.from_config._NotProvided object>, custom_evaluation_function: ~typing.Callable | None = <ray.rllib.utils.from_config._NotProvided object>, always_attach_evaluation_results=-1, evaluation_num_workers=-1) → AlgorithmConfig
Sets the config’s evaluation settings.
- Parameters:
  - evaluation_interval – Evaluate with every `evaluation_interval` training iterations. The evaluation stats are reported under the "evaluation" metric key. Set to None (or 0) for no evaluation.
  - evaluation_duration – Duration for which to run evaluation each `evaluation_interval`. The unit for the duration can be set via `evaluation_duration_unit` to either "episodes" (default) or "timesteps". If using multiple evaluation workers (EnvRunners), i.e. `evaluation_num_env_runners > 1`, the number of episodes/timesteps to run is split among them. The special value "auto" can be used when `evaluation_parallel_to_training=True`; this is the recommended way to spend as little extra time on evaluation as possible: the Algorithm then runs as many timesteps via the evaluation workers as fit into the parallelly running training step, never wasting idle time on either the training or the evaluation workers. When using `evaluation_duration="auto"`, it is strongly advised to also set `evaluation_interval=1` and `evaluation_force_reset_envs_before_iteration=True` (see the second sketch below).
  - evaluation_duration_unit – The unit in which to count the evaluation duration. Either "episodes" (default) or "timesteps". Note that this setting is ignored if `evaluation_duration="auto"`.
  - evaluation_sample_timeout_s – The timeout (in seconds) for evaluation workers to sample a complete episode when `evaluation_duration != "auto"` and `evaluation_duration_unit="episodes"`. After this time, the user receives a warning and instructions on how to fix the issue.
  - evaluation_parallel_to_training – Whether to run evaluation in parallel to the `Algorithm.training_step()` call, using threading. Default is False. For example, with `evaluation_interval=1`, every call to `Algorithm.train()` runs `Algorithm.training_step()` and `Algorithm.evaluate()` in parallel. Note that this setting, albeit extremely efficient because it wastes no extra time on evaluation, causes the evaluation results to lag one iteration behind the rest of the training results. This matters when picking a good checkpoint: for example, if iteration 42 reports a good evaluation `episode_return_mean`, be aware that these results were achieved with the weights trained in iteration 41, so you should probably pick the iteration 41 checkpoint instead.
  - evaluation_force_reset_envs_before_iteration – Whether all environments should be force-reset (even if they are not done yet) right before the evaluation step of the iteration begins. Setting this to True (default) makes sure the evaluation results aren't polluted with episode statistics that were actually (at least partially) achieved with an earlier set of weights. Note that this setting is only supported on the new API stack with EnvRunners and ConnectorV2 (`config.enable_rl_module_and_learner=True` AND `config.enable_env_runner_and_connector_v2=True`).
  - evaluation_config – Typical usage is to pass extra args to the evaluation env creator and to disable exploration by computing deterministic actions (see the first sketch below). IMPORTANT NOTE: Policy gradient algorithms are able to find the optimal policy even if this is a stochastic one. Setting `explore=False` here results in the evaluation workers not using this optimal policy!
  - off_policy_estimation_methods – Specify how to evaluate the current policy, along with any optional config parameters. This only has an effect when reading offline experiences ("input" is not "sampler"). Available keys: `{ope_method_name: {"type": ope_type, ...}}`, where `ope_method_name` is a user-defined string under which to save the OPE results, and `ope_type` can be any subclass of OffPolicyEstimator (e.g. ray.rllib.offline.estimators.is::ImportanceSampling, your own custom subclass, or the full class path to the subclass). You can also add additional config arguments to be passed to the OffPolicyEstimator in the dict, e.g. `{"qreg_dr": {"type": DoublyRobust, "q_model_type": "qreg", "k": 5}}` (expanded in the third sketch below).
  - ope_split_batch_by_episode – Whether to use `SampleBatch.split_by_episode()` to split the input batch into episodes before estimating the OPE metrics. The default is True. For bandits, set this to False to speed up OPE evaluation; not splitting by episode is fine there, since each record is already a single timestep.
  - evaluation_num_env_runners – Number of parallel EnvRunners to use for evaluation. This is set to zero by default, which means evaluation runs in the algorithm process (only if `evaluation_interval` is not 0 or None). Increasing this also increases the Ray resource usage of the Algorithm, since the evaluation workers are created in addition to the EnvRunners used to sample data for training.
  - custom_evaluation_function – Customize the evaluation method. This must be a function with the signature `(algo: Algorithm, eval_workers: EnvRunnerGroup) -> (metrics: dict, env_steps: int, agent_steps: int)` (or `-> metrics: dict` if `enable_env_runner_and_connector_v2=True`), where `env_steps` and `agent_steps` define the number of sampled steps during the evaluation iteration. See the `Algorithm.evaluate()` method for the default implementation. The Algorithm guarantees that all eval workers have the latest policy state before this function is called.
- Returns:
This updated AlgorithmConfig object.
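A minimal usage sketch, assuming PPO and the Gymnasium CartPole-v1 environment (both illustrative choices, not implied by this method): evaluate every training iteration on two dedicated evaluation EnvRunners, running ten episodes per evaluation and acting deterministically on the evaluation side only.

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    .evaluation(
        evaluation_interval=1,            # evaluate after every train iteration
        evaluation_num_env_runners=2,     # two dedicated evaluation workers
        evaluation_duration=10,           # ten episodes per evaluation, ...
        evaluation_duration_unit="episodes",  # ... split among the two workers
        # Settings here apply only to the evaluation workers, e.g. greedy
        # (deterministic) action computation.
        evaluation_config={"explore": False},
    )
)
algo = config.build()
results = algo.train()
# Evaluation stats are reported under the "evaluation" metric key.
print(results["evaluation"])
```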
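The recommended time-saving combination described under evaluation_duration (the second sketch referenced above) would look roughly like this, reusing the config object from the previous sketch:

```python
# Run evaluation in a thread parallel to Algorithm.training_step(), sampling
# for exactly as long as the training step takes ("auto" duration). Keep in
# mind that these eval results lag one iteration behind the training results.
config.evaluation(
    evaluation_interval=1,
    evaluation_duration="auto",
    evaluation_parallel_to_training=True,
    evaluation_force_reset_envs_before_iteration=True,
    evaluation_num_env_runners=2,
)
```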
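For offline inputs, the off_policy_estimation_methods example from the parameter description (the third sketch referenced above) expands to roughly the following; the "qreg_dr" key is a user-chosen name and the offline data path is hypothetical:

```python
from ray.rllib.offline.estimators import DoublyRobust

# OPE only takes effect when reading offline experiences, i.e. when "input"
# is not "sampler". The data path below is a placeholder.
config.offline_data(input_="/tmp/my-offline-data")
config.evaluation(
    off_policy_estimation_methods={
        # OPE results are saved under the user-defined key "qreg_dr".
        "qreg_dr": {"type": DoublyRobust, "q_model_type": "qreg", "k": 5},
    },
)
```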