Wrappers¶
- coax.wrappers.TrainMonitor: Environment wrapper for monitoring the training process.
- coax.wrappers.FrameStacking: Wrapper that does frame stacking (see DQN paper).
- coax.wrappers.BoxActionsToReals: This wrapper decompactifies a Box action space to the reals.
- coax.wrappers.BoxActionsToDiscrete: This wrapper splits a Box action space into bins.
- coax.wrappers.MetaPolicyEnv: Wrap a gymnasium-style environment such that it may be used by a meta-policy, i.e. a bandit that selects a policy (an arm), which is then used to sample a lower-level action that is fed to the original environment.
Gymnasium provides a nice modular interface to extend existing environments using environment wrappers. Here we list some wrappers that are used throughout the coax package.
The most notable wrapper that you'll probably want to use is coax.wrappers.TrainMonitor. It wraps the environment in a way that lets us view our training logs easily. It uses both the standard logging module and tensorboard, through the tensorboardX package.
Object Reference¶
- class coax.wrappers.TrainMonitor(env, tensorboard_dir=None, tensorboard_write_all=False, log_all_metrics=False, smoothing=10, **logger_kwargs)[source]¶
Environment wrapper for monitoring the training process.
This wrapper logs some diagnostics at the end of each episode and it also gives us some handy attributes (listed below).
- Parameters:
env (gymnasium environment) – A gymnasium environment.
tensorboard_dir (str, optional) –
If provided, TrainMonitor will log all diagnostics to be viewed in tensorboard. To view these, point tensorboard to the same dir:
$ tensorboard --logdir {tensorboard_dir}
tensorboard_write_all (bool, optional) – You may record your training metrics using the record_metrics method. The tensorboard_write_all flag specifies whether to pass the metrics on to tensorboard immediately (True) or to wait and average them across the episode (False). The default setting (False) prevents tensorboard from being flooded with logs.
log_all_metrics (bool, optional) – Whether to log all metrics. If log_all_metrics=False, only a reduced set of metrics is logged.
smoothing (positive int, optional) –
The number of observations for smoothing the metrics. We use the following smooth update rule:
\[\begin{split}n\ &\leftarrow\ \min(\text{smoothing}, n + 1) \\ x_\text{avg}\ &\leftarrow\ x_\text{avg} + \frac{x_\text{obs} - x_\text{avg}}{n}\end{split}\]
**logger_kwargs – Keyword arguments to pass on to coax.utils.enable_logging().
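The smoothing rule above is simple enough to sketch in pure Python. The snippet below is a standalone illustration of the update formula, not TrainMonitor's actual code, and the function name is ours:

```python
def smooth_update(x_avg, n, x_obs, smoothing=10):
    """One application of the smoothing rule: n <- min(smoothing, n + 1),
    then move x_avg toward x_obs by a 1/n fraction of the gap."""
    n = min(smoothing, n + 1)
    x_avg = x_avg + (x_obs - x_avg) / n
    return x_avg, n
```

For the first few observations this reduces to a plain running average; once n reaches smoothing, it behaves like an exponential moving average with rate 1/smoothing.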
- Variables:
T (positive int) – Global step counter. This is not reset by env.reset(); use env.reset_global() instead.
ep (positive int) – Global episode counter. This is not reset by env.reset(); use env.reset_global() instead.
t (positive int) – Step counter within an episode.
G (float) – The return, i.e. the amount of reward accumulated from the start of the current episode.
avg_G (float) – The average return G, averaged over the past 100 episodes.
dt_ms (float) – The average wall time of a single step, in milliseconds.
- close()¶
Closes the wrapper and env.
- get_counters()[source]¶
Get the current state of all internal counters.
- Returns:
counter (dict) – The dict that contains the counters.
- get_metrics()[source]¶
Return the current state of the metrics.
- Returns:
metrics (dict) – A dict of metrics, of type {name <str>: value <float>}.
- load_counters(filepath)[source]¶
Restore the state of all internal counters.
- Parameters:
filepath (str) – The checkpoint file path.
- record_metrics(metrics)[source]¶
Record metrics during the training process.
These are used to print more diagnostics.
- Parameters:
metrics (dict) – A dict of metrics, of type {name <str>: value <float>}.
- render() RenderFrame | list[RenderFrame] | None ¶
Uses the render() of the env, which can be overwritten to change the returned data.
- save_counters(filepath)[source]¶
Store the current state of all internal counters.
- Parameters:
filepath (str) – The checkpoint file path.
- set_counters(counters)[source]¶
Restore the state of all internal counters.
- Parameters:
counters (dict) – The dict that contains the counters.
- property action_space: spaces.Space[ActType] | spaces.Space[WrapperActType]¶
Return the Env action_space unless it is overwritten, in which case the wrapper action_space is used.
- property observation_space: spaces.Space[ObsType] | spaces.Space[WrapperObsType]¶
Return the Env observation_space unless it is overwritten, in which case the wrapper observation_space is used.
- property render_mode: str | None¶
Returns the Env render_mode.
- property reward_range: tuple[SupportsFloat, SupportsFloat]¶
Return the Env reward_range unless it is overwritten, in which case the wrapper reward_range is used.
- property unwrapped: Env[ObsType, ActType]¶
Returns the base environment of the wrapper.
This will be the bare gymnasium.Env environment, underneath all layers of wrappers.
- class coax.wrappers.FrameStacking(env, num_frames)[source]¶
Wrapper that does frame stacking (see DQN paper).
This implementation is different from most implementations in that it doesn’t perform the stacking itself. Instead, it just returns a tuple of frames (untouched), which may be stacked downstream.
The benefit of this implementation is two-fold. First, it respects the gymnasium.spaces API, where each observation is truly an element of the observation space (this is not true of the gymnasium implementation, which uses a custom data class to maintain its minimal memory footprint). Second, this implementation is compatible with the jax.tree_util module, which means that we can feed it into jit-compiled functions directly.
Example
    import gymnasium
    env = gymnasium.make('PongNoFrameskip-v0')
    print(env.observation_space)  # Box(210, 160, 3)
    env = FrameStacking(env, num_frames=2)
    print(env.observation_space)  # Tuple((Box(210, 160, 3), Box(210, 160, 3)))
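The tuple-of-frames bookkeeping described above can be sketched in a few lines of standalone Python. This is an illustration of the idea, not coax's actual implementation; in particular, padding the stack by repeating the first observation is an assumption:

```python
from collections import deque

class TupleFrameStack:
    """Keep the last num_frames observations and expose them as a tuple."""

    def __init__(self, num_frames):
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_obs):
        # pad the stack with copies of the first observation
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_obs)
        return tuple(self.frames)

    def observe(self, obs):
        # drop the oldest frame, append the newest
        self.frames.append(obs)
        return tuple(self.frames)
```

Because the stacked observation is a plain tuple of untouched frames, it is a valid element of a Tuple space and a valid jax.tree_util pytree.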
- Parameters:
env (gymnasium-style environment) – The original environment to be wrapped.
num_frames (positive int) – Number of frames to stack.
- close()¶
Closes the wrapper and env.
- render() RenderFrame | list[RenderFrame] | None ¶
Uses the render() of the env, which can be overwritten to change the returned data.
- reset(**kwargs)[source]¶
Uses the reset() of the env, which can be overwritten to change the returned data.
- step(action)[source]¶
Uses the step() of the env, which can be overwritten to change the returned data.
- property action_space: spaces.Space[ActType] | spaces.Space[WrapperActType]¶
Return the Env action_space unless it is overwritten, in which case the wrapper action_space is used.
- property observation_space: spaces.Space[ObsType] | spaces.Space[WrapperObsType]¶
Return the Env observation_space unless it is overwritten, in which case the wrapper observation_space is used.
- property render_mode: str | None¶
Returns the Env render_mode.
- property reward_range: tuple[SupportsFloat, SupportsFloat]¶
Return the Env reward_range unless it is overwritten, in which case the wrapper reward_range is used.
- property unwrapped: Env[ObsType, ActType]¶
Returns the base environment of the wrapper.
This will be the bare gymnasium.Env environment, underneath all layers of wrappers.
- class coax.wrappers.BoxActionsToReals(env)[source]¶
This wrapper decompactifies a Box action space to the reals. This is required in order to be able to use a Gaussian policy.
In practice, the wrapped environment expects the input action \(a_\text{real}\in\mathbb{R}^n\) and then it compactifies it back to a Box of the right size:
\[a_\text{box}\ =\ \text{low} + (\text{high}-\text{low})\times\text{sigmoid}(a_\text{real})\]
Technically, the transformed space is still a Box, but that's only because we assume that the values lie between large but finite bounds, \(a_\text{real}\in[-10^{15}, 10^{15}]^n\).
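The compactification formula can be checked with a small standalone sketch (scalar case, pure Python; the function names are ours, not coax's):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def real_to_box(a_real, low, high):
    # squash an unbounded real-valued action into the interval [low, high]
    return low + (high - low) * sigmoid(a_real)
```

Since sigmoid(0) = 0.5, an action of 0 maps to the midpoint of the Box, while large positive or negative real actions saturate at high or low, respectively.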
- close()¶
Closes the wrapper and env.
- render() RenderFrame | list[RenderFrame] | None ¶
Uses the render() of the env, which can be overwritten to change the returned data.
- reset(*, seed: int | None = None, options: dict[str, Any] | None = None) tuple[WrapperObsType, dict[str, Any]] ¶
Uses the reset() of the env, which can be overwritten to change the returned data.
- property action_space: spaces.Space[ActType] | spaces.Space[WrapperActType]¶
Return the Env action_space unless it is overwritten, in which case the wrapper action_space is used.
- property observation_space: spaces.Space[ObsType] | spaces.Space[WrapperObsType]¶
Return the Env observation_space unless it is overwritten, in which case the wrapper observation_space is used.
- property render_mode: str | None¶
Returns the Env render_mode.
- property reward_range: tuple[SupportsFloat, SupportsFloat]¶
Return the Env reward_range unless it is overwritten, in which case the wrapper reward_range is used.
- property unwrapped: Env[ObsType, ActType]¶
Returns the base environment of the wrapper.
This will be the bare gymnasium.Env environment, underneath all layers of wrappers.
- class coax.wrappers.BoxActionsToDiscrete(env, num_bins, random_seed=None)[source]¶
This wrapper splits a Box action space into bins. The resulting action space is either Discrete or MultiDiscrete, depending on the shape of the original action space.
- Parameters:
- close()¶
Closes the wrapper and env.
- render() RenderFrame | list[RenderFrame] | None ¶
Uses the render() of the env, which can be overwritten to change the returned data.
- reset(*, seed: int | None = None, options: dict[str, Any] | None = None) tuple[WrapperObsType, dict[str, Any]] ¶
Uses the reset() of the env, which can be overwritten to change the returned data.
- property action_space: spaces.Space[ActType] | spaces.Space[WrapperActType]¶
Return the Env action_space unless it is overwritten, in which case the wrapper action_space is used.
- property observation_space: spaces.Space[ObsType] | spaces.Space[WrapperObsType]¶
Return the Env observation_space unless it is overwritten, in which case the wrapper observation_space is used.
- property render_mode: str | None¶
Returns the Env render_mode.
- property reward_range: tuple[SupportsFloat, SupportsFloat]¶
Return the Env reward_range unless it is overwritten, in which case the wrapper reward_range is used.
- property unwrapped: Env[ObsType, ActType]¶
Returns the base environment of the wrapper.
This will be the bare gymnasium.Env environment, underneath all layers of wrappers.
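One way to picture the binning that BoxActionsToDiscrete performs is the reverse mapping: a discrete bin index is turned back into a continuous value inside the original Box. The bin-center mapping below is an illustrative assumption, not necessarily the exact scheme coax uses:

```python
def bin_center(index, num_bins, low, high):
    # map a bin index in {0, ..., num_bins - 1} to the center of that bin
    width = (high - low) / num_bins
    return low + (index + 0.5) * width
```

With num_bins bins per Box dimension, a one-dimensional Box becomes a Discrete(num_bins) space and a multi-dimensional Box becomes a MultiDiscrete space with num_bins options per dimension.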
- class coax.wrappers.MetaPolicyEnv(env, *arms)[source]¶
Wrap a gymnasium-style environment such that it may be used by a meta-policy, i.e. a bandit that selects a policy (an arm), which is then used to sample a lower-level action that is fed to the original environment. In other words, the actions that the step method expects are meta-actions, selecting different arms. The lower-level actions (and their log-propensities) that are sampled internally are stored in the info dict returned by the step method.
- Parameters:
env (gymnasium-style environment) – The original environment to be wrapped into a meta-policy env.
*arms (functions) – Callable objects that take a state observation \(s\) and return an action \(a\) (and, optionally, a log-propensity \(\log\pi(a|s)\)). See for example coax.Policy.__call__ or coax.Policy.mode.
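The control flow of a single meta-step can be sketched as a standalone toy (using the gymnasium five-tuple step signature; the function and info-key names here are assumptions, not MetaPolicyEnv's actual internals):

```python
def meta_policy_step(step_fn, arms, obs, a_meta):
    # the meta-action selects an arm; the arm maps the observation to a
    # lower-level action, which is fed to the underlying environment's step
    a_low = arms[a_meta](obs)
    next_obs, reward, terminated, truncated, info = step_fn(a_low)
    info = dict(info, a=a_low)  # expose the lower-level action via info
    return next_obs, reward, terminated, truncated, info
```

A bandit trained on this wrapper thus only ever sees the meta-action space Discrete(len(arms)), while the underlying environment still receives ordinary lower-level actions.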
- close()¶
Closes the wrapper and env.
- render() RenderFrame | list[RenderFrame] | None ¶
Uses the render() of the env, which can be overwritten to change the returned data.
- step(a_meta)[source]¶
Uses the step() of the env, which can be overwritten to change the returned data.
- property action_space: spaces.Space[ActType] | spaces.Space[WrapperActType]¶
Return the Env action_space unless it is overwritten, in which case the wrapper action_space is used.
- property observation_space: spaces.Space[ObsType] | spaces.Space[WrapperObsType]¶
Return the Env observation_space unless it is overwritten, in which case the wrapper observation_space is used.
- property render_mode: str | None¶
Returns the Env render_mode.
- property reward_range: tuple[SupportsFloat, SupportsFloat]¶
Return the Env reward_range unless it is overwritten, in which case the wrapper reward_range is used.
- property unwrapped: Env[ObsType, ActType]¶
Returns the base environment of the wrapper.
This will be the bare gymnasium.Env environment, underneath all layers of wrappers.