When the environment is initialized or reset, the assignment of agents to player positions in the environment is randomized. However, this operation applies only to the ordering of actions and observations in the `step` interface; it does not affect the environment info returned by the `step` interface, including shaped and sparse rewards. As a result, intuitive use of the interface will lead to misattribution of rewards 50% of the time.
The `agent_idx` flag is returned with the environment info, which seems like it's intended to give the user a workaround to this problem by allowing them to reinterpret the order of the environment info outside of the environment interface. However, this does not appear to be documented, so users have no way of knowing they need to do this. Users therefore have to reverse engineer source code to understand why their rewards are misattributed, if they catch the issue at all.
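For anyone hitting this in the meantime, here is a minimal sketch of the workaround as I understand it. It assumes two players and that `agent_idx` is the player position assigned to the first agent at reset. The helper `to_agent_order` and the info keys `shaped_r_by_player` and `sparse_r_by_player` are hypothetical names for illustration, not the library's actual API, so check the real key names and the meaning of the flag in the source before relying on this.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def to_agent_order(per_player: Sequence[T], agent_idx: int) -> List[T]:
    """Reorder a two-element, player-ordered sequence into agent order.

    Assumption: agent_idx is the player position (0 or 1) that agent 0
    was assigned at reset, and agent 1 holds the other position.
    """
    if len(per_player) != 2:
        raise ValueError("expected exactly two per-player values")
    if agent_idx not in (0, 1):
        raise ValueError("agent_idx must be 0 or 1")
    return [per_player[agent_idx], per_player[1 - agent_idx]]


# Hypothetical info dict, shaped like what a step call might return.
info = {
    "agent_idx": 1,
    "shaped_r_by_player": [3, 0],
    "sparse_r_by_player": [20, 0],
}

# Reinterpret the player-ordered rewards in agent order before logging
# them or crediting them to a policy.
shaped_by_agent = to_agent_order(info["shaped_r_by_player"], info["agent_idx"])
sparse_by_agent = to_agent_order(info["sparse_r_by_player"], info["agent_idx"])
```

With `agent_idx == 1` the two entries swap, so the reward earned at player position 0 is credited to agent 1; with `agent_idx == 0` the order is unchanged.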
Ideas for improving this:
- reorder the environment info, so that the shuffle operation applies to the entire `step` interface.
- introduce an environment parameter to enable/disable shuffling of agents, and issue a warning if using the shuffling feature while it does not apply completely to the `step` interface.
- write thorough docs/demos so users have a reasonable way to understand and avoid this pitfall.
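To make the first two ideas concrete, here is a rough sketch of how they could fit together. Everything in it is hypothetical: the `AgentShuffleWrapper` class, its `shuffle_agents` and `reorder_info` parameters, and the `shaped_r` / `sparse_r` info keys are placeholders, and it assumes a two-player environment whose `reset()` returns per-player observations and whose `step(actions)` returns `(obs, reward, done, info)` in player order. It is not how the library currently works.

```python
import random
import warnings


class AgentShuffleWrapper:
    """Hypothetical wrapper for a two-player env that works in player order.

    Assumes env.reset() returns [obs_p0, obs_p1] and env.step(actions)
    returns (obs, reward, done, info), with per-player lists stored in
    info under the keys named in per_player_keys.
    """

    def __init__(self, env, shuffle_agents=True, reorder_info=True,
                 per_player_keys=("shaped_r", "sparse_r"), rng=None):
        if shuffle_agents and not reorder_info:
            # Idea 2: warn when the shuffle only partly covers the interface.
            warnings.warn(
                "Agents are shuffled but info stays in player order; "
                "use info['agent_idx'] to attribute rewards.",
                UserWarning,
            )
        self.env = env
        self.shuffle_agents = shuffle_agents
        self.reorder_info = reorder_info
        self.per_player_keys = tuple(per_player_keys)
        self.rng = rng or random.Random()
        self.agent_idx = 0  # player position held by agent 0

    def _swap_order(self, pair):
        # For two players the mapping is its own inverse, so this converts
        # player order to agent order and agent order to player order.
        return [pair[self.agent_idx], pair[1 - self.agent_idx]]

    def reset(self):
        self.agent_idx = self.rng.randrange(2) if self.shuffle_agents else 0
        return self._swap_order(self.env.reset())

    def step(self, agent_actions):
        obs, reward, done, info = self.env.step(self._swap_order(agent_actions))
        info = dict(info, agent_idx=self.agent_idx)
        if self.reorder_info:
            # Idea 1: apply the shuffle to per-player info entries as well.
            for key in self.per_player_keys:
                if key in info:
                    info[key] = self._swap_order(info[key])
        return self._swap_order(obs), reward, done, info
```

Keeping `reorder_info` as an explicit parameter means existing code that already corrects for `agent_idx` could opt out of the reordering.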
Hi @bgiddens, thanks for pointing this out – you're totally right that it's tough to realize that this is happening under the hood right now.
> reorder the environment info, so that the shuffle operation applies to the entire `step` interface.
This would not be backwards compatible, so maybe this is the change that I'd be most worried about.
Regarding the other two solutions, both seem like good ideas. I currently don't have time to implement them, but given that there seems to be some interest in this issue, if anyone wants to take a stab at doing either of those (or both), I'll happily review a PR!