This is an MSc dissertation project at the University of Edinburgh under the supervision of Stefano Albrecht.
Multi-agent reinforcement learning has seen considerable achievements on a variety of tasks. However, suboptimal conditions involving sparse feedback and partial observability, as frequently encountered in applications, remain a significant challenge. In this thesis, we apply curiosity as exploration bonuses to such multi-agent systems and analyse their impact on a variety of cooperative and competitive tasks. In addition, we consider modified scenarios involving sparse rewards and partial observability to evaluate the influence of curiosity on these challenges.
We apply the independent Q-learning and state-of-the-art multi-agent deep deterministic policy gradient methods to these tasks with and without intrinsic rewards. Curiosity is defined using pseudo-counts of observations or relying on models to predict environment dynamics.
Our evaluation illustrates that intrinsic rewards can cause considerable instability in training without benefiting exploration. This outcome is observed on the original tasks and, against our expectation, under partial observability, where curiosity is unable to alleviate the introduced instability. However, curiosity leads to significantly improved stability and converged performance when applied to policy-gradient reinforcement learning with sparse rewards. While the sparsity causes training of such methods to be highly unstable, additional intrinsic rewards assist training and agents show the intended behaviour on most tasks.
This work contributes to understanding the impact of intrinsic rewards in challenging multi-agent reinforcement learning environments and serves as a foundation for further research.
More information can be found in the MSc dissertation.
- Python, version 3.7
- NumPy, version 1.15
- PyTorch, version 1.1.0
- Matplotlib, version 3.0.3
- OpenAI Gym, version 0.10.5
- Own fork of Multi-agent Particle Environments adding partial observability and stochasticity to four tasks
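Assuming pip is available, the Python dependencies can, for example, be installed with the command below; the fork of the multi-agent particle environments has to be obtained and installed separately from its repository:

python3 -m pip install numpy==1.15.0 torch==1.1.0 matplotlib==3.0.3 gym==0.10.5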
The general training structure is implemented in `train.py`; `mape_train.py` contains the specific code to train on the multi-agent particle environment. For information on all parameters, run
python3 mape_train.py --help
Similarly, the evaluation structure is implemented in `eval.py`, with the specific multi-agent particle environment evaluation found in `mape_eval.py`. For information on all parameters, run
python3 mape_eval.py --help
The training and evaluation scripts mostly share parameters. The major differences are that exploration is generally deactivated during evaluation, rendering is activated, and evaluation runs can be saved as animated gifs (`--save_gifs`).
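For example, a rendered evaluation run saved as an animated gif could be started with the command below; any further required parameters (e.g. the task or the trained model to load) depend on the setup, see --help:

python3 mape_eval.py --save_gifs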
As multi-agent reinforcement learning baselines, we implement the following approaches:
- Independent Q-learning (IQL) using deep Q-networks (DQNs)
- Multi-agent deep deterministic policy gradient (MADDPG)
Baseline implementations can be found in `marl_algorithms`, which also includes an episodic buffer (`marl_algorithms/buffer.py`) and an abstract MARL class (`marl_algorithms/marl_algorithms`). Detailed READMEs for IQL and MADDPG with references to papers and open-source implementations can be found in the respective subdirectories.
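Such an abstract class roughly standardises how algorithms select actions and update from sampled experience. The snippet below is only a hypothetical, minimal illustration of this kind of interface, not the actual code in `marl_algorithms`:

```python
from abc import ABC, abstractmethod


class MARLAlgorithm(ABC):
    """Hypothetical, minimal interface of a multi-agent RL algorithm."""

    def __init__(self, n_agents, obs_sizes, action_sizes):
        self.n_agents = n_agents
        self.obs_sizes = obs_sizes
        self.action_sizes = action_sizes

    @abstractmethod
    def step(self, observations, explore=True):
        """Select one action per agent for the given observations."""

    @abstractmethod
    def update(self, batch):
        """Update all agents from a batch sampled from the episodic buffer."""
```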
We implement three variations of intrinsic rewards as exploration bonuses, which can be found in `intrinsic_rewards` with an abstract intrinsic reward interface (`intrinsic_rewards/intrinsic_reward.py`).
Detailed, linked READMEs can be found in the respective subdirectories of each curiosity approach.
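As an illustration of the prediction-based variant (a minimal sketch, not the exact implementation found in `intrinsic_rewards`), curiosity can be computed as the prediction error of a learned forward model, and the resulting bonus is added to the extrinsic environment reward:

```python
import torch
import torch.nn as nn


class ForwardModelCuriosity(nn.Module):
    """Sketch: intrinsic reward as the prediction error of a forward model."""

    def __init__(self, obs_size, action_size, hidden_size=64, eta=0.5):
        super().__init__()
        self.eta = eta  # scaling factor of the intrinsic reward (illustrative)
        self.forward_model = nn.Sequential(
            nn.Linear(obs_size + action_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, obs_size),
        )

    def intrinsic_reward(self, obs, action, next_obs):
        # predict the next observation from the current observation and action
        pred_next_obs = self.forward_model(torch.cat([obs, action], dim=-1))
        # the squared prediction error serves as exploration bonus
        return self.eta * (pred_next_obs - next_obs).pow(2).mean(dim=-1)
```

During training, an agent would receive `r_total = r_extrinsic + intrinsic_reward(obs, action, next_obs)`, while the forward model itself is trained to minimise the same prediction error; count-based variants instead derive the bonus from (pseudo-)counts of visited observations.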
We evaluate our approaches on the multi-agent particle environment. Instead of using the original environment, we implemented a fork introducing partial observability and stochasticity to the following tasks:
- cooperative communication (`simple_speaker_listener`)
- cooperative navigation (`simple_spread`)
- physical deception (`simple_adversary`)
- predator-prey (`simple_tag`)
For more detail on the added partial observability and stochasticity, see the respective sections of the README in our fork.
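The fork is based on the original OpenAI multi-agent particle environments. Assuming it keeps the original `make_env` interface, a task can be instantiated roughly as follows (a sketch, not guaranteed to match the fork exactly):

```python
# Assumption: the fork keeps the make_env interface of the original
# multi-agent particle environments repository.
from make_env import make_env

env = make_env("simple_spread")  # cooperative navigation
observations = env.reset()       # list with one observation per agent
# env.step expects a list with one action per agent and returns lists of
# next observations, rewards, done flags and info dictionaries
```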
Experiment scripts can easily be generated using the `script_generation/script_generator.py` script with the respective environment name. At the moment, only the multi-agent particle environment is supported. Parameters for the jobscript generation can be chosen in code lines 152 to 164. Afterwards,
python3 script_generator.py mape
will generate a directory `mape` containing a subdirectory with jobscripts for each scenario, as well as a central jobscript `mape/script.sh` which executes all scripts consecutively. Hence, only this script has to be executed to run all generated jobs.
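Depending on the chosen parameters, the generated layout could look roughly as follows (the scenario subdirectories and jobscript file names are purely illustrative):

```
mape/
├── script.sh                 # executes all generated jobscripts consecutively
├── simple_spread/
│   ├── job_0.sh
│   └── ...
└── simple_speaker_listener/
    ├── job_0.sh
    └── ...
```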
@MastersThesis{lukas:thesis:2019,
  author = {Schäfer, Lukas},
  title  = {{Curiosity in Multi-Agent Reinforcement Learning}},
  school = {University of Edinburgh},
  year   = {2019},
}
Lukas Schäfer - [email protected]