Add Parallel Q-Networks algorithm (PQN) #472
base: master
Conversation
Hey Roger, it's really cool to see you adding PQN to CleanRL! I've read the paper before, and I think your implementation is great. When it comes time to run benchmarks or add documentation, let's collaborate to see how we can best do it. Looking forward to seeing the completed PR! 🚀👍

I noticed that the epsilon-greedy implementation in our current setup differs from the official one: in the official implementation, each environment independently performs epsilon-greedy exploration, whereas in ours, all environments share a single random number. This might have an impact when running many environments in parallel. Of course, there could be other reasons for the performance differences too. Let's start by running some benchmark tests to see if the performance also falls short in other environments. Looking forward to working through this together!
… some envs can explore and some exploit, like in the official implementation
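The per-environment exploration discussed above can be sketched as follows. This is a minimal NumPy illustration, not the PR's actual code; the function name and shapes are assumptions. The key point is that the explore/exploit coin is flipped once per environment rather than once for the whole batch:

```python
import numpy as np

def epsilon_greedy_per_env(q_values: np.ndarray, epsilon: float, rng) -> np.ndarray:
    """Per-environment epsilon-greedy: each of the N parallel envs draws its
    own random number, so some envs explore while others exploit.

    q_values: array of shape (num_envs, num_actions).
    """
    num_envs, num_actions = q_values.shape
    greedy = q_values.argmax(axis=1)                       # exploit action per env
    random_actions = rng.integers(num_actions, size=num_envs)
    explore = rng.random(num_envs) < epsilon               # independent draw per env
    return np.where(explore, random_actions, greedy)
```

By contrast, a single shared draw (`if rng.random() < epsilon: ...`) forces all environments to explore or all to exploit on the same step, which correlates exploration across the batch as the number of parallel environments grows.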
Very nice catch! Let me try to set up the benchmark experiments :)

Here are some first results!
Been watching this from afar, very cool work!!
Nice job! Your results show it takes 25 minutes for 10 million frames, while the paper reports 200 million frames in an hour. No equivalent to
Updated results here. I wonder how I should generate the comparison between DQN/PQN with the

@pseudo-rnd-thoughts It is probably because
Maybe try |
Hey! How do you think we should proceed? I believe it will be hard to match the speed of the JAX-based original implementation in this torch implementation, but at least it provides a Q-learning + envpool alternative that matches CleanRL's envpool PPO, which can already be very useful! :)
I realized I was re-computing the values for each state in the rollouts when computing

Also, I added

Please let me know how we should continue!
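The re-computation mentioned above concerns the bootstrap values used when building the rollout targets. A minimal sketch of λ-return targets of the kind PQN uses is shown below; the function name, shapes, and default hyperparameters are illustrative assumptions, not the PR's actual code. The important detail is that `max_q_next` is computed once for the whole rollout and then reused inside a backward recursion:

```python
import numpy as np

def lambda_returns(rewards, dones, max_q_next, gamma=0.99, lam=0.65):
    """Backward recursion for lambda-return targets over a rollout.

    rewards, dones: shape (T, num_envs).
    max_q_next: max_a Q(s_{t+1}, a), shape (T, num_envs), computed once
    for the whole rollout (no re-computation per step).
    """
    T = rewards.shape[0]
    returns = np.zeros_like(rewards)
    next_return = max_q_next[-1]  # bootstrap from the final step
    for t in reversed(range(T)):
        # Mix the one-step bootstrap with the longer lambda-return.
        target = (1.0 - lam) * max_q_next[t] + lam * next_return
        returns[t] = rewards[t] + gamma * (1.0 - dones[t]) * target
        next_return = returns[t]
    return returns
```

With `lam=0` this reduces to the one-step TD target `r_t + γ max_a Q(s_{t+1}, a)`, and with `lam=1` it approaches a Monte-Carlo return with a bootstrapped tail.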
@roger-creus There is a larger issue of EnvPool with rollouts and computing the loss function, see #475 |
Description
Adding PQN from "Simplifying Deep Temporal Difference Learning".
I have implemented both `pqn.py` and `pqn_atari_envpool.py`. The results are promising for the CartPole version; check them out here. I am now running some debugging experiments for the Atari version.

Some details about the implementations:
In both `pqn.py` and `dqn.py` in CartPole, I multiplied the rewards from the environment by 0.1, as done in the official implementation of PQN. Performance increases for both algorithms.

Overall the implementation is similar to PPO with envpool (so very fast!) but with the sample-efficiency of Q-learning! Nice algorithm! :)
Let me know how to proceed!
Types of changes

Checklist:

- [ ] `pre-commit run --all-files` passes (required).
- [ ] I have updated the documentation and previewed the changes via `mkdocs serve`.

If you need to run benchmark experiments for a performance-impacting change:

- [ ] I have used the benchmark utility to submit the tracked experiments, optionally with `--capture_video`.
- [ ] I have performed RLops with `python -m openrlbenchmark.rlops`.
- [ ] I have added the learning curves generated by the `python -m openrlbenchmark.rlops` utility to the documentation.
- [ ] I have added links to the tracked experiments, generated by `python -m openrlbenchmark.rlops ....your_args... --report`, to the documentation.