Add Parallel Q-Networks algorithm (PQN) #472

Open
wants to merge 6 commits into master
Conversation

@roger-creus commented Jul 17, 2024

Description

Adding PQN from Simplifying Deep Temporal Difference Learning

I have implemented both pqn.py and pqn_atari_envpool.py. The results are promising for the CartPole version. Check them out here. I am now running some debugging experiments for the Atari version.

Some details about the implementations:

  • Both use envpool
  • Hyperparameters try to match the configs from the official implementation, but some are changed (the epsilon-decay schedule matches the DQN implementation from CleanRL; I haven't checked the importance of this hyperparameter, but the CleanRL defaults made more sense to me).
  • For comparing pqn.py and dqn.py on CartPole, I multiplied the rewards from the environment by 0.1, as done in the official implementation of PQN. Performance increases for both algorithms.
  • Using LayerNorm in the networks instead of letting the user choose between LayerNorm and BatchNorm; LayerNorm should work better (see the sketch after this list).
  • Not giving the user the option to add BatchNorm to the network inputs (i.e., the states), as in the official implementation.
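
For reference, here is a minimal sketch of the kind of LayerNorm MLP Q-network pqn.py builds for CartPole (the class name and layer sizes are illustrative, not the exact ones in the file):

```python
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """MLP Q-network with LayerNorm after every hidden linear layer."""

    def __init__(self, obs_dim: int, n_actions: int, hidden_dim: int = 120):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),  # normalization is the key ingredient PQN adds over vanilla DQN
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),  # one Q-value per discrete action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.network(x)
```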

Overall, the implementation is similar to PPO with envpool (so very fast!) but with the sample efficiency of Q-learning. Nice algorithm! :)

Let me know how to proceed!

Types of changes

  • Bug fix
  • New feature
  • New algorithm
  • Documentation

Checklist:

  • I've read the CONTRIBUTION guide (required).
  • I have ensured pre-commit run --all-files passes (required).
  • I have updated the tests accordingly (if applicable).
  • I have updated the documentation and previewed the changes via mkdocs serve.
    • I have explained note-worthy implementation details.
    • I have explained the logged metrics.
    • I have added links to the original paper and related papers.

If you need to run benchmark experiments for performance-impacting changes:

  • I have contacted @vwxyzjn to obtain access to the openrlbenchmark W&B team.
  • I have used the benchmark utility to submit the tracked experiments to the openrlbenchmark/cleanrl W&B project, optionally with --capture_video.
  • I have performed RLops with python -m openrlbenchmark.rlops.
    • For new feature or bug fix:
      • I have used the RLops utility to understand the performance impact of the changes and confirmed there is no regression.
    • For new algorithm:
      • I have created a table comparing my results against those from reputable sources (i.e., the original paper or other reference implementation).
    • I have added the learning curves generated by the python -m openrlbenchmark.rlops utility to the documentation.
    • I have added links to the tracked experiments in W&B, generated by python -m openrlbenchmark.rlops ....your_args... --report, to the documentation.


@sdpkjc (Collaborator) commented Jul 17, 2024

Hey Roger, it's really cool to see you adding PQN to CleanRL! I've read the paper before, and I think your implementation is great. When it comes time to run benchmarks or add documentation, let's collaborate to see how we can best do it. Looking forward to seeing the completed PR! 🚀👍

@roger-creus (Author) commented Jul 17, 2024

I think the code might be ready to be benchmarked. These are some results in Breakout. It seems to converge to a score of 400 in 10M steps, which would match DQN. The official implementation reports a score of 515 after 400M steps. Should I be added to the openrlbenchmark W&B team?

[image: Breakout training curve]

@sdpkjc (Collaborator) commented Jul 18, 2024

I noticed that the epsilon-greedy implementation in our current setup differs from the official one: there, each environment independently performs epsilon-greedy exploration, whereas in our implementation all environments share a single random number. This might have an impact when running many environments in parallel. Of course, there could be other reasons for the performance differences too. Let's start by running some benchmark tests to see if the performance also falls short in other environments. Looking forward to working through this together!

https://github.com/mttga/purejaxql/blob/9878a74439593c5d0acc8e506fefc44daa230c51/purejaxql/pqn_atari.py#L312-L325
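
Something along these lines would let each env draw its own random number (a rough PyTorch sketch, not the exact code in this PR):

```python
import torch


def epsilon_greedy_per_env(q_values: torch.Tensor, epsilon: float) -> torch.Tensor:
    """q_values has shape (num_envs, n_actions); returns one action per env.

    Each environment draws its own uniform sample, so at any step some envs
    can explore while others exploit, as in the official implementation.
    """
    num_envs, n_actions = q_values.shape
    greedy_actions = q_values.argmax(dim=1)
    random_actions = torch.randint(n_actions, (num_envs,), device=q_values.device)
    explore = torch.rand(num_envs, device=q_values.device) < epsilon  # one draw per env
    return torch.where(explore, random_actions, greedy_actions)
```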

… some envs can explore and some exploit, like in the official implementation
@roger-creus (Author)

Very nice catch! Let me try to set up the benchmark experiments :)

@roger-creus (Author)

Here are some first results!
I think they look pretty good, but maybe in BeamRider-v5 it is falling a bit short. Let me double-check the implementation and run some more experiments.

@vwxyzjn (Owner) commented Jul 19, 2024

Been watching this from afar, very cool work!!

@pseudo-rnd-thoughts (Collaborator) commented Jul 19, 2024

Nice job! Your results show it takes 25 minutes for 10 million frames, while the paper reports 200 million frames in an hour.
Do you know why there is such a significant difference in speed?

No equivalent to jax.jit or jax.lax.scan?

@roger-creus (Author)

Updated results here. I wonder how I should generate the DQN/PQN comparison with the rlops function, since I am using envpool and I am not able to compare pqn_atari_envpool.py on Breakout-v5 against dqn_atari.py on BreakoutNoFrameskip-v4, for instance. Should I make a version of PQN that doesn't use envpool?

@pseudo-rnd-thoughts It is probably because of jax.lax.scan. I am not used to coding in JAX, but after searching online, it seems PyTorch does not have a function like scan...

@sdpkjc (Collaborator) commented Jul 20, 2024

Maybe try torch.compile?
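
For example, something like this could compile the hot path (a rough sketch with hypothetical names, not the PR's actual training loop):

```python
import torch
import torch.nn.functional as F


# Hypothetical update step; q_network, optimizer, and the loss stand in for
# the PR's actual training code.
def update_step(q_network, optimizer, obs, actions, returns):
    q_values = q_network(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_values, returns)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss


# Compiling the update step may recover part of the gap to jax.jit, although
# PyTorch has no direct analogue of jax.lax.scan for the rollout loop itself.
compiled_update_step = torch.compile(update_step)
```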

@roger-creus (Author)

Hey! How do you think we should proceed?

I believe it will be hard to match the speed of the JAX-based original implementation with this torch implementation, but at least it provides a Q-learning alternative with envpool that matches CleanRL's envpool PPO, which can already be very useful! :)

@roger-creus (Author)

I realized I was re-computing the Q-values for each state in the rollouts when computing the Q(lambda) returns. I now use a values buffer (as the PPO implementations do, actually) and reuse the stored values in the Q(lambda) computation. Performance remains the same and the code is now approximately 150% faster.
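
For reference, the backward recursion now looks roughly like this (tensor names and default values are illustrative, not the exact ones in pqn.py):

```python
import torch


@torch.no_grad()
def compute_q_lambda_returns(rewards, next_dones, next_max_q, gamma=0.99, q_lambda=0.65):
    """Q(lambda) returns from values stored during the rollout (nothing is re-computed).

    rewards, next_dones, next_max_q: shape (num_steps, num_envs).
    next_max_q[t] holds max_a Q(s_{t+1}, a) recorded while collecting the rollout;
    next_dones[t] is 1.0 if the episode terminated at step t.
    """
    num_steps = rewards.shape[0]
    returns = torch.zeros_like(rewards)
    # The recursion bottoms out with a one-step bootstrap at the last step.
    returns[-1] = rewards[-1] + gamma * (1.0 - next_dones[-1]) * next_max_q[-1]
    for t in reversed(range(num_steps - 1)):
        # G_t = r_t + gamma * [(1 - lambda) * max_a Q(s_{t+1}, a) + lambda * G_{t+1}]
        returns[t] = rewards[t] + gamma * (1.0 - next_dones[t]) * (
            (1 - q_lambda) * next_max_q[t] + q_lambda * returns[t + 1]
        )
    return returns
```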

Also, I added pqn_atari_lstm_envpool! First results in the Atari environments show the implementation is correct. Please double-check! :)

Please let me know how we should continue!

@pseudo-rnd-thoughts (Collaborator)

@roger-creus There is a larger issue with EnvPool rollouts and computing the loss function; see #475.
