Stuck when training in MsPacman-v0 #31
I tried `python3.6 main.py --env Pong-v0 --workers 32` and the problem is still the same. My environment is PyTorch 0.4.1.
Did you try training longer? I was able to train Pong-v0 in about 20 minutes with a GTX 1080 Ti, but it took hours on a CPU. By default it does not use the GPU; use the `--gpu-ids` argument to set the GPU IDs.
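For example, to keep your 32-worker run but route the model onto one GPU, an invocation might look like this (assuming GPU 0 exists on your machine; only the `--gpu-ids` flag is taken from the comment above, the rest mirrors the command you already ran):

```
python main.py --env Pong-v0 --workers 32 --gpu-ids 0
```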
It's weird that MsPacman-v0 gets stuck. I found the reason could be that some of my worker processes become zombie processes (maybe because of a lack of CPU resources). Only 2 processes were still working and updating the neural network.
I've seen similar behavior when I ran out of memory; maybe you need to reduce the number of workers? Training MsPacman-v0 works for me; I used the command `python main.py --env MsPacman-v0 --workers 7`.
I guess you can also try to compensate for the smaller number of workers with a smaller learning rate.
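A hypothetical invocation along those lines (this assumes the learning rate is exposed as an `--lr` flag, consistent with the `lr=1e-4` setting quoted later in this thread; the value 5e-5 is only an illustration, not a recommendation from the repo):

```
python main.py --env MsPacman-v0 --workers 7 --lr 5e-5
```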
7 workers is quite a small number of workers and will hinder performance, and 4 mins of training with a low number of workers is hardly enough time to see real improvement, especially as the v0 environments are noticeably more challenging than the more commonly used versions. If you have access to a setup with, say, one GPU and 8 CPU cores with hyperthreading, you will get much better performance in terms of speed. In such a setup, using 16 workers with the A3G version, you should see scores of 15,000-20,000 in less than 12 hrs. Using 7 workers is probably hindering adequate exploration; reducing the learning rate may help in that regard, but if you have the resources to adequately support it, I really suggest using at minimum 16 workers, especially for the v0 environments. A reduction in the tau variable would also help with exploration; tau is the lambda variable in the generalized advantage estimation.
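For reference, here is a minimal sketch of how tau plays the role of lambda in generalized advantage estimation (GAE). It is illustrative only; the function and variable names are not taken from this repo:

```python
import torch

def gae(rewards, values, gamma=0.99, tau=0.92):
    # rewards: list of T scalar rewards
    # values:  list of T + 1 value estimates (last entry is the bootstrap value)
    # tau:     the GAE lambda; smaller values shorten the credit-assignment horizon
    advantages = torch.zeros(len(rewards))
    running_gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        running_gae = delta + gamma * tau * running_gae         # lambda-weighted sum of deltas
        advantages[t] = running_gae
    return advantages

# Toy example: 3-step rollout with a bootstrap value of 0.5
print(gae([1.0, 0.0, 1.0], [0.2, 0.4, 0.3, 0.5]))
```

Lowering tau makes the lambda-weighted sum decay faster, so each advantage leans more on nearby rewards, which is the exploration effect referred to above.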
MsPacman-v0 is fairly hard even on modern hardware; reaching a 5000 score occasionally took me 12 hours (the 100-episode moving-window average was 3858) with 36 agents, 12 cores, 3x 1080 Ti Turbos, and Adam with --amsgrad True. Do you have any parameter suggestions to speed up convergence? I used the following parameters: lr=1e-4, gamma=0.99, tau=0.92, num_steps=20, max_episode_len=10000. Would it make sense to implement a Huber (smooth L1) loss? How would I test this on your code base?
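One low-effort way to try that (a sketch under assumptions, not the repo's actual training code; the 0.5 * squared-error form is the usual A3C value loss and the variable names here are made up) is to swap the squared-error value loss for PyTorch's built-in smooth L1 (Huber) loss:

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for one worker's critic output and n-step return
value = torch.tensor([2.0], requires_grad=True)  # V(s_t) from the critic head
ret = torch.tensor([3.5])                        # bootstrapped n-step return R

# Usual A3C-style value loss: 0.5 * squared error
mse_value_loss = 0.5 * (ret - value).pow(2).sum()

# Huber / smooth L1 alternative: quadratic near zero, linear for large errors,
# so outlier returns produce bounded gradients on the critic
huber_value_loss = F.smooth_l1_loss(value, ret)

huber_value_loss.backward()   # gradients flow into the critic as usual
print(mse_value_loss.item(), huber_value_loss.item())
```

Whether this actually speeds up convergence on MsPacman-v0 would need an A/B run; the Huber form mainly helps when the return targets are noisy or heavy-tailed.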
Hi @dgriff777. Thank you for your repo; it's great that it can achieve such a high score. But I ran into a problem when I tried to apply it to MsPacman-v0.
I simply used this command:
`python main.py --env MsPacman-v0 --workers 7`
Then I got test scores like this:
The test score is always 70, and it seems that the agent chooses the same path every time and stops in a corner.
Could you tell me how you trained the model to get the 6323.01 ± 116.91 score on MsPacman-v0? Are there any other parameters I should set?