A simple attempt at distributed training #22

Open
Bobingstern opened this issue Dec 5, 2022 · 2 comments
Comments

@Bobingstern

Bobingstern commented Dec 5, 2022

Would it be possible to have multiple machines all generate training games over the same network and send them to a "master" machine? The master would train a new version of the model on that data, then send the new model (after applying gating, if enabled) back to the other machines, and the cycle would start over. I'm 99% sure I can implement this fairly easily using sockets, but I need to know a few things:

  • How are training samples stored and sampled for training?
  • How can I "merge" training samples from multiple files into the three base .pkl files (data, policy, value)?
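To make the second question concrete, here is a minimal sketch of what such a merge could look like. It is not kevaday's AlphaZero code: the file names `data.pkl`, `policy.pkl`, and `value.pkl`, the per-worker directory layout, and the assumption that each file holds a pickled list whose indices line up across the three files are all hypothetical, and the real storage format may differ.

```python
import pickle
from pathlib import Path

# Hypothetical file names; the project's actual names may differ.
SAMPLE_FILES = ("data.pkl", "policy.pkl", "value.pkl")


def load_pickle(path):
    """Load one pickled object from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)


def merge_sample_files(worker_dirs, out_dir, names=SAMPLE_FILES):
    """Concatenate per-worker sample lists into one file per name.

    Assumes each file is a pickled sequence and that index i in every
    file of a worker refers to the same training sample.
    Returns the total number of merged samples.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    merged = {name: [] for name in names}

    for worker_dir in worker_dirs:
        parts = [list(load_pickle(Path(worker_dir) / name)) for name in names]
        # Refuse to merge a worker whose files are out of sync.
        if len({len(part) for part in parts}) != 1:
            raise ValueError(f"misaligned sample files in {worker_dir}")
        for name, part in zip(names, parts):
            merged[name].extend(part)

    for name, samples in merged.items():
        with open(out_dir / name, "wb") as f:
            pickle.dump(samples, f)
    return len(merged[names[0]])
```

Because every worker's samples are appended in the same order to all three files, the alignment between positions, policies, and values is preserved. Note that unpickling executes arbitrary code, so this is only reasonable when every worker machine is trusted.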

I am reluctant to use Ray for this, since it sounds like overkill for a simple task of generation and file transfer. Scheduling would be very straightforward: count how many samples have been transferred and, once they reach a threshold of, say, 1 million, tell the other machines to stop generating and start training on the master machine. After that, send the (gated) model back and have the machines run baseline testing if needed.
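The threshold-based scheduling described above can be sketched as a small state machine on the master. This is only an illustration under stated assumptions: the `Coordinator` class, its method names, and the choice to carry late batches over to the next round are all hypothetical, and the socket transfer itself is left out.

```python
from dataclasses import dataclass


@dataclass
class Coordinator:
    """Tracks received samples and decides when the master should train."""

    threshold: int = 1_000_000   # samples needed before training starts
    received: int = 0            # samples counted toward the current round
    carried_over: int = 0        # samples that arrived while training
    phase: str = "generate"      # either "generate" or "train"

    def on_samples(self, count: int) -> str:
        """Record a batch from a worker and return the current phase."""
        if self.phase == "train":
            # Assumption: late batches are kept for the next round.
            self.carried_over += count
        else:
            self.received += count
            if self.received >= self.threshold:
                self.phase = "train"
        return self.phase

    def on_training_done(self) -> str:
        """Start the next round once the (gated) model has been sent out."""
        self.received, self.carried_over = self.carried_over, 0
        self.phase = "train" if self.received >= self.threshold else "generate"
        return self.phase
```

A worker that gets `"train"` back from `on_samples` would stop generating and wait for the new model; once the master has finished training and gating, it calls `on_training_done` and the workers resume.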

I think this is basically what Lc0 does, but with distributed training as well, which would probably need Ray.

@GrandVoid

Hey, I don't really have an answer, but a question instead. What do I do when this happens?

| _ | |
|_ |_ |_| v0.29.0 built Dec 13 2022
id name Lc0 v0.29.0
id author The LCZero Authors.
Loading weights file from: client-cache\46479f32cf5cf7bde9dfde78b7a33be50c3179a2497ebf518b746fb9b13fea62
Unhandled exception: Invalid weight file: lc0 version >= 0.30.0 required.
2023/05/29 17:20:20 lc0_main.go:732: GameInfo channel closed, exiting train loop
2023/05/29 17:20:20 lc0_main.go:749: Waiting for lc0 to stop
lc0 exited with: exit status 3221226505
2023/05/29 17:20:20 lc0_main.go:754: lc0 stopped
2023/05/29 17:20:20 lc0_main.go:756: Waiting for uploads to complete
2023/05/29 17:20:20 lc0_main.go:1255: Client self-exited without producing any games.
2023/05/29 17:20:20 lc0_main.go:1256: Sleeping for 30 seconds...

@Bobingstern
Author

You should ask on the LCZero Discord server. It makes no sense to ask here, since this is not related to kevaday's AlphaZero.
