A simple attempt at distributed training #22

Bobingstern · 2022-12-05T13:50:42Z

Would it be possible to have multiple machines all generate training games over the same network and send those games to a "master" machine, which would train a new version of the model on that data and then send the new model (after applying gating, if enabled) back to the other machines so the cycle can start over? I'm 99% sure I can implement this fairly easily using sockets, but I need to know a few things.

How are training samples stored and sampled for training?
How can I "merge" training samples from multiple files into the three base .pkl files (data, policy, value)?
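To make the second question concrete, here is a rough sketch of the kind of merge I have in mind. It is only a guess at the layout: it assumes each of the three files holds a plain Python list with one entry per sample, aligned by index across data, policy and value. The file stems in `NAMES` and both function names are made up for illustration and are not taken from the repo.

```python
import pickle
from pathlib import Path

# Assumed file stems; the real names in the repo may differ.
NAMES = ("data", "policy", "value")


def load_triplet(folder):
    """Load data/policy/value lists from one folder and check they line up."""
    out = {}
    for name in NAMES:
        with open(Path(folder) / f"{name}.pkl", "rb") as f:
            out[name] = list(pickle.load(f))
    if len({len(v) for v in out.values()}) != 1:
        raise ValueError(f"sample counts differ between files in {folder}")
    return out


def merge_into_base(base_dir, worker_dirs):
    """Append every worker's samples to the base files; return the new total."""
    merged = load_triplet(base_dir)
    for worker_dir in worker_dirs:
        extra = load_triplet(worker_dir)
        for name in NAMES:
            merged[name].extend(extra[name])
    for name in NAMES:
        with open(Path(base_dir) / f"{name}.pkl", "wb") as f:
            pickle.dump(merged[name], f)
    return len(merged["data"])
```

If the files actually hold NumPy arrays or some other container, the `extend` step would need to change accordingly.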

I am reluctant to use Ray for this, since it sounds like overkill for a simple task of generation and file transfer. Scheduling would be very straightforward: count how many samples have been transferred and, once they reach a threshold of, say, 1 million, tell the other machines to stop generating and start training on the master machine. After that, send the (gated) model back and have the machines run baseline testing if needed.
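The counting part could be as small as the sketch below. `SampleScheduler` and its fields are hypothetical names for illustration; the only logic is "add up incoming batch sizes and report when the threshold is hit".

```python
from dataclasses import dataclass, field


@dataclass
class SampleScheduler:
    """Hypothetical master-side counter for samples received from workers."""
    threshold: int = 1_000_000
    received: int = 0
    per_worker: dict = field(default_factory=dict)

    def record(self, worker_id, n_samples):
        """Count one batch; return True once the threshold has been reached."""
        self.received += n_samples
        self.per_worker[worker_id] = self.per_worker.get(worker_id, 0) + n_samples
        return self.received >= self.threshold

    def reset(self):
        """Start a new generation round after training has finished."""
        self.received = 0
        self.per_worker.clear()
```

The master would call `record` for every batch it receives, tell the workers to stop as soon as it returns `True`, train, and then call `reset` before sending the new model out.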

I think this is basically what Lc0 does, but with distributed training as well, which would probably need Ray.
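On the socket side, the transfer itself would probably need nothing more than length-prefixed messages. A minimal sketch using only the standard library, assuming all machines are trusted (unpickling data from an untrusted peer can execute arbitrary code); `send_msg` and `recv_msg` are names I made up:

```python
import pickle
import socket
import struct

# 8-byte big-endian length header placed in front of every payload.
HEADER = struct.Struct("!Q")


def send_msg(sock, obj):
    """Pickle obj and send it as one length-prefixed message."""
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    sock.sendall(HEADER.pack(len(payload)) + payload)


def _recv_exact(sock, n):
    """Read exactly n bytes, since recv() may return fewer than requested."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed in the middle of a message")
        buf.extend(chunk)
    return bytes(buf)


def recv_msg(sock):
    """Receive one length-prefixed message and unpickle it."""
    (length,) = HEADER.unpack(_recv_exact(sock, HEADER.size))
    return pickle.loads(_recv_exact(sock, length))
```

Workers would call `send_msg` with each batch of games, and the master would use the same pair of functions to push the new model's weights back.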

GrandVoid · 2023-05-29T15:22:24Z

Hey, I don't really have an answer, but a question instead. What do I do when this happens?

| _ | |
|_ |_ |_| v0.29.0 built Dec 13 2022
id name Lc0 v0.29.0
id author The LCZero Authors.
Loading weights file from: client-cache\46479f32cf5cf7bde9dfde78b7a33be50c3179a2497ebf518b746fb9b13fea62
Unhandled exception: Invalid weight file: lc0 version >= 0.30.0 required.
2023/05/29 17:20:20 lc0_main.go:732: GameInfo channel closed, exiting train loop
2023/05/29 17:20:20 lc0_main.go:749: Waiting for lc0 to stop
lc0 exited with: exit status 3221226505
2023/05/29 17:20:20 lc0_main.go:754: lc0 stopped
2023/05/29 17:20:20 lc0_main.go:756: Waiting for uploads to complete
2023/05/29 17:20:20 lc0_main.go:1255: Client self-exited without producing any games.
2023/05/29 17:20:20 lc0_main.go:1256: Sleeping for 30 seconds...

Bobingstern · 2023-05-29T16:32:16Z

You should ask on the LCZero Discord server. It makes no sense to ask here, since this is not related to kevaday's AlphaZero.
