Would it be possible to have multiple machines on the same network all generate training games and send them to a "master" machine, which trains a new version of the model on that data and then sends the new model (after applying gating, if enabled) back to the other machines so the cycle can start over? I'm 99% sure I can implement this fairly easily using sockets, but I need to know a few things:
How are training samples stored and sampled for training?
How can I "merge" training samples from multiple files into the 3 base .pkl files (data, policy, value)?
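To make the second question concrete, here is a minimal sketch of the kind of merge being asked about. It assumes, without confirmation from the project, that each of the three .pkl files holds a sequence with one entry per sample and that the three sequences are aligned by index. The helper names `load_pickle` and `merge_sample_files` and the directory layout are hypothetical.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Load one pickle file. Only unpickle files from machines you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)


def merge_sample_files(base_dir, worker_dirs, names=("data", "policy", "value")):
    """Append each worker's samples to the base files and return the new total.

    ASSUMPTION: every <name>.pkl is a sequence with one entry per sample,
    and the three files are aligned by index. The real layout may differ.
    """
    base_dir = Path(base_dir)
    merged = {}
    for name in names:
        base_path = base_dir / f"{name}.pkl"
        merged[name] = list(load_pickle(base_path)) if base_path.exists() else []

    for worker_dir in worker_dirs:
        parts = {n: list(load_pickle(Path(worker_dir) / f"{n}.pkl")) for n in names}
        # Refuse to merge a batch whose three files disagree on sample count.
        if len({len(p) for p in parts.values()}) != 1:
            raise ValueError(f"misaligned sample files in {worker_dir}")
        for name in names:
            merged[name].extend(parts[name])

    for name in names:
        with open(base_dir / f"{name}.pkl", "wb") as f:
            pickle.dump(merged[name], f)
    return len(merged[names[0]])
```

If the samples are actually stored as arrays or in some other container, the same shape of loop would apply, but the concatenation step would need to change accordingly.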
I am reluctant to use Ray for this, since it sounds like overkill for a simple task of generation and file transfer. Scheduling would be very straightforward: count how many samples have been transferred and, once they reach a threshold of, say, 1 million, tell the other machines to stop generating and start training on the master machine. After that, send the (gated) model back and have the machines run baseline testing if needed.
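The threshold-based scheduling described above could be sketched roughly as follows. The class name `GenerationRound`, its methods, and the worker IDs are hypothetical and only illustrate the bookkeeping on the master machine; the networking is left out.

```python
from dataclasses import dataclass, field


@dataclass
class GenerationRound:
    """Tracks how many samples each worker has sent in the current round."""

    threshold: int = 1_000_000  # the "say 1 million" figure from the description
    received: dict = field(default_factory=dict)

    def record(self, worker_id: str, n_samples: int) -> bool:
        """Add a batch from one worker; return True once training should start."""
        self.received[worker_id] = self.received.get(worker_id, 0) + n_samples
        return self.should_train()

    def total(self) -> int:
        return sum(self.received.values())

    def should_train(self) -> bool:
        return self.total() >= self.threshold

    def reset(self) -> None:
        """Start a new round after the (gated) model has been sent back."""
        self.received.clear()
```

The master would call `record` each time a batch arrives and, when it returns `True`, tell the workers to stop generating, train, and then call `reset` before the next round.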
I think this is basically what Lc0 does, but with distributed training as well, which would probably need Ray.
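For the socket-based transfer itself, one common approach is length-prefixed framing, so the receiver knows where each file or model blob ends. The sketch below uses only Python's standard `socket` and `struct` modules; the function names `send_blob` and `recv_blob` and the 8-byte header are illustrative choices, not anything defined by this project.

```python
import socket
import struct

# 8-byte unsigned big-endian length header placed before every payload.
_HEADER = struct.Struct("!Q")


def send_blob(sock: socket.socket, payload: bytes) -> None:
    """Send one payload (e.g. a serialized batch of games or a model file)."""
    sock.sendall(_HEADER.pack(len(payload)) + payload)


def _recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, since recv() may return fewer than requested."""
    chunks = []
    remaining = n
    while remaining:
        chunk = sock.recv(min(remaining, 65536))
        if not chunk:
            raise ConnectionError("socket closed before the full message arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_blob(sock: socket.socket) -> bytes:
    """Receive one payload previously sent with send_blob."""
    (length,) = _HEADER.unpack(_recv_exact(sock, _HEADER.size))
    return _recv_exact(sock, length)
```

A worker would call `send_blob` with the bytes of each generated file, and the master would call `recv_blob` in a loop per connection. This is only reasonable between machines you control, especially if the payloads are later unpickled.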
Hey, I don't really have an answer, but a question instead: what do I do when this happens?
| _ | |
|_ |_ |_| v0.29.0 built Dec 13 2022
id name Lc0 v0.29.0
id author The LCZero Authors.
Loading weights file from: client-cache\46479f32cf5cf7bde9dfde78b7a33be50c3179a2497ebf518b746fb9b13fea62
Unhandled exception: Invalid weight file: lc0 version >= 0.30.0 required.
2023/05/29 17:20:20 lc0_main.go:732: GameInfo channel closed, exiting train loop
2023/05/29 17:20:20 lc0_main.go:749: Waiting for lc0 to stop
lc0 exited with: exit status 3221226505
2023/05/29 17:20:20 lc0_main.go:754: lc0 stopped
2023/05/29 17:20:20 lc0_main.go:756: Waiting for uploads to complete
2023/05/29 17:20:20 lc0_main.go:1255: Client self-exited without producing any games.
2023/05/29 17:20:20 lc0_main.go:1256: Sleeping for 30 seconds...