Fix the issue that len(indices) and num_samples might not be equal #1339
base: master
…en importing a checkpoint to resume training
Please add a UT to cover the issue mentioned.
Since I adjusted the logic for splitting indices after loading the checkpoint, I corrected the num values in the UT file.
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1339      +/-   ##
==========================================
- Coverage   80.48%   79.82%   -0.66%
==========================================
  Files         219      240      +21
  Lines       20208    22578    +2370
==========================================
+ Hits        16264    18023    +1759
- Misses       3944     4555     +611

☔ View full report in Codecov by Sentry.
@BalaBalaYi Hi, the code formatting fixes have been submitted.
What changes were proposed in this pull request?
Modified the definition of `total_size` in the `load_state_dict` function and the definition of `indices` in the `__iter__` function to ensure that `assert len(indices) == self.num_samples` holds.

Why are the changes needed?
The previous ElasticDistributedSampler code did not consider the case where `completed_num` is not divisible by `num_replicas`, which could cause an `AssertionError` when importing a checkpoint.

Does this PR introduce any user-facing change?
No.
How was this patch tested?
UT and training test.
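
The divisibility problem described under "Why are the changes needed?" can be reproduced with a small, self-contained sketch. This is not the ElasticDistributedSampler code: `naive_shard` and `padded_shard` are hypothetical helpers, and padding by wrapping around the remaining indices is an assumption made for illustration. The sketch only shows why a strided split that starts at `completed_num + rank` can yield a different number of indices per rank than the expected `num_samples`, and one way to make the lengths agree.

```python
import math


def naive_shard(dataset_size, num_replicas, rank, completed_num):
    """Hypothetical resume logic: skip completed samples, then stride by rank.

    Returns (shard, num_samples); the two can disagree in length when the
    remaining sample count is not divisible by num_replicas.
    """
    indices = list(range(dataset_size))
    remaining = dataset_size - completed_num
    num_samples = math.ceil(remaining / num_replicas)
    shard = indices[completed_num + rank::num_replicas]
    return shard, num_samples


def padded_shard(dataset_size, num_replicas, rank, completed_num):
    """Assumed fix: pad the remaining indices to a multiple of num_replicas
    (by wrapping around) before striding, so every rank gets num_samples."""
    remaining = list(range(completed_num, dataset_size))
    num_samples = math.ceil(len(remaining) / num_replicas)
    total_size = num_samples * num_replicas
    padding = total_size - len(remaining)
    if padding > 0:
        # Repeat the remaining indices as often as needed to fill the gap.
        repeats = math.ceil(padding / len(remaining))
        remaining += (remaining * repeats)[:padding]
    shard = remaining[rank:total_size:num_replicas]
    return shard, num_samples


if __name__ == "__main__":
    # 10 samples, 2 replicas, 3 samples already consumed before the checkpoint.
    for rank in range(2):
        naive, expected = naive_shard(10, 2, rank, 3)
        fixed, _ = padded_shard(10, 2, rank, 3)
        print(rank, expected, len(naive), len(fixed))
```

With 10 samples, 2 replicas, and `completed_num = 3`, the naive split gives rank 1 only three indices while `num_samples` is four, which is the kind of mismatch an `assert len(indices) == self.num_samples` check would reject. The padded variant returns four indices on both ranks.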