Faster cleanup of sharded datasets #367
Merged
Description
Builds upon #321. Previously I used `dataset.save_to_disk` to write the final dataset, but this rewrites the entire dataset to disk, which is very slow. Instead, I manually move the shards into the standard HF format, which lets us avoid re-saving the entire dataset.
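A minimal sketch of the idea (not the exact PR code), assuming each shard was already written once with `shard.save_to_disk(...)` and that the standard HF on-disk layout is a set of `data-XXXXX-of-XXXXX.arrow` files alongside `dataset_info.json` and `state.json`:

```python
# Illustrative sketch only: consolidate already-saved shards by moving their
# Arrow files instead of calling dataset.save_to_disk on the full dataset.
# The state.json handling below reflects my understanding of the HF layout
# and may differ from the actual implementation in this PR.
import json
import shutil
from pathlib import Path


def consolidate_shards(shard_dirs: list[Path], final_dir: Path) -> None:
    final_dir.mkdir(parents=True, exist_ok=True)
    n_shards = len(shard_dirs)
    data_files = []

    for i, shard_dir in enumerate(shard_dirs):
        # Each shard directory (written by save_to_disk) holds one Arrow file.
        src = next(shard_dir.glob("*.arrow"))
        dst_name = f"data-{i:05d}-of-{n_shards:05d}.arrow"
        shutil.move(str(src), str(final_dir / dst_name))
        data_files.append({"filename": dst_name})

    # Reuse the first shard's metadata and point state.json at the moved files,
    # so the directory can be opened with datasets.load_from_disk.
    shutil.copy(shard_dirs[0] / "dataset_info.json", final_dir / "dataset_info.json")
    state = json.loads((shard_dirs[0] / "state.json").read_text())
    state["_data_files"] = data_files
    (final_dir / "state.json").write_text(json.dumps(state, indent=2))
```

The shard Arrow files are never re-serialized; only small JSON metadata gets written, so the cleanup step scales with the number of shards rather than the size of the dataset.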
Type of change
This should be externally identical, just saving the dataset faster. The cached activation runner tests I added previously validate the change well.
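For example, a quick equivalence check (hypothetical paths; `load_from_disk` is the standard `datasets` API):

```python
from datasets import load_from_disk

# Hypothetical paths: compare the consolidated dataset with a reference copy
# written via save_to_disk; the two should match row for row.
fast = load_from_disk("activations_consolidated")
slow = load_from_disk("activations_save_to_disk")
assert fast.column_names == slow.column_names
assert len(fast) == len(slow)
assert fast[0] == slow[0]
```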
Benchmarking the cached activation runner on my Mac:
I expect this to be even more pronounced for larger runs, where computing activations scales faster than disk speed.
Checklist:
- You have tested formatting, typing and unit tests (acceptance tests not currently in use)
- You have run `make check-ci` to check format and linting (you can run `make format` to format code if needed)