Make `tokenize_and_concatenate` work with more datasets #473
Description
TL;DR: I tried to use `tokenize_and_concatenate` with a general HF dataset. It was the function I wanted for my use case [1], but I hit problems: two were blocked by an error message [2], and one passed through without my knowledge [3]. This PR fixes these issues while keeping backward compatibility. The PR:

[1] Makes the function work with more than just arrow datasets. Maybe arrow datasets covered everything in past versions of HF, but now most datasets are not arrow datasets.

[2] a) Only passes `num_proc` if we aren't streaming -- passing it causes a bug when an `IterableDataset` is given to this function.

[2] b) Similarly, skips the final formatting step, as it also causes a bug with `IterableDataset`s.

[3] Makes removing padding optional. E.g. Pythia's training data had pad tokens, and sometimes we want to keep them.
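The fixes in [2a] and [3] can be sketched roughly as below. This is an illustrative sketch, not the actual TransformerLens implementation: `build_map_kwargs`, `drop_pad_tokens`, and the pad token id are all hypothetical names chosen for the example.

```python
# Hypothetical sketch of fixes [2a] and [3]; names are illustrative only.

def build_map_kwargs(streaming: bool, num_proc: int = 4) -> dict:
    """Only include num_proc when not streaming -- IterableDataset.map
    does not accept num_proc, so passing it unconditionally breaks
    streaming datasets."""
    kwargs = {"batched": True}
    if not streaming:
        kwargs["num_proc"] = num_proc
    return kwargs


def drop_pad_tokens(
    token_ids: list[int],
    pad_token_id: int,
    remove_padding: bool = True,
) -> list[int]:
    """Optionally strip pad tokens before concatenation. Keeping them
    matters when the model (e.g. Pythia) was trained on data that
    contained pad tokens."""
    if not remove_padding:
        return token_ids
    return [t for t in token_ids if t != pad_token_id]
```

For example, `drop_pad_tokens([1, 0, 2, 0], pad_token_id=0)` returns `[1, 2]`, while passing `remove_padding=False` leaves the sequence untouched.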