According to `20B_tokenizer.json`, the end-of-document (EOD) token has id 0 and is denoted `<|endoftext|>`. Earlier issues have reported that there are no EOD tokens in `EleutherAI/pile-deduped-pythia-preshuffled`, and this appears to have been unresolved since January 2024. I processed the Pile as instructed in the README.md, based on `EleutherAI/pile-deduped-pythia-preshuffled`, and I can confirm that there appear to be no EOD tokens in the dataset. Besides using `batch_viewer.py`, I also started a training loop and recorded `x.min()` at the beginning of my `forward(x)` function. Both methods show that the smallest token ID is 2, from which I conclude that there are no EOD tokens in this version of the dataset. This causes serious problems both for training and for evaluating on other datasets that use the EOD token, since that token never receives any gradient updates for a model trained on `EleutherAI/pile-deduped-pythia-preshuffled`. Would it be possible to provide a tokenized Pile with 2049 tokens per sequence that does separate documents with EOD tokens?
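For reference, here is a minimal sketch of the check described above, independent of the training loop. It assumes the merged dataset file (the path `document.bin` and the `uint16` dtype are assumptions, not taken from the repo docs) is a flat array of token IDs; adjust the path and dtype to match your local preprocessing output.

```python
# Sketch: scan a flat tokenized dataset for the EOD token (id 0).
# Assumes "document.bin" is a flat uint16 token array; adjust as needed.
import numpy as np

EOD_ID = 0  # <|endoftext|> in 20B_tokenizer.json

tokens = np.memmap("document.bin", dtype=np.uint16, mode="r")

# Scan in chunks to keep memory use bounded on the full Pile.
chunk = 1 << 24
eod_count = 0
min_id = np.iinfo(np.uint16).max
for start in range(0, len(tokens), chunk):
    block = tokens[start:start + chunk]
    eod_count += int((block == EOD_ID).sum())
    min_id = min(min_id, int(block.min()))

print(f"minimum token id: {min_id}")        # observed as 2, i.e. no id-0 tokens
print(f"EOD (<|endoftext|>) count: {eod_count}")
```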