Make tokenize_and_concatenate work with more datasets #473

Open
ArthurConmy wants to merge 5 commits into main
Conversation

@ArthurConmy (Collaborator) commented Dec 28, 2023

Description

TL;DR: I tried to use tokenize_and_concatenate with a general HF Dataset. It was the function I wanted for my use case [1], but I hit three problems: two surfaced as error messages [2], and one passed through silently without my knowledge [3].

This PR fixes these issues while keeping backward compatibility (a rough sketch of the resulting function follows the list below). The PR:

[1] Makes the function work with more than just Arrow datasets. Maybe this was a feature of past HF versions, but most datasets are not Arrow datasets now
[2] a) Only passes num_proc when we aren't streaming -- passing it causes a bug when an IterableDataset is given to this function
[2] b) Similarly, skips the final formatting step, since it also breaks on IterableDatasets
[3] Makes removing padding optional. E.g. Pythia's training data contained pad tokens, and sometimes we want to keep them
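
For orientation, here is a minimal sketch of what the function might look like with [1]–[3] applied. It is illustrative only, not the PR diff: the `streaming` flag, the 20-chunk split, and the assumption that `column_name` is the dataset's only column roughly mirror TransformerLens's existing `tokenize_and_concatenate`, while `remove_pad_tokens` is the new optional parameter.

```python
from typing import Union

import datasets


def tokenize_and_concatenate(
    dataset: Union[datasets.Dataset, datasets.IterableDataset],
    tokenizer,
    streaming: bool = False,
    max_length: int = 1024,
    column_name: str = "text",
    num_proc: int = 10,
    remove_pad_tokens: bool = True,  # [3] padding removal is now optional
):
    """Concatenate all text, tokenize it, and chop it into rows of max_length tokens."""

    def tokenize_function(examples):
        # Join the batch into one string, separated by EOS tokens.
        full_text = tokenizer.eos_token.join(examples[column_name])
        # Split into 20 chunks so the tokenizer works on shorter strings.
        num_chunks = 20
        chunk_length = (len(full_text) - 1) // num_chunks + 1
        chunks = [
            full_text[i * chunk_length : (i + 1) * chunk_length]
            for i in range(num_chunks)
        ]
        # Assumes tokenizer.pad_token is set; shorter chunks are padded to match.
        tokens = tokenizer(chunks, return_tensors="np", padding=True)[
            "input_ids"
        ].flatten()
        if remove_pad_tokens:
            # Pad tokens come from the uneven chunk lengths above, and may
            # also be present in the training data itself.
            tokens = tokens[tokens != tokenizer.pad_token_id]
        num_batches = len(tokens) // max_length
        tokens = tokens[: num_batches * max_length]
        return {"tokens": tokens.reshape(num_batches, max_length)}

    # [2a] IterableDataset.map does not accept num_proc, so only pass it
    # when we are not streaming.
    map_kwargs = {} if streaming else {"num_proc": num_proc}
    tokenized = dataset.map(
        tokenize_function,
        batched=True,
        remove_columns=[column_name],  # assumes column_name is the only column
        **map_kwargs,
    )
    # [2b] The final torch formatting also breaks on IterableDatasets, so skip it.
    if not streaming:
        tokenized = tokenized.with_format("torch")
    return tokenized
```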

Type of change

Please delete options that are not relevant.

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] This change requires a documentation update

Checklist:

  • [x] I have commented my code, particularly in hard-to-understand areas
  • [x] I have made corresponding changes to the documentation
  • [x] My changes generate no new warnings
  • [x] I have added tests that prove my fix is effective or that my feature works
  • [x] New and existing unit tests pass locally with my changes
  • [x] I have not rewritten tests relating to key interfaces which would affect backward compatibility

@ArthurConmy added the enhancement (New feature or request) label Dec 28, 2023
```diff
-        # Drop padding tokens
-        tokens = tokens[tokens != tokenizer.pad_token_id]
+        if remove_pad_tokens:
+            # Drop padding tokens
+            tokens = tokens[tokens != tokenizer.pad_token_id]
```
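
For illustration, a hedged usage example of the new behavior; the dataset name is a placeholder, and setting gpt2's pad token to its EOS token is just one common workaround for tokenizers that ship without a pad token.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default

# Placeholder dataset name; any text dataset with a "text" column works.
ds = load_dataset("your/dataset", split="train", streaming=True)

# Streaming dataset: num_proc and torch formatting are skipped internally,
# and pad tokens in the data are kept (as in e.g. Pythia's training data).
tokenized = tokenize_and_concatenate(
    ds, tokenizer, streaming=True, remove_pad_tokens=False
)
```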
A collaborator commented on this diff:
Can you add a comment noting that padding tokens may be present because the chunks have uneven lengths (we split the text into 20 chunks for tokenization efficiency), in addition to possibly being in the training data?
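
One way the requested comment might read (a sketch, not the committed wording):

```python
if remove_pad_tokens:
    # Pad tokens appear because we split the text into 20 chunks (for
    # tokenization efficiency) and the chunks have uneven lengths, so the
    # shorter ones are padded; pad tokens may also occur in the training
    # data itself (e.g. Pythia's).
    tokens = tokens[tokens != tokenizer.pad_token_id]
```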
