removing truncation in activations store data loading #62
Conversation
CI is failing for this PR because the eindex dependency changed its name to eindex-callum. This issue is fixed in #63.
This truncation was actually intentional, to increase the variability in the tokens being used to train the SAE. We could do an A/B test to see if it's worth it.
ah interesting - if it's intentional I'll close this PR. I do feel like the behavior is unexpected though - maybe we could add a param to the config for something like max_tokens_per_training_sample or something like that, so the behavior is more explicit? It seems like it could be decoupled from the tokenizer max_len as well, since there's no real reason why the max len of the tokenizer would necessarily be the correct length to increase token variability.
Agreed! We should think about this further. Don't close the PR for now.
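For concreteness, a minimal sketch of what the suggested option might look like. The field name max_tokens_per_training_sample comes from the comment above; the config class and helper are hypothetical, not SAELens's actual API:

```python
# Hypothetical sketch only: the field name comes from the discussion above,
# but this dataclass and helper are not SAELens's real config or API.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActivationsStoreConfig:
    context_size: int = 128
    # None means "never truncate"; an int caps each raw dataset line
    # before it is reshaped into context_size-length samples.
    max_tokens_per_training_sample: Optional[int] = None


def maybe_truncate(token_ids: list[int], cfg: ActivationsStoreConfig) -> list[int]:
    """Apply the explicit cap, decoupled from the tokenizer's max_len."""
    if cfg.max_tokens_per_training_sample is None:
        return token_ids
    return token_ids[: cfg.max_tokens_per_training_sample]
```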
OK - re-opening this for now
Force-pushed from 58bb4f5 to 106ef51.
Codecov Report: All modified and coverable lines are covered by tests ✅

@@           Coverage Diff           @@
##             main      #62   +/-   ##
=======================================
  Coverage   59.11%   59.11%
=======================================
  Files          25       25
  Lines        2595     2595
  Branches      439      439
=======================================
  Hits         1534     1534
  Misses        984      984
  Partials       77       77

View full report in Codecov by Sentry.
Force-pushed from 21cbb0d to 770facc.
@chanind how did we not merge this?? We should merge it. If you get a sec can you please rebase? Let me know as I'll do it if needed.
Force-pushed from 2a00041 to 6435037.
my bad, updating this PR now and will merge once everything passes
Currently, when ActivationsStore loads the next line of tokens while filling in batches, it truncates the line to the tokenizer max len (1024 tokens for gpt2). This truncation is not necessary, however, since we reshape the tokens into batches of length context_size anyway, regardless of how long the original line of tokens is. As a result, we are likely cutting off a lot of the train dataset text when loading it into the ActivationsStore.

This PR addresses the issue by removing truncation from the tokenization step, so we always tokenize the full length of the training text line.
anyway, regardless of how long the original line of tokens is. As a result, we are likely cutting off a lot of the train dataset text when loading it into the ActivationsStore.This PR addresses the issue by removing truncation from the tokenization step, so we always tokenize the full length of the training text line.