
removing truncation in activations store data loading #62

Merged: 3 commits merged into jbloomAus:main from fix-early-data-truncation on Jun 2, 2024

Conversation

chanind (Collaborator) commented Apr 1, 2024

Currently, when ActivationsStore loads the next line of tokens while filling in batches, it truncates the line to the tokenizer max len (1024 tokens for gpt2). This truncation is not necessary, however, since we reshape the tokens into batches of length context_size anyway, regardless of how long the original line of tokens is. As a result, we are likely cutting off a lot of the training dataset text when loading it into the ActivationsStore.

This PR addresses the issue by removing truncation from the tokenization step, so we always tokenize the full length of the training text line.
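
For context, here is a minimal sketch of the two steps involved (illustrative only, not the actual ActivationsStore code; the helper names and the context_size value are assumptions):

    import torch
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    context_size = 128  # illustrative; set by the training config in practice

    def tokenize_line(text: str) -> torch.Tensor:
        # Before this PR, the equivalent call truncated each line to
        # tokenizer.model_max_length (1024 tokens for gpt2). With truncation
        # disabled, the full line is tokenized.
        return tokenizer(text, truncation=False, return_tensors="pt")["input_ids"].squeeze(0)

    def reshape_into_contexts(tokens: torch.Tensor) -> torch.Tensor:
        # Rows of length context_size are produced regardless of how long the
        # original line was; only the tail that doesn't fill a full row is dropped.
        n_rows = tokens.shape[0] // context_size
        return tokens[: n_rows * context_size].reshape(n_rows, context_size)

Under these assumptions, a 10,000-token document would contribute 78 rows of 128-token contexts once the 1024-token cap is removed, instead of at most 8.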

chanind (Collaborator, Author) commented Apr 1, 2024

CI is failing for this PR because the eindex dependency changed its name to eindex-callum. This issue is fixed in #63.

jbloomAus (Owner) commented Apr 1, 2024 via email

chanind (Collaborator, Author) commented Apr 1, 2024

ah interesting - if it's intentional I'll close this PR. The behavior does feel unexpected, though - maybe we could add a config param like max_tokens_per_training_sample so the behavior is more explicit? It could also be decoupled from the tokenizer max_len, since there's no reason the tokenizer's max length would necessarily be the right length for increasing token variability.
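
A rough sketch of that proposal (hypothetical; max_tokens_per_training_sample is only being suggested in this comment and is not an existing SAELens config field):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ActivationsStoreConfig:  # hypothetical, trimmed-down config for illustration
        context_size: int = 128
        # Proposed field: explicit cap on tokens taken from each training sample.
        # None means "use the whole line"; deliberately decoupled from
        # tokenizer.model_max_length.
        max_tokens_per_training_sample: Optional[int] = None

    def apply_sample_cap(tokens, cfg: ActivationsStoreConfig):
        # Truncation now happens only when explicitly requested via the config.
        if cfg.max_tokens_per_training_sample is not None:
            return tokens[: cfg.max_tokens_per_training_sample]
        return tokens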

chanind closed this Apr 1, 2024
chanind deleted the fix-early-data-truncation branch April 1, 2024 19:59
jbloomAus (Owner) commented Apr 1, 2024 via email

chanind restored the fix-early-data-truncation branch April 1, 2024 22:46
chanind (Collaborator, Author) commented Apr 1, 2024

OK - re-opening this for now

chanind reopened this Apr 1, 2024
codecov bot commented Apr 1, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 59.11%. Comparing base (1a2cde0) to head (6435037).
Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main      #62   +/-   ##
=======================================
  Coverage   59.11%   59.11%           
=======================================
  Files          25       25           
  Lines        2595     2595           
  Branches      439      439           
=======================================
  Hits         1534     1534           
  Misses        984      984           
  Partials       77       77           


jbloomAus (Owner) commented

@chanind how did we not merge this?? We should merge it. If you get a sec, can you please rebase? Let me know, as I'll do it if needed.

chanind force-pushed the fix-early-data-truncation branch from 2a00041 to 6435037 on June 2, 2024 21:18
chanind (Collaborator, Author) commented Jun 2, 2024

my bad, updating this PR now and will merge once everything passes

chanind merged commit 43c93e2 into jbloomAus:main Jun 2, 2024 (7 checks passed)
chanind deleted the fix-early-data-truncation branch June 2, 2024 21:25
tom-pollak pushed a commit to tom-pollak/SAELens that referenced this pull request Oct 22, 2024