
removing truncation in activations store data loading #62

Merged: 3 commits merged into jbloomAus:main from fix-early-data-truncation on Jun 2, 2024

Conversation

chanind (Collaborator) commented Apr 1, 2024

Currently, when ActivationsStore loads the next line of tokens while filling in batches, it truncates the line to the tokenizer max len (1024 tokens for gpt2). This truncation is not necessary, however, since we reshape the tokens into batches of length context_size anyway, regardless of how long the original line of tokens is. As a result, we are likely cutting off a lot of the training dataset text when loading it into the ActivationsStore.

This PR addresses the issue by removing truncation from the tokenization step, so we always tokenize the full length of the training text line.
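
For context, here is a minimal sketch of the two steps involved (illustrative only, not the actual ActivationsStore code; the helper names and the context_size value are assumptions):

    import torch
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    context_size = 128  # illustrative; set by the training config in practice

    def tokenize_line(text: str) -> torch.Tensor:
        # Before this PR, the equivalent call truncated each line to
        # tokenizer.model_max_length (1024 tokens for gpt2). With truncation
        # disabled, the full line is tokenized.
        return tokenizer(text, truncation=False, return_tensors="pt")["input_ids"].squeeze(0)

    def reshape_into_contexts(tokens: torch.Tensor) -> torch.Tensor:
        # Rows of length context_size are produced regardless of how long the
        # original line was; only the tail that doesn't fill a full row is dropped.
        n_rows = tokens.shape[0] // context_size
        return tokens[: n_rows * context_size].reshape(n_rows, context_size)

Under these assumptions, a 10,000-token document would contribute 78 rows of 128-token contexts once the 1024-token cap is removed, instead of at most 8.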

chanind (Collaborator, Author) commented Apr 1, 2024

CI is failing for this PR because the eindex dependency changed its name to eindex-callum. This issue is fixed in #63.

jbloomAus (Owner) commented Apr 1, 2024 via email

chanind (Collaborator, Author) commented Apr 1, 2024

ah interesting - if it's intentional I'll close this PR. The behavior does feel unexpected, though - maybe we could add a config param like max_tokens_per_training_sample so the behavior is more explicit? It could also be decoupled from the tokenizer max_len, since there's no reason the tokenizer's max length would necessarily be the right length for increasing token variability.
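
A rough sketch of that proposal (hypothetical; max_tokens_per_training_sample is only being suggested in this comment and is not an existing SAELens config field):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ActivationsStoreConfig:  # hypothetical, trimmed-down config for illustration
        context_size: int = 128
        # Proposed field: explicit cap on tokens taken from each training sample.
        # None means "use the whole line"; deliberately decoupled from
        # tokenizer.model_max_length.
        max_tokens_per_training_sample: Optional[int] = None

    def apply_sample_cap(tokens, cfg: ActivationsStoreConfig):
        # Truncation now happens only when explicitly requested via the config.
        if cfg.max_tokens_per_training_sample is not None:
            return tokens[: cfg.max_tokens_per_training_sample]
        return tokens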

chanind closed this Apr 1, 2024
chanind deleted the fix-early-data-truncation branch April 1, 2024 19:59
jbloomAus (Owner) commented Apr 1, 2024 via email

chanind restored the fix-early-data-truncation branch April 1, 2024 22:46
chanind (Collaborator, Author) commented Apr 1, 2024

OK - re-opening this for now

chanind reopened this Apr 1, 2024
codecov bot commented Apr 1, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 59.11%. Comparing base (1a2cde0) to head (6435037).
Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main      #62   +/-   ##
=======================================
  Coverage   59.11%   59.11%           
=======================================
  Files          25       25           
  Lines        2595     2595           
  Branches      439      439           
=======================================
  Hits         1534     1534           
  Misses        984      984           
  Partials       77       77           


jbloomAus (Owner) commented

@chanind how did we not merge this?? We should merge it. If you get a sec, can you please rebase? Let me know, as I'll do it if needed.

chanind force-pushed the fix-early-data-truncation branch from 2a00041 to 6435037 on June 2, 2024 21:18
chanind (Collaborator, Author) commented Jun 2, 2024

my bad, updating this PR now and will merge once everything passes

chanind merged commit 43c93e2 into jbloomAus:main Jun 2, 2024 (7 checks passed)
chanind deleted the fix-early-data-truncation branch June 2, 2024 21:25
tom-pollak pushed a commit to tom-pollak/SAELens that referenced this pull request Oct 22, 2024