Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tokenizer overhaul (complete) #38

Merged
merged 325 commits into from
Aug 28, 2024
Merged

Tokenizer overhaul (complete) #38

merged 325 commits into from
Aug 28, 2024

Conversation

mivanit
Copy link
Member

@mivanit mivanit commented Jun 10, 2024

TODOS:

(roughly in order of importance)

  • checks.yml: status

  • all-tok-checks.yml: status

  • depends on Tokenizer Overhaul Step 2 #37 being merged into tokenizer-overhaul-integration first

  • depends on tests passing in maze-transformer #214

  • make sure that tokenizer hash numpy file is shipped with maze-dataset package

    • maybe make it optional, only if tokenization extra is requested?
    • no need to ship this, its only required for tests
  • go over existing # TODO comments, see if anything is critical

    • parametrize some tests in tests/unit/maze_dataset/generation/test_maze_dataset.py
  • clean up various superfluous warnings

  • consider optimizations to all_instances and save_hashes

    • unclear if save_hashes is being properly parallelized :/
  • see if any more utils should be moved to muutils

aaron-sandoval and others added 30 commits May 10, 2024 12:53
…ns_backwards_compatible` and `test_to_legacy_tokenizer` passing
…to resolve circular imports. Tests not run yet. Will need to refactor imports in maze-transformer and other repos
… few key tests are passing, but I suspect that overriding `TokenizerElement.__hash__` wholesale is not a good idea.
…`Cardinal` and its references since VSCode pytest extension was crashing during test collection. When testing other part of the code, recommend nullifying `Cardinal` and `Relative`. This may be done by either commenting them and all references out or by adding `@abc.abstractmethod` to their method to remove them from consideration by `all_instances(StepTokenizer)`. Would still need to remove any uses that attempted to construct `Cardinal()` or `Relative()`, which right now only exist in `test_tokenizer.py`.
….Relative` since building and testing `PathTokenizers` is done for now. Need to reactivate them later.
mivanit and others added 17 commits August 20, 2024 15:34
…ed tokenizer regions. Lots of superfluous warnings
wandb fails in the backend when passing data with this key and a value
of longer than 64 chars, see:
wandb/wandb#8159

this was causing integration tests to hang idefinitely in understanding-search/maze-transformer#214
I think it's fine to have shorthands for these things, as well as to have backwards compatibility.
@aaron-sandoval aaron-sandoval linked an issue Aug 22, 2024 that may be closed by this pull request
Copy link
Member Author

@mivanit mivanit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving -- this is finally done!

@mivanit mivanit merged commit d5a20d6 into main Aug 28, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request tokenization anything to do with tokenization
Projects
None yet
2 participants