Fix BytePair special tokens tokenization #1447
Conversation
Thanks very much @abuelnasr0! Finally freeing up from our Gemma release. I'll try to review #1447, #1445 and #1397 as a set, but just a heads up I'll probably post feedback next week. In the meantime, if you are looking for something to do, we still need …
@mattdangerw no problem, take your time. The Gemma release was awesome work from you and the team.
We will update our samplers in the near future to push the backend-specific compilation details out: keras-team#1425. Also, in general we want our documentation to reflect the main usage of our classes, which is using them with Seq2SeqLM and CausalLM classes. So with that in mind, this updates our sampler docs to show the practical usage of the sampling classes with our modeling classes. For the base class, we show the main use case of overriding the `get_next_token()` function.
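A minimal sketch of that use case, assuming the `keras_nlp.samplers.Sampler` base class and `keras.ops`; the preset name in the usage comment is illustrative:

```python
import keras
import keras_nlp


class GreedyLikeSampler(keras_nlp.samplers.Sampler):
    """A minimal sketch of a custom sampler that overrides `get_next_token()`."""

    def get_next_token(self, probabilities):
        # `probabilities` holds a (batch_size, vocab_size) distribution over
        # the next token; a greedy sampler simply takes the argmax.
        return keras.ops.argmax(probabilities, axis=-1)


# Practical usage with a modeling class, as the updated docs describe
# (commented out here because it downloads preset weights):
# causal_lm = keras_nlp.models.GPT2CausalLM.from_preset("gpt2_base_en")
# causal_lm.compile(sampler=GreedyLikeSampler())
# causal_lm.generate("The quick brown fox", max_length=30)
```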
The Keras implementation of the Gemma model was the effort of a number of contributors: - Initial architecture: Gabriel Rasskin, Francois Chollet, Matt Watson - Model parallelism: Qianli Scott Zhu - Model export for inference: Neel Kovelamudi - Lora implementation: Francois Chollet, Samaneh Saadat - Benchmarking: Haifeng Jin - Interpretability extensions: Ryan Mullins - Testing infrastructure: Ramesh Sampath Many more helped with documentation and Kaggle integration. Co-authored-by: Francois Chollet <[email protected]> Co-authored-by: Gabriel Rasskin <[email protected]> Co-authored-by: Qianli Scott Zhu <[email protected]> Co-authored-by: Neel Kovelamudi <[email protected]> Co-authored-by: Samaneh Saadat <[email protected]> Co-authored-by: Haifeng Jin <[email protected]> Co-authored-by: Ramesh Sampath <[email protected]> Co-authored-by: Ryan Mullins <[email protected]>
Includes some small cleanups for the Kaggle assets.
…as-team#1471) * Add docstring for conversion script install instructions * Add docstring to verification script * Change wording
We can skip these by default, for users who have not yet set them up. We will need to set them up for CI, see keras-team#1459
0.8 is out! We can consider our master branch a 0.9 preview.
Hi wonderful Keras folks, I was browsing the new Gemma source and noticed that the RMSNorm code didn't use the epsilon parameter it takes in. This fixes that. While we're here, I'm curious what drove the 1+scale multiplier (instead of just initializing scale to 1). Would love to learn if you're down to share. Thanks, Chris (ex-Googler)
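For reference, a minimal NumPy sketch of the computation being discussed (not the actual Keras layer; the default epsilon value is an assumption):

```python
import numpy as np


def rms_norm(x, scale, epsilon=1e-6):
    # Root-mean-square normalization; `epsilon` guards the square root and is
    # the parameter the fix wires back into the Gemma layer.
    rms = np.sqrt(np.mean(np.square(x), axis=-1, keepdims=True) + epsilon)
    # Gemma-style gain: multiply by (1 + scale), with `scale` initialized to
    # zeros, so the layer starts out with a unit gain.
    return x / rms * (1.0 + scale)
```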
* Add Falcon backbone. * Add docstring. * Add dtype. * Add checkpoint conversion script. * Fix tests. * Random fixes. * Add cache. * Cast cumsum to int32. * Make sublayers public. * Address backbone comments. * Update attention computation to use einsum. * Falcon only works with Keras3. * Fix tests. * Remove falcon_causal_lm file. * Remove commented/unused code.
* CI - Add kaggle creds to pull model * add kaggle env variables * Kaggle env: * Kaggle env: * Kaggle env: * Kaggle env: * Update Build script for Kokoro * Add Kaggle env var * set gemma preset to extra_large * Change Gemma small preset to bfloat16 * Change Gemma small preset to xlarge
* Fix dtype accessors of tasks/backbones * Address comments, minor fixes
This reverts commit 97c3413.
* Docs(layers): add a description for `tie_weights` argument * Refactor(layers): make `name` an explicit argument for Transformer layers * Refactor(layers): remove explicit usage of `name` in `__init__` calls * Docs(layers): remove references to `name` and consistently documents `**kwargs`
…s-team#1397) * Support tokenization of special tokens for word_piece_tokenizer * Add the feature to models tokenizers * Format the code * Fix format * Small fixes * Add tests for bert * Add tests for distilbert * Small fix for bert test * Add tests for electra * Fix code format * Rename unsplittable to special * Edit special_tokens Arg * Format the code * Move special tokens checking into base class * Add special_tokens_in_strings Arg * Shorten comments * Shorten comments * Shorten the logic of splitting and add comments * Code format
* Initial Kaggle upload. * Address review comments. * Add upload validations. * Address review comments. * Fix init. * Address review comments. * Improve error handling. * Address review comments.
* Add scoring mode to MistralCausalLM * Fixing names in Docstring * Fix padding mask arg name * Fix embedded shape in test * Remove errant underscore in Docstring
* Add Kaggle upload validation tests. * Use bert_tiny as test model.
…1384) * Added ElectraBackbone * Added backbone tests for ELECTRA * Fix config * Add model import to __init__ * add electra tokenizer * add tests for tokenizer * add __init__ file * add tokenizer and backbone to models __init__ * Fix Failing tokenization test * Add example on usage of the tokenizer with custom vocabulary * Add conversion script to convert weights from checkpoint * Add electra preprocessor * Add presets and tests * Add presets config with model weights * Add checkpoint conversion script * Name conversion for electra models * Update naming conventions according to preset names * Fix failing tokenizer tests * Update checkpoint conversion script according to kaggle * Add validate function * Kaggle preset * update preset link * Add electra presets * Complete run_small_preset test for electra * Add large variations of electra in presets * Fix case issues with electra presets * Fix format --------- Co-authored-by: Matt Watson <[email protected]>
* first draft * update upload_preset * lint * consistent error messages * lint
* Add multitoken stopping * Update gemma_causal_lm.py * Add further multitoken support * Formatting * Revert tokenizer changes * Move multi token stop to generative task * None check * None check * Error message * Add stop_token_ids * Util testing * Fix sampler tests * Add multitoken stop to all models * Sampler multi token * Formatting * Tuple required * Tuple docstring * Pytorch GPU fix * Numpy fix
* Add lora example to GemmaCausalLM docstring. * Address review.
* Add LLaMA Causal LM * Add causal lm to the public API * Update preset names and fix checkpoint script * Fix discrepancies and add tests * Add tests for CausalLM * end_token -> stop_token_ids
This PR grew as I was writing it, and now adds a number of new features: 1. Exposed base classes. Sets us on a path for better documentation, a more "introspectable" library, and allows sub-classing. 2. Enable `from_preset()` on base classes for any subclass preset. This gives us similar functionality to "auto classes" in Hugging Face, without the extra overhead of needing a new symbol. 3. An ability to register new tasks/backbones/tokenizers from out-of-tree code with `keras.saving.register_keras_serializable()`. Try a colab: https://colab.research.google.com/gist/mattdangerw/da885f050fa8baef9b4f9a4ec68d6567/kerasnlp-base-classes.ipynb
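A rough sketch of the usage this enables, assuming the exposed base classes and the standard Keras serialization decorator (preset and class names are illustrative):

```python
import keras
import keras_nlp

# Base classes can now load any subclass preset, similar to "auto classes"
# (commented out here because it downloads preset weights):
# backbone = keras_nlp.models.Backbone.from_preset("bert_tiny_en_uncased")
# causal_lm = keras_nlp.models.CausalLM.from_preset("gpt2_base_en")


# Out-of-tree code can register its own classes so they participate in
# saving and preset loading.
@keras.saving.register_keras_serializable(package="my_package")
class MyCustomBackbone(keras_nlp.models.Backbone):
    pass
```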
* Run the LLaMA RMS Layer Norm in float32 * Also use float32 in Mistral Layer Norm * Address review comments - Change private variables to public vars - Change `self._weight` to `self.scale` - Don't persist the input dim - Move the var computation to its own line for readability * Change weights to scale in layer norm
* Adds score API to GPT-2 * Addressing reviewer comments
…s-team#1523) * Implement compute_output_spec() for tokenizers with vocabulary. (restarted from new point in master branch) * Remove type annotation from compute_output_spec() in tokenizers
Currently, Keras as a whole is not doing type annotations, but we still have a few stragglers. Removing them, as they occasionally cause confusion.
…am#1540) * Fix discrepancy between HF LLaMA and our implementation * Fix Mistral transformer decoder
Bumps the python group with 2 updates: torch and torchvision. Updates `torch` from 2.2.1+cu121 to 2.2.2+cu121 Updates `torchvision` from 0.17.1+cu121 to 0.17.2+cu121 --- updated-dependencies: - dependency-name: torch dependency-type: direct:production update-type: version-update:semver-patch dependency-group: python - dependency-name: torchvision dependency-type: direct:production update-type: version-update:semver-patch dependency-group: python ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
The BytePair tokenizer already tokenizes special tokens, but it had a small nit, explained in #1435.
This PR fixes it.
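For context, a minimal sketch of the intended behavior, using a toy vocabulary; the special-token argument name is an assumption carried over from the related WordPiece change (keras-team#1397) and may differ here:

```python
import keras_nlp

# Toy vocabulary and merge rules, purely for illustration.
vocab = {"<s>": 0, "</s>": 1, "a": 2, "b": 3, "ab": 4}
merges = ["a b"]

tokenizer = keras_nlp.tokenizers.BytePairTokenizer(
    vocabulary=vocab,
    merges=merges,
    # Pre-existing argument name; this PR may rename or extend it.
    unsplittable_tokens=["<s>", "</s>"],
)

# After the fix, a special token appearing in a raw string should map to its
# own id instead of being split into byte-pair pieces, e.g.:
# tokenizer("<s>ab</s>")  # expected roughly: [0, 4, 1]
```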