[Bug Report] Qwen model implementation is too inaccurate #683
I think the problem you're seeing is caused by the prompt formatting and not the implementation differences. I compared the TL model to the HF model, and while there are some small differences in activations, the logit differences seem negligible (max difference around 0.009). This model uses the ChatML template for its inputs, so for your example the prompt should be wrapped in that template (see the sketch below), whereas the example as written feeds the raw text to the model. I also noticed that TL will fall back to prepending the EOS token if BOS is absent, which seems incorrect in this case.
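A sketch of the ChatML wrapping, for reference. The system message and user prompt here are illustrative; the `<|im_start|>`/`<|im_end|>` markers are the standard ChatML tokens Qwen chat models are trained on:

```python
# ChatML-style prompt formatting used by Qwen chat models.
# The raw prompt is wrapped in <|im_start|>/<|im_end|> role markers,
# and generation continues from the assistant header.
prompt = "What is the capital of France?"  # illustrative example

chatml_input = (
    "<|im_start|>system\n"
    "You are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    f"{prompt}<|im_end|>\n"
    "<|im_start|>assistant\n"
)
```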
@mntss It is completely certain that the issue is implementation inaccuracy. This topic has been discussed a lot over the last few months; if you are curious about the details, I would refer you to issues #570 and #685, the latter of which was opened today. All three of these issues are related to the same problem, and the problem is systemic. If you are curious what a fix for something like this looks like, I would refer you to PR #652, which resolved the issue for Mixtral, but the issue remains across most implementations. We are simply in the process of identifying which implementations are more impacted at this point.

The benchmark you ran is many orders of magnitude worse than for other supported models. For example, Mixtral was off by one hundred-thousandth, yet it was generating French, Spanish, and German on English prompts; 0.009 is a remarkably bad result for the benchmark you are looking at.

95% of the problem is the usage of einsum, which is not used at all in any official implementation within Transformers. Once those einsum-based implementations are removed, the inaccuracies clear up in almost all cases. The issue with EOS tokens could also be a part of the problem, but as with Mixtral it is likely 10-20 different factors, with the vast majority of the inaccuracy caused by einsum.

I am in the middle of revamping weight conversions at the moment so that benchmarking tools can be built to automate the benchmark you ran, but it is a pretty involved process. Once I am done building the benchmarking tools, I will be analyzing each implementation currently supported by TransformerLens to identify where the inaccuracies are most pronounced. If you are interested in helping with this process, let me know! We are looking for people who are interested in helping resolve this problem across the board.
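For reference, the benchmark in question is essentially a comparison of the two models' logits on identical tokens. A hedged sketch of that comparison; the checkpoint name and prompt are illustrative assumptions, and `hf_model`/`tokenizer` are existing keyword arguments of `HookedTransformer.from_pretrained`:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformer_lens import HookedTransformer

model_name = "Qwen/Qwen-1_8B"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
hf_model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Reuse the already-loaded HF weights so both models start from the same state.
tl_model = HookedTransformer.from_pretrained(
    model_name, hf_model=hf_model, tokenizer=tokenizer
)

tokens = tokenizer("The quick brown fox", return_tensors="pt").input_ids
with torch.no_grad():
    hf_logits = hf_model(tokens).logits
    tl_logits = tl_model(tokens)  # returns logits by default

# A faithful port should drive this toward zero; ~0.009 is large by comparison.
print((hf_logits - tl_logits).abs().max())
```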
Thanks for clarifying! From the issue description, I assumed the problem was that the generated tokens were in Chinese, and that behavior is the same for the HF implementation on this prompt, not the result of the inaccuracy. Also, I found out that the Qwen model does not respect the … setting.

It would be helpful for me to understand the target implementation accuracy for TL. I noticed this test, which expects a perfect match for GPT-2: https://github.com/TransformerLensOrg/TransformerLens/blob/main/tests/integration/test_match_huggingface.py

In the case of the Qwen model, the attention modules seem like the main issue. However, the outputs of the MLP modules also do not match perfectly, because the weights are stored in a different orientation than in HF, so the matmuls run against transposed matrices and can produce slightly different floating-point results, as in the sketch below.
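A small illustration of the orientation point. The shapes here are illustrative assumptions (not TransformerLens internals): multiplying by a weight stored as `[d_model, d_mlp]` versus the HF `nn.Linear` convention of `[d_mlp, d_model]` is mathematically equivalent, but the two layouts can take different kernel paths:

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 1024)     # a batch of residual-stream vectors
w = torch.randn(1024, 4096)  # MLP input weights in [d_model, d_mlp] orientation

# TransformerLens-style: multiply by the weight stored as [d_model, d_mlp].
out_a = x @ w

# HF Linear-style: the weight is stored transposed ([d_mlp, d_model]) and
# F.linear computes x @ weight.T internally.
out_b = torch.nn.functional.linear(x, w.T)

# Mathematically identical, but the different memory layouts can change the
# floating-point summation order, leaving a tiny residual difference.
print((out_a - out_b).abs().max())
```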
The whole Qwen model family seems to be pretty inaccurate. I have not done complete benchmarks to determine where the issue is yet; that still needs to be done to find the specific area causing the error. This is probably due to einsum usage, plus a slight inaccuracy relative to the Transformers implementation. To solve this, we need to remove any potentially troublesome einsums in the model and verify that any components used have implementations matching Transformers, which may result in the creation of more components in TransformerLens. The kind of rewrite involved is sketched below.
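As a hedged illustration of the kind of change involved (the tensor names and shapes here are assumptions, not actual TransformerLens internals): an attention-score einsum can be replaced with the explicit matmul that reference implementations use, so both codebases go down the same kernel path:

```python
import torch

torch.manual_seed(0)
q = torch.randn(2, 8, 16, 64)  # [batch, heads, seq, head_dim], illustrative shapes
k = torch.randn(2, 8, 16, 64)

# einsum formulation, of the style used in several TransformerLens components:
scores_einsum = torch.einsum("bhqd,bhkd->bhqk", q, k)

# Equivalent explicit matmul, matching how Transformers computes attention scores:
scores_matmul = q @ k.transpose(-1, -2)

# The two are mathematically identical; the point of the refactor is to take the
# same kernel path as the reference implementation so results match as closely
# as possible.
print((scores_einsum - scores_matmul).abs().max())
```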
Describe the bug
The output is currently switching languages in what seems to be all models. I tested three different models and found that when putting in English, the output will sometimes be a bit of nonsense, often with some Chinese mixed in. I then decided to generate a bit in Chinese, which resulted in kanji-style Japanese being generated. This is particularly interesting, since the characters I was using exist in both Chinese and Japanese, but if the model had mistaken my input for Japanese, it should still have generated the same writing style.
Code example
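A minimal sketch of the kind of reproduction described above; the checkpoint name, prompt, and sampling settings are illustrative assumptions:

```python
from transformer_lens import HookedTransformer

# Load a Qwen checkpoint through TransformerLens (model name is an assumption;
# the report says multiple models show the same class of behavior).
model = HookedTransformer.from_pretrained("Qwen/Qwen-1_8B-Chat")

# Generate from a plain English prompt; the reported bug is that the
# continuation drifts into Chinese (or, for Chinese input, Japanese-style text).
output = model.generate(
    "The weather today is",
    max_new_tokens=50,
    temperature=0.7,
)
print(output)
```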
System Info
This was found in Colab using various versions of TransformerLens 2.x and 1.x.
Additional context
Checklist