
[Bug Report] Qwen model implementation is too inaccurate #683

Open · 1 task done
bryce13950 opened this issue Jul 23, 2024 · 3 comments
Labels
complexity-high: Very complicated changes for people to address who are quite familiar with the code
implementation-inaccuracy: Any issues related to our implementation being off from the official version
needs-investigation: Issues that need to be recreated, or investigated before work can be done

Comments

@bryce13950
Collaborator

The whole Qwen model family seems to be pretty inaccurate. I have not yet done complete benchmarks to determine where the issue is; that still needs to be done to find the specific area causing the error. It is probably due to einsum usage and a slight inaccuracy relative to the Transformers implementation. To solve this, we need to remove any potentially troublesome einsums in the model and verify that the components used have implementations similar to Transformers, which may result in the creation of more components in TransformerLens.
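As a rough illustration of that refactor, here is a minimal sketch (not TransformerLens code; the shapes and names like z and W_O are made up) of replacing an einsum over an attention output projection with an explicit reshape and matmul, which is closer to how Transformers computes it:

import torch

torch.manual_seed(0)

# Illustrative shapes: z is [batch, pos, n_heads, d_head], W_O is [n_heads, d_head, d_model]
z = torch.randn(2, 8, 16, 64)
W_O = torch.randn(16, 64, 512)

# einsum-based path (the style to be removed)
out_einsum = torch.einsum("bpnh,nhm->bpm", z, W_O)

# reshape + matmul path (closer to the official implementation)
out_matmul = z.reshape(2, 8, 16 * 64) @ W_O.reshape(16 * 64, 512)

# Mathematically identical; any gap is floating-point rounding from different
# kernel paths and may be exactly zero for a pattern this simple.
print((out_einsum - out_matmul).abs().max())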

Describe the bug
The output is currently switching languages in what seems to be all models in the family. I tested three different models and found that with English input, the output is sometimes a bit of nonsense, often with some Chinese mixed in. I then generated a bit from a Chinese prompt, which resulted in Japanese kanji being generated. This is particularly interesting, since the characters I was using exist in both Chinese and Japanese, but even if the model mistook my input for Japanese, it should still have generated the same writing style.

Code example

import torch
from transformer_lens import HookedTransformer

# Load the model without weight processing so it stays as close as possible to the HF weights.
model = HookedTransformer.from_pretrained_no_processing(
    "Qwen/Qwen-1_8B-Chat",
    fp32=True,
    dtype=torch.float32,
)
model.generate(
    "hello my name is ",
    verbose=False,
)

System Info
This was found in Colab using various versions of TransformerLens 2.x and 1.x.

Additional context

Checklist

  • I have checked that there is no similar issue in the repo (required)
@bryce13950 added the complexity-high, needs-investigation, and implementation-inaccuracy labels on Jul 23, 2024
@mntss
Contributor

mntss commented Jul 27, 2024

I think the problem you're seeing is caused by the prompt formatting and not the implementation differences.

I compared the TL model to the HF model, and while there are some small differences in activations, the logit differences seem negligible (max 2.62e-06 diff in the softmax outputs; I originally reported 0.0009 before correcting the dtype, see my follow-up below).

This model uses the ChatML template for its inputs. For your example, the format should be something like this:
'<|im_start|>system\n<|im_end|>\n<|im_start|>user\nhello my name is <|im_end|>\n<|im_start|>assistant\n'

while the example in the issue will actually use:
<|endoftext|>hello my name is

I noticed that TL will fall back to prepending the EOS token if BOS is absent, which seems incorrect in this case.
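For clarity, a minimal sketch of building that ChatML-style prompt by hand before calling generate (the helper name build_chatml_prompt is made up, and the empty default system message is an assumption based on the example above):

def build_chatml_prompt(user_message: str, system_message: str = "") -> str:
    # ChatML format as in the example above
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

prompt = build_chatml_prompt("hello my name is ")
# model.generate(prompt, verbose=False)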

Comparison code:

import torch
from torch.nn import functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformer_lens import HookedTransformer

tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-1_8B-Chat",
    trust_remote_code=True,
    add_bos_token=True,
)
model_hf = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-1_8B-Chat",
    trust_remote_code=True,
    device_map="cuda",
    fp32=True,
).eval()

model = HookedTransformer.from_pretrained_no_processing(
    "Qwen/Qwen-1_8B-Chat",
    dtype=torch.float32,
    device="cuda",
)

encoding = tokenizer("hello my name is ", return_tensors="pt")

# Run the same tokens through both implementations.
response_hf = model_hf(encoding.input_ids.cuda())
logits_tl = model(encoding.input_ids.cuda())

# Compare raw logits and softmax probabilities.
diff = logits_tl - response_hf.logits
prob_diff = F.softmax(logits_tl, dim=-1) - F.softmax(response_hf.logits, dim=-1)
print(prob_diff.std().item(), prob_diff.mean().item(), prob_diff.max().item())

@bryce13950
Collaborator Author

bryce13950 commented Jul 27, 2024

@mntss It is completely certain that the issue is implementation inaccuracy. This topic has been discussed a lot over the last few months. If you are curious about the details, I would refer you to issues #570 and #685, the latter of which was opened today. All three of these issues are related to the same problem, but the problem is systemic. If you are curious about the fix for something like this, then I would refer you to PR #652, which resolved the issue for Mixtral, but the issue remains across most implementations. We are simply in the process of identifying which implementations are most impacted at this point.

The benchmark you ran is many orders of magnitude worse than other supported models. E.g. Mixtral was off by about one hundred-thousandth, yet it was generating French, Spanish, and German on English prompts; 0.0009 is a remarkably bad result for the benchmark you are looking at. If you are curious to help resolve the issue, then let me know and I can walk you through the resolution process. 95% of the problem is the usage of einsum, which is not used at all in any official implementation within Transformers. Once those einsums are removed, the inaccuracies clear up in almost all cases. The issue with EOS tokens could also be a part of the problem, but it is likely 10-20 different factors, as was the case with Mixtral, with the vast majority of the issue being caused by einsum.

I am in the middle of revamping the weight conversions at the moment so that benchmarking tools can be built to automate the comparison you ran, but it is a pretty involved process. Once I am done building the benchmarking tools, I will analyze each implementation currently supported by TransformerLens to identify where the inaccuracies are most pronounced. If you are interested in helping with this process, let me know! We are looking for people who are interested in helping resolve this problem across the board.
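In the meantime, for anyone who wants to help localize where the activations diverge, here is a rough sketch of a per-layer comparison, reusing model, model_hf, and encoding from the snippet above. It assumes the remote Qwen code supports the standard output_hidden_states flag and that, for a rotary-embedding model, HF's hidden_states line up with TransformerLens's resid_pre:

import torch

tokens = encoding.input_ids.cuda()

with torch.no_grad():
    hf_out = model_hf(tokens, output_hidden_states=True)
    _, cache = model.run_with_cache(tokens)

# hidden_states[0] is the embedding output and hidden_states[i] is the output
# of block i - 1; the final (post-layer-norm) entry is skipped.
for layer, hf_hidden in enumerate(hf_out.hidden_states[:-1]):
    tl_hidden = cache["resid_pre", layer]
    print(layer, (tl_hidden - hf_hidden).abs().max().item())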

@mntss
Contributor

mntss commented Jul 29, 2024

Thanks for clarifying! From the issue description, I assumed the problem was that the generated tokens were in Chinese and that the HF implementation behaves the same way for this prompt, i.e. that it was not the result of the inaccuracy.

Also, I found out that the Qwen model does not respect the torch_dtype parameter; the actual max prob difference is 2.62e-06 (mean 3.26e-09, std -1.78e-13). I have updated my comment above.
Here is the logit diff:
[image: plot of the logit diff]

It would be helpful for me to understand the target implementation accuracy for TL. I noticed this test which expects a perfect match for GPT-2: https://github.com/TransformerLensOrg/TransformerLens/blob/main/tests/integration/test_match_huggingface.py

In the case of the Qwen model, the attention modules seem like the main issue. However, the outputs of the MLP modules also do not match perfectly, because the weights are stored in different orientations: e.g. F.linear(inp, W_gate.T.contiguous()) and F.linear(inp, W_gate.T) do not produce identical outputs (see the sketch below).
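To make that concrete, here is a minimal self-contained sketch (the shapes are made up and only roughly stand in for the Qwen MLP dimensions) showing that the two calls compute the same math but need not produce bit-identical outputs, since the contiguous copy and the transposed view can hit different kernels:

import torch
from torch.nn import functional as F

torch.manual_seed(0)
inp = torch.randn(4, 2048)
W_gate = torch.randn(2048, 5504)

# Same linear map; one weight is a contiguous copy, the other a transposed view.
out_a = F.linear(inp, W_gate.T.contiguous())
out_b = F.linear(inp, W_gate.T)

print(torch.equal(out_a, out_b))           # may be False
print((out_a - out_b).abs().max().item())  # tiny float32 rounding gap, if any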
