Llama3 finetuning and generation: Double begin_of_text, no eot_id #1682
Comments
Thanks for raising that. I need to investigate in the next few days.
When you mentioned
could you check the version? I'm asking because I don't think that
`version = "0.4.10"`, but when I said
I meant I added that option to help debug.
Ah yes, the reason I was asking is that I was getting a
and I was wondering where you applied this.
You can see my (somewhat messy) branch here: https://github.com/Lightning-AI/litgpt/compare/main...sanderland:dev?expand=1
Ah, thanks! I still don't understand why this wouldn't work for me with a
Anyway, I just double-checked the
The actual prompt that is passed to the tokenizer during finetuning with the default Alpaca style looks like this:
and then with the
So that part at least looks OK to me.
Even if your prompt is correct, that doesn't mean the result of encode() is.
That is, there is a template in the tokenizer which adds "<|begin_of_text|>".
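To illustrate the failure mode being discussed: if the prompt string already contains a literal `<|begin_of_text|>` marker and the tokenizer's template also prepends the BOS token id during encode, the result has two BOS tokens. The sketch below is a toy model of that interaction, not LitGPT's actual tokenizer code; the tiny vocabulary is made up, though the Llama 3 special-token ids (128000 for `<|begin_of_text|>`, 128009 for `<|eot_id|>`) are the real ones.

```python
BOS_ID = 128000  # Llama 3's <|begin_of_text|> id
EOT_ID = 128009  # Llama 3's <|eot_id|> id

# Toy vocabulary, for demonstration only.
VOCAB = {"<|begin_of_text|>": BOS_ID, "<|eot_id|>": EOT_ID, "hello": 1, "world": 2}

def toy_encode(text: str, add_bos: bool = True) -> list[int]:
    """Mimics a tokenizer whose template unconditionally prepends BOS."""
    ids = [VOCAB[tok] for tok in text.split()]
    if add_bos:
        ids = [BOS_ID] + ids  # template-inserted BOS
    return ids

# A prompt style that ALSO bakes the BOS marker into the string:
prompt = "<|begin_of_text|> hello world"
ids = toy_encode(prompt)
print(ids)  # two BOS tokens at the front, and no <|eot_id|> at the end
```

Either the prompt template or the tokenizer should add BOS, not both; the duplication here comes purely from doing it in two places.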
This is another confusing point.
Actually, I am curious how finetuning can work at all right now, given #1699.
Bug description
When finetuning Llama3, the encoded data has a doubled <|begin_of_text|> token and no <|eot_id|>:
Seems related to #1565, but may be more widespread across models.
Going by the example which downloads the Alpaca finance dataset:
and adding this to full.py, along with support for `skip_special_tokens=False`, gives:
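A quick way to surface this kind of problem before training is to validate the special tokens in each encoded sample. The helper below is hypothetical (not part of LitGPT); it checks the two symptoms reported in this issue, a doubled `<|begin_of_text|>` and a missing trailing `<|eot_id|>`, against raw token-id lists.

```python
BOS_ID = 128000  # Llama 3's <|begin_of_text|> id
EOT_ID = 128009  # Llama 3's <|eot_id|> id

def check_special_tokens(ids: list[int]) -> list[str]:
    """Return a list of problems found in one encoded training sample."""
    problems = []
    n_bos = ids.count(BOS_ID)
    if n_bos != 1:
        problems.append(f"expected exactly one <|begin_of_text|>, found {n_bos}")
    if not ids or ids[-1] != EOT_ID:
        problems.append("sample does not end with <|eot_id|>")
    return problems

# The doubled-BOS / missing-EOT pattern described in this issue:
bad_sample = [BOS_ID, BOS_ID, 1, 2, 3]
print(check_special_tokens(bad_sample))   # reports both problems

good_sample = [BOS_ID, 1, 2, 3, EOT_ID]
print(check_special_tokens(good_sample))  # -> []
```

Running a check like this over the first batch (e.g. right after the dataloader is built) would catch the bug regardless of which model or prompt style triggers it.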
What operating system are you using?
Unknown
LitGPT Version
(close to) main