Llama3 finetuning and generation: Double begin_of_text, no eot_id #1682
Comments
Thanks for raising that. I need to investigate in the next few days.
When you mentioned
could you check the version? I'm asking because I don't think that
`version = "0.4.10"`, but when I said
I meant I added that option to help debug.
Ah yes, the reason I was asking is that I was getting a
and I was wondering where you applied this.
You can see my (somewhat messy) branch here: https://github.com/Lightning-AI/litgpt/compare/main...sanderland:dev?expand=1
Ah, thanks! I still don't understand why this wouldn't work for me with a
Anyway, I just double-checked the
The actual prompt that is passed to the tokenizer during finetuning with the default Alpaca style looks like this:
and then with the
So that part at least looks OK to me.
Even if your prompt is correct, that doesn't mean the result of encode() is.
That is, there is a template in the tokenizer which adds "<|begin_of_text|>".
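To illustrate the failure mode being discussed: if the prompt string already contains a literal `<|begin_of_text|>` marker and the tokenizer's template also prepends the BOS token id during encode, the result has two BOS tokens. The sketch below is a toy model of that interaction, not LitGPT's actual tokenizer code; the tiny vocabulary is made up, though the Llama 3 special-token ids (128000 for `<|begin_of_text|>`, 128009 for `<|eot_id|>`) are the real ones.

```python
BOS_ID = 128000  # Llama 3's <|begin_of_text|> id
EOT_ID = 128009  # Llama 3's <|eot_id|> id

# Toy vocabulary, for demonstration only.
VOCAB = {"<|begin_of_text|>": BOS_ID, "<|eot_id|>": EOT_ID, "hello": 1, "world": 2}

def toy_encode(text: str, add_bos: bool = True) -> list[int]:
    """Mimics a tokenizer whose template unconditionally prepends BOS."""
    ids = [VOCAB[tok] for tok in text.split()]
    if add_bos:
        ids = [BOS_ID] + ids  # template-inserted BOS
    return ids

# A prompt style that ALSO bakes the BOS marker into the string:
prompt = "<|begin_of_text|> hello world"
ids = toy_encode(prompt)
print(ids)  # two BOS tokens at the front, and no <|eot_id|> at the end
```

Either the prompt template or the tokenizer should add BOS, not both; the duplication here comes purely from doing it in two places.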
This is another confusing point.
Actually, I am curious how finetuning can work at all right now, given #1699.
Bug description
When finetuning Llama3, the encoded data has a doubled <|begin_of_text|> token and no <|eot_id|>:
Seems related to #1565, but may be more widespread across models.
Going by the example which downloads the Alpaca finance dataset:
and adding this to full.py, along with support for `skip_special_tokens=False`, gives:
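A quick way to surface this kind of problem before training is to validate the special tokens in each encoded sample. The helper below is hypothetical (not part of LitGPT); it checks the two symptoms reported in this issue, a doubled `<|begin_of_text|>` and a missing trailing `<|eot_id|>`, against raw token-id lists.

```python
BOS_ID = 128000  # Llama 3's <|begin_of_text|> id
EOT_ID = 128009  # Llama 3's <|eot_id|> id

def check_special_tokens(ids: list[int]) -> list[str]:
    """Return a list of problems found in one encoded training sample."""
    problems = []
    n_bos = ids.count(BOS_ID)
    if n_bos != 1:
        problems.append(f"expected exactly one <|begin_of_text|>, found {n_bos}")
    if not ids or ids[-1] != EOT_ID:
        problems.append("sample does not end with <|eot_id|>")
    return problems

# The doubled-BOS / missing-EOT pattern described in this issue:
bad_sample = [BOS_ID, BOS_ID, 1, 2, 3]
print(check_special_tokens(bad_sample))   # reports both problems

good_sample = [BOS_ID, 1, 2, 3, EOT_ID]
print(check_special_tokens(good_sample))  # -> []
```

Running a check like this over the first batch (e.g. right after the dataloader is built) would catch the bug regardless of which model or prompt style triggers it.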
What operating system are you using?
Unknown
LitGPT Version
(close to) main