Problems with DPO training #41
Comments
It may be a dataset quality issue; see https://huggingface.co/datasets/shareAI/DPO-zh-en-emoji
Thanks for the reply! I'd like to ask: I sampled a few examples from the DPO training set to test my trained model, but the model did not answer with the trained chosen responses (nor with the rejected ones). Its outputs don't look much different from the untrained model. Is this normal, or should the model answer training-set prompts exactly according to the chosen response once the loss converges? (I've mostly done SFT before and am not very familiar with DPO.) Also, may I ask how many epochs of DPO training are generally recommended, what loss value usually indicates good results, whether rewards/margins can serve as a metric of model quality, and whether there are tuning strategies for other parameters such as beta?
rewards/margins can be used as a reference, but it is not essential. Experimenting with beta between 0.05 and 0.5 is definitely worthwhile; in general, only 1 epoch of training is recommended.
OK, thanks~
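A rough sketch of the advice above: sweep the DPO temperature beta over a few values in [0.05, 0.5] and train a single epoch per run. It follows the same older trl DPOTrainer call style as the script posted later in this issue; the model and data paths are placeholders, the data file is assumed to already contain prompt/chosen/rejected fields, the chat-template formatting step is omitted, and the exact DPOTrainer signature depends on the installed trl version.

# Sketch: sweep the DPO temperature beta over a few values, 1 epoch each.
# Paths are placeholders; the JSON file is assumed to contain
# "prompt" / "chosen" / "rejected" fields (chat-template formatting omitted here).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "./Meta-Llama-3-8B-Instruct/"   # placeholder path
dataset = load_dataset("json", data_files="./dpo_train_data_sample.json", split="train")
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

for beta in (0.05, 0.1, 0.3, 0.5):
    # Reload the policy model for each run so the sweep points stay independent.
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.bfloat16, device_map="auto"
    )
    args = TrainingArguments(
        output_dir=f"./dpo_beta_{beta}/",
        num_train_epochs=1,                  # usually a single epoch for DPO
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=5e-6,                  # small LR; tune as needed
        bf16=True,
        logging_steps=10,
        remove_unused_columns=False,
    )
    trainer = DPOTrainer(
        model,
        ref_model=None,                      # without peft, TRL copies the model as a frozen reference
        args=args,
        beta=beta,                           # the preference-loss temperature being swept
        train_dataset=dataset,
        tokenizer=tokenizer,
        max_prompt_length=1024,
        max_length=2048,
    )
    trainer.train()
    trainer.save_model(args.output_dir)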
@CrazyBoyM @chanel111 Hi, may I ask: when you ran DPO on Chinese data directly with Llama3-Instruct, did you see the model generate repetitive responses after DPO training? Is there a general, stable solution for this? Thanks.
Hi, I ran into this problem too. I'm trying to fine-tune glm4 with the https://huggingface.co/datasets/shareAI/DPO-zh-en-emoji dataset. Both training and test loss went down, but after tuning the generated answers are almost identical to the original model's, and even on the training data the outputs show no preference. Have you solved this problem?
I'm new to DPO training and would like to ask for advice. I tried DPO training on llama-3-8b-instruct using Chinese and English DPO data found on HF. After 4 epochs the loss has dropped to around 0.1, but at test time the model not only shows no improvement but also exhibits all kinds of problems; even on prompts from the DPO training set it produces repetitive, nonsensical answers.
Below is my training code; I'm not sure whether there's a bug somewhere.
import torch
from transformers import AutoTokenizer, TrainingArguments, AutoModelForCausalLM, BitsAndBytesConfig
from datasets import load_dataset
from trl import DPOTrainer
from peft import LoraConfig

output_dir = "./llama3_dpo_lora_result/"
model_name = "./Meta-Llama-3-8B-Instruct/"

# Local JSON file with "prompt" / "chosen" / "rejected" fields.
dataset = load_dataset("json", data_files="./dpo_train_data_sample.json", split="train")
# Load the policy model in 4-bit NF4 to save memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=bnb_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Llama-3 ships without a pad token; reuse EOS for padding.
tokenizer.pad_token = tokenizer.eos_token
def print_trainable_parameters(input_model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in input_model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )
def return_prompt_and_responses(samples):
    # Wrap each prompt in the Llama-3 Instruct chat template and terminate
    # both responses with <|eot_id|>.
    return {
        "prompt": [
            f"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
            for prompt in samples["prompt"]
        ],
        "chosen": [
            f"{chosen}<|eot_id|>" for chosen in samples["chosen"]
        ],
        "rejected": [
            f"{rejected}<|eot_id|>" for rejected in samples["rejected"]
        ],
    }
original_columns = dataset.column_names
dataset = dataset.map(
    return_prompt_and_responses,
    batched=True,
    remove_columns=original_columns,
)
# LoRA adapters on all linear layers.
peft_config = LoraConfig(
    lora_alpha=256,
    lora_dropout=0.05,
    r=128,
    bias="none",
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
training_args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    max_grad_norm=0.3,
    num_train_epochs=8,
    save_steps=500,
    learning_rate=2e-6,
    bf16=True,
    save_total_limit=6,
    logging_steps=10,
    output_dir=output_dir,
    optim="paged_adamw_32bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    remove_unused_columns=False,
)
print_trainable_parameters(model)
dpo_trainer = DPOTrainer(
    model,
    ref_model=None,      # with peft_config set, TRL uses the base model (adapters disabled) as the reference
    peft_config=peft_config,
    args=training_args,
    beta=0.5,
    train_dataset=dataset,
    tokenizer=tokenizer,
    max_prompt_length=1024,
    max_length=2048,
)
dpo_trainer.train()
dpo_trainer.save_model(output_dir)
dpo_trainer.model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
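For the questions above about whether the tuned model actually follows the chosen answers, a quick sanity check is to load the saved LoRA adapter next to the base model and greedily decode a prompt taken from the training set, then compare the output with the chosen/rejected pair by eye. This is a hedged sketch assuming the adapter was written to output_dir as in the script above; the prompt itself is a placeholder.

# Sanity check after training: base model + saved LoRA adapter, greedy decoding
# on a prompt from the DPO training set, then compare against chosen/rejected.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_name = "./Meta-Llama-3-8B-Instruct/"
adapter_dir = "./llama3_dpo_lora_result/"     # output_dir from the script above

tokenizer = AutoTokenizer.from_pretrained(base_name)
base = AutoModelForCausalLM.from_pretrained(
    base_name, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, adapter_dir)
model.eval()

question = "..."  # paste a prompt from dpo_train_data_sample.json here
# Same Llama-3 chat format that was used to build the training prompts.
prompt = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    f"{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)
# The template already contains <|begin_of_text|>, so skip automatic special tokens.
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,              # greedy decoding for a repeatable comparison
        eos_token_id=tokenizer.convert_tokens_to_ids("<|eot_id|>"),
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))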