
reward/chosen is decreasing #42

zhangguoxin1 opened this issue Jul 15, 2024 · 6 comments

zhangguoxin1 commented Jul 15, 2024

[screenshot: training curves showing reward/chosen decreasing]
Hi!
I am fine-tuning LLaMA3 on the hh-rlhf dataset using SimPO and noticed that reward/chosen is decreasing. Is this reasonable?
```yaml
# SimPOTrainer arguments
bf16: true
beta: 2.5
gamma: 1.4
per_device_train_batch_size: 2
per_device_eval_batch_size: 4
do_eval: true
eval_strategy: steps
eval_steps: 500
gradient_accumulation_steps: 8
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: False
learning_rate: 5.0e-5
num_train_epochs: 1
log_level: info
logging_steps: 5
lr_scheduler_type: cosine
max_length: 2048
max_prompt_length: 1800
optim: adamw_torch
output_dir: outputs/llama-3-8b-instruct-simpo-hh
run_name: llama-3-8b-instruct-simpo-hh
force_use_ref_model: True
push_to_hub: false
save_strategy: "steps"
save_steps: 500
remove_unused_columns: False
save_total_limit: 20
seed: 42
warmup_ratio: 0.1
```

zhangguoxin1 (Author) commented:

I expected reward/chosen to increase, but since the goal of SimPO is to maximize the gap between reward/chosen and reward/rejected, it is acceptable for reward/chosen to decrease to some extent. However, the decrease in reward/chosen seems rather large relative to the margin (reward/chosen - reward/rejected).
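For reference, my reading of the SimPO paper is that the loss only constrains the length-normalized reward margin between the chosen response $y_w$ and the rejected response $y_l$ (with $\beta$ and $\gamma$ corresponding to `beta` and `gamma` above), not the absolute reward of either:

$$
\mathcal{L}_{\text{SimPO}}(\theta) = -\log \sigma\!\left(\frac{\beta}{|y_w|}\log \pi_\theta(y_w \mid x) \;-\; \frac{\beta}{|y_l|}\log \pi_\theta(y_l \mid x) \;-\; \gamma\right)
$$

So both rewards can move down together as long as their gap keeps widening.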

yumeng5 (Collaborator) commented Jul 15, 2024

Hi,

Yes, this is reasonable. The reward margin should increase but the reward on chosen responses may slightly decrease (and the reward on rejected decreases more rapidly). In general, we don't want the reward on chosen to decrease too much (as that implies the likelihood of chosen responses is decreasing), and you may use a larger beta or a lower learning rate to mitigate the decrease of reward on chosen responses.

Best,
Yu
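
To make this concrete, here is a minimal sketch (not the actual SimPOTrainer internals), assuming the logged rewards are the β-scaled, length-normalized log-probabilities from the SimPO objective. Under that assumption, a falling rewards/chosen directly means the average log-likelihood of the chosen responses is falling.

```python
import torch
import torch.nn.functional as F

def simpo_rewards_and_loss(chosen_logps, rejected_logps,
                           chosen_lens, rejected_lens,
                           beta=2.5, gamma=1.4):
    """chosen_logps / rejected_logps: summed token log-probs per sequence (tensors)."""
    # Length-normalized, beta-scaled log-probs; these correspond to the
    # quantities logged as rewards/chosen and rewards/rejected (under the
    # assumption stated above).
    reward_chosen = beta * chosen_logps / chosen_lens
    reward_rejected = beta * rejected_logps / rejected_lens
    # Only the margin (minus gamma) enters the loss, so both rewards can
    # drift downward together as long as the margin keeps growing.
    loss = -F.logsigmoid(reward_chosen - reward_rejected - gamma).mean()
    return reward_chosen, reward_rejected, loss
```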

zhangguoxin1 (Author) commented:

Got it!

Thanks for the quick reply.

zhangguoxin1 (Author) commented Aug 19, 2024

Hi,
I used SimPO on my own task with Qwen2-7B (approximately 40,000 training examples), but the trained model generates repeated sentences and regurgitates pre-training data. The parameters are as follows:

```yaml
pref_beta: 2.5
simpo_gamma: 1.0
learning_rate: 1.0e-6
num_train_epochs: 3.0
```


I'm now trying a larger beta = 8.0.

xiamengzhou (Contributor) commented:

@zhangguoxin1
I think you should be using Qwen2-7B-Instruct rather than Qwen2-7B if you are only running preference optimization. Also, I'd suggest using online data rather than offline data generated by other models.
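
In case it helps, here is a rough sketch of what "online" (on-policy) data could look like in practice: sample several candidates from the policy you are about to train, then pick chosen/rejected pairs with a scorer. The model name and `score_fn` below are placeholders for illustration, not something prescribed in this thread.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Qwen/Qwen2-7B-Instruct"  # placeholder policy checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
policy = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto")

def sample_candidates(prompt, n=4, max_new_tokens=512):
    # Sample n on-policy responses for one prompt.
    messages = [{"role": "user", "content": prompt}]
    inputs = tok.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(policy.device)
    outputs = policy.generate(
        inputs, do_sample=True, temperature=0.8, top_p=0.95,
        num_return_sequences=n, max_new_tokens=max_new_tokens)
    return [tok.decode(o[inputs.shape[1]:], skip_special_tokens=True)
            for o in outputs]

def make_pair(prompt, score_fn):
    # score_fn: any reward model or judge mapping (prompt, response) -> float.
    cands = sample_candidates(prompt)
    ranked = sorted(cands, key=lambda r: score_fn(prompt, r), reverse=True)
    return {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}
```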

yumeng5 (Collaborator) commented Aug 19, 2024

Hi @zhangguoxin1

In addition to the suggestions by Mengzhou, you may try the following as well:

  • decrease the learning rate (we usually start learning rate search around 5e-7)
  • reduce the number of training epochs (we generally train the model for only one epoch)

Best,
Yu
