How to go on fine-tuning with a lora checkpoint? #926

Open
ORGRUI opened this issue Dec 31, 2024 · 1 comment
ORGRUI commented Dec 31, 2024

I tried using resume_from_checkpoint, but I got an error.

tag: v0.0.8

script:
deepspeed ${deepspeed_args} \
    examples/finetune.py \
    --model_name_or_path ${model_name_or_path} \
    --trust_remote_code ${trust_remote_code} \
    --dataset_path ${dataset_path} \
    --conversation_template ${conversation_template} \
    --output_dir ${output_dir} --overwrite_output_dir \
    --resume_from_checkpoint ${resume_from_checkpoint_path} \
    --num_train_epochs 8 \
    --learning_rate 2e-4 \
    --block_size 1024 \
    --per_device_train_batch_size 2 \
    --use_qlora 1 \
    --save_aggregated_lora 0 \
    --deepspeed configs/ds_config_zero2.json \
    --fp16 \
    --run_name ${exp_id} \
    --validation_split_percentage 0 \
    --logging_steps 20 \
    --do_train \
    --ddp_timeout 72000 \
    --save_steps 200 \
    --dataloader_num_workers 1 \
    | tee ${log_dir}/train.log \
    2> ${log_dir}/train.err

Error:

[rank0]: Traceback (most recent call last):
[rank0]: File "/data/rio/LMFlow-main/examples/finetune.py", line 61, in
[rank0]: main()
[rank0]: File "/data/rio/LMFlow-main/examples/finetune.py", line 57, in main
[rank0]: tuned_model = finetuner.tune(model=model, dataset=dataset)
[rank0]: File "/data/alex/LMFlow/src/lmflow/pipeline/finetuner.py", line 591, in tune
[rank0]: train_result = trainer.train(resume_from_checkpoint=checkpoint)
[rank0]: File "/home/robotuser/miniconda3/envs/lmflow_rio/lib/python3.9/site-packages/transformers/trainer.py", line 1938, in train
[rank0]: return inner_training_loop(
[rank0]: File "/home/robotuser/miniconda3/envs/lmflow_rio/lib/python3.9/site-packages/transformers/trainer.py", line 2119, in _inner_training_loop
[rank0]: deepspeed_load_checkpoint(
[rank0]: File "/home/robotuser/miniconda3/envs/lmflow_rio/lib/python3.9/site-packages/transformers/integrations/deepspeed.py", line 442, in deepspeed_load_checkpoint
[rank0]: raise ValueError(f"Can't find a valid checkpoint at {checkpoint_path}")
[rank0]: ValueError: Can't find a valid checkpoint at /data/rio/LMFlow-main/output_models/opening_sft_llama3_70b_1223/result/adapter_model
[2024-12-31 18:04:48,756] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 4143052
[2024-12-31 18:04:48,757] [ERROR] [launch.py:322:sigkill_handler] ['/home/robotuser/miniconda3/envs/lmflow_rio/bin/python', '-u', 'examples/finetune.py', '--local_rank=0', '--model_name_or_path', '/data/hf_cache/hub/models--meta-llama--Meta-Llama-3-70B-Instruct/snapshots/5fcb2901844dde3111159f24205b71c25900ffbd', '--trust_remote_code', '0', '--dataset_path', '/data/rio/LMFlow-main/data/opening_10000_views', '--conversation_template', 'llama3', '--output_dir', '/data/rio/LMFlow-main/output_models/opening_10000_views', '--overwrite_output_dir', '--resume_from_checkpoint', '/data/rio/LMFlow-main/output_models/opening_sft_llama3_70b_1223/result/adapter_model', '--num_train_epochs', '8', '--learning_rate', '2e-4', '--block_size', '1024', '--per_device_train_batch_size', '2', '--use_qlora', '1', '--save_aggregated_lora', '0', '--deepspeed', 'configs/ds_config_zero2.json', '--fp16', '--run_name', 'finetune_with_qlora_opening_10000_views', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '200', '--dataloader_num_workers', '1'] exits with return code = 1

wheresmyhair added the "pending" label Jan 1, 2025
wheresmyhair (Collaborator) commented:
Hi, thank you for your interest in LMFlow! Please specify the trained LoRA adapter using lora_model_path instead of resume_from_checkpoint. We will clarify the documentation further in the future. Apologies for any confusion!
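For reference, the maintainer's suggestion amounts to the following change to the launch script: pass the trained adapter directory via --lora_model_path and drop --resume_from_checkpoint. This is a sketch based only on the reply above (the exact set of other flags is kept from the original script and has not been re-verified against the LMFlow docs):

```shell
# Sketch only: load the trained LoRA adapter with --lora_model_path
# instead of --resume_from_checkpoint, per the maintainer's reply.
deepspeed ${deepspeed_args} \
    examples/finetune.py \
    --model_name_or_path ${model_name_or_path} \
    --lora_model_path ${resume_from_checkpoint_path} \
    --output_dir ${output_dir} --overwrite_output_dir \
    --use_qlora 1 \
    --save_aggregated_lora 0 \
    --deepspeed configs/ds_config_zero2.json \
    --do_train
# ...remaining flags (epochs, learning rate, logging, etc.) unchanged
# from the original script above.
```

Note that this starts a fresh training run initialized from the adapter weights; it does not restore optimizer or scheduler state the way a full DeepSpeed checkpoint resume would.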

wheresmyhair removed the "pending" label Jan 2, 2025