Hi, thank you for your interest in LMFlow! Please specify the trained LoRA adapter via `lora_model_path` instead of `resume_from_checkpoint`. We will clarify the documentation further in the future. Apologies for any confusion!
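For example, a minimal sketch of the corrected launch command (here `${lora_model_path}` is a placeholder pointing at the saved adapter directory; the flag name follows the reply above, the remaining flags are taken from the original script):

```bash
# Sketch of the corrected invocation: load the trained LoRA adapter via
# --lora_model_path instead of --resume_from_checkpoint.
# ${lora_model_path} is assumed to point at the saved adapter directory,
# e.g. .../output_models/opening_sft_llama3_70b_1223/result/adapter_model
deepspeed ${deepspeed_args} \
  examples/finetune.py \
    --model_name_or_path ${model_name_or_path} \
    --lora_model_path ${lora_model_path} \
    --dataset_path ${dataset_path} \
    --output_dir ${output_dir} --overwrite_output_dir \
    --use_qlora 1 \
    --deepspeed configs/ds_config_zero2.json \
    --fp16 \
    --do_train
    # ...remaining flags unchanged from the original script below
```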
I tried using `resume_from_checkpoint`, but I got an error.

Tag: v0.0.8

Script:
```bash
deepspeed ${deepspeed_args} \
  examples/finetune.py \
    --model_name_or_path ${model_name_or_path} \
    --trust_remote_code ${trust_remote_code} \
    --dataset_path ${dataset_path} \
    --conversation_template ${conversation_template} \
    --output_dir ${output_dir} --overwrite_output_dir \
    --resume_from_checkpoint ${resume_from_checkpoint_path} \
    --num_train_epochs 8 \
    --learning_rate 2e-4 \
    --block_size 1024 \
    --per_device_train_batch_size 2 \
    --use_qlora 1 \
    --save_aggregated_lora 0 \
    --deepspeed configs/ds_config_zero2.json \
    --fp16 \
    --run_name ${exp_id} \
    --validation_split_percentage 0 \
    --logging_steps 20 \
    --do_train \
    --ddp_timeout 72000 \
    --save_steps 200 \
    --dataloader_num_workers 1 \
    | tee ${log_dir}/train.log \
    2> ${log_dir}/train.err
```
```
[rank0]: Traceback (most recent call last):
[rank0]:   File "/data/rio/LMFlow-main/examples/finetune.py", line 61, in <module>
[rank0]:     main()
[rank0]:   File "/data/rio/LMFlow-main/examples/finetune.py", line 57, in main
[rank0]:     tuned_model = finetuner.tune(model=model, dataset=dataset)
[rank0]:   File "/data/alex/LMFlow/src/lmflow/pipeline/finetuner.py", line 591, in tune
[rank0]:     train_result = trainer.train(resume_from_checkpoint=checkpoint)
[rank0]:   File "/home/robotuser/miniconda3/envs/lmflow_rio/lib/python3.9/site-packages/transformers/trainer.py", line 1938, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/home/robotuser/miniconda3/envs/lmflow_rio/lib/python3.9/site-packages/transformers/trainer.py", line 2119, in _inner_training_loop
[rank0]:     deepspeed_load_checkpoint(
[rank0]:   File "/home/robotuser/miniconda3/envs/lmflow_rio/lib/python3.9/site-packages/transformers/integrations/deepspeed.py", line 442, in deepspeed_load_checkpoint
[rank0]:     raise ValueError(f"Can't find a valid checkpoint at {checkpoint_path}")
[rank0]: ValueError: Can't find a valid checkpoint at /data/rio/LMFlow-main/output_models/opening_sft_llama3_70b_1223/result/adapter_model
[2024-12-31 18:04:48,756] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 4143052
[2024-12-31 18:04:48,757] [ERROR] [launch.py:322:sigkill_handler] ['/home/robotuser/miniconda3/envs/lmflow_rio/bin/python', '-u', 'examples/finetune.py', '--local_rank=0', '--model_name_or_path', '/data/hf_cache/hub/models--meta-llama--Meta-Llama-3-70B-Instruct/snapshots/5fcb2901844dde3111159f24205b71c25900ffbd', '--trust_remote_code', '0', '--dataset_path', '/data/rio/LMFlow-main/data/opening_10000_views', '--conversation_template', 'llama3', '--output_dir', '/data/rio/LMFlow-main/output_models/opening_10000_views', '--overwrite_output_dir', '--resume_from_checkpoint', '/data/rio/LMFlow-main/output_models/opening_sft_llama3_70b_1223/result/adapter_model', '--num_train_epochs', '8', '--learning_rate', '2e-4', '--block_size', '1024', '--per_device_train_batch_size', '2', '--use_qlora', '1', '--save_aggregated_lora', '0', '--deepspeed', 'configs/ds_config_zero2.json', '--fp16', '--run_name', 'finetune_with_qlora_opening_10000_views', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '200', '--dataloader_num_workers', '1'] exits with return code = 1
```
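For context on why this fails: `deepspeed_load_checkpoint` in transformers looks for `global_step*` state inside the given directory and raises exactly this ValueError when none is found, and a bare PEFT adapter directory contains only the LoRA weights, not resumable trainer state. A rough illustration (the directory contents below are assumptions based on the usual PEFT/Trainer layouts, not taken from this run):

```bash
# What --resume_from_checkpoint was pointed at: a PEFT adapter save
ls output_models/opening_sft_llama3_70b_1223/result/adapter_model
# adapter_config.json  adapter_model.safetensors (or adapter_model.bin)
#   -> LoRA weights only, no DeepSpeed state

# What deepspeed_load_checkpoint expects: a Trainer checkpoint with DeepSpeed state
ls ${output_dir}/checkpoint-200
# global_step200/  trainer_state.json  ...
#   -> the global_step* directory is what the loader globs for
```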