Right after the model finishes loading and training starts, I get torch.utils.checkpoint.CheckpointError: torch.utils.checkpoint: Recomputed values for the following tensors have different metadata than during the forward pass. #6438
System Info
[2024-12-25 03:16:41,883] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
llamafactory version: 0.9.2.dev0
Reproduction
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/developer/zctech/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
[rank0]: launch()
[rank0]: File "/home/developer/zctech/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank0]: run_exp()
[rank0]: File "/home/developer/zctech/LLaMA-Factory/src/llamafactory/train/tuner.py", line 59, in run_exp
[rank0]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank0]: File "/home/developer/zctech/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 101, in run_sft
[rank0]: train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/developer/anaconda3/lib/python3.12/site-packages/transformers/trainer.py", line 2122, in train
[rank0]: return inner_training_loop(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/developer/anaconda3/lib/python3.12/site-packages/transformers/trainer.py", line 2474, in _inner_training_loop
[rank0]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/developer/anaconda3/lib/python3.12/site-packages/transformers/trainer.py", line 3606, in training_step
[rank0]: self.accelerator.backward(loss, **kwargs)
[rank0]: File "/home/developer/anaconda3/lib/python3.12/site-packages/accelerate/accelerator.py", line 2238, in backward
[rank0]: self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank0]: File "/home/developer/anaconda3/lib/python3.12/site-packages/accelerate/utils/deepspeed.py", line 186, in backward
[rank0]: self.engine.backward(loss, **kwargs)
[rank0]: File "/home/developer/anaconda3/lib/python3.12/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank0]: ret_val = func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/developer/anaconda3/lib/python3.12/site-packages/deepspeed/runtime/engine.py", line 2020, in backward
[rank0]: self.optimizer.backward(loss, retain_graph=retain_graph)
[rank0]: File "/home/developer/anaconda3/lib/python3.12/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank0]: ret_val = func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/developer/anaconda3/lib/python3.12/site-packages/deepspeed/runtime/zero/stage3.py", line 2259, in backward
[rank0]: self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
[rank0]: File "/home/developer/anaconda3/lib/python3.12/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
[rank0]: scaled_loss.backward(retain_graph=retain_graph)
[rank0]: File "/home/developer/anaconda3/lib/python3.12/site-packages/torch/_tensor.py", line 581, in backward
[rank0]: torch.autograd.backward(
[rank0]: File "/home/developer/anaconda3/lib/python3.12/site-packages/torch/autograd/__init__.py", line 347, in backward
[rank0]: _engine_run_backward(
[rank0]: File "/home/developer/anaconda3/lib/python3.12/site-packages/torch/autograd/graph.py", line 825, in _engine_run_backward
[rank0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/developer/anaconda3/lib/python3.12/site-packages/torch/autograd/function.py", line 307, in apply
[rank0]: return user_fn(self, *args)
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/developer/anaconda3/lib/python3.12/site-packages/torch/amp/autocast_mode.py", line 511, in decorate_bwd
[rank0]: return bwd(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/developer/anaconda3/lib/python3.12/site-packages/deepspeed/runtime/zero/linear.py", line 80, in backward
[rank0]: input, weight, bias = ctx.saved_tensors
[rank0]: ^^^^^^^^^^^^^^^^^
[rank0]: File "/home/developer/anaconda3/lib/python3.12/site-packages/torch/utils/checkpoint.py", line 1129, in unpack_hook
[rank0]: frame.check_recomputed_tensors_match(gid)
[rank0]: File "/home/developer/anaconda3/lib/python3.12/site-packages/torch/utils/checkpoint.py", line 903, in check_recomputed_tensors_match
[rank0]: raise CheckpointError(
[rank0]: torch.utils.checkpoint.CheckpointError: torch.utils.checkpoint: Recomputed values for the following tensors have different metadata than during the forward pass.
[rank0]: tensor at position 4:
[rank0]: saved metadata: {'shape': torch.Size([4096]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 6:
[rank0]: saved metadata: {'shape': torch.Size([4608, 4096]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 7:
[rank0]: saved metadata: {'shape': torch.Size([4608]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 29:
[rank0]: saved metadata: {'shape': torch.Size([4096, 4096]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 38:
[rank0]: saved metadata: {'shape': torch.Size([4096]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 40:
[rank0]: saved metadata: {'shape': torch.Size([27392, 4096]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 49:
[rank0]: saved metadata: {'shape': torch.Size([4096, 13696]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
0%|
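For anyone triaging this, here is a minimal sketch that parses the error message above and lists saved versus recomputed shapes per tensor position. The helper name `summarize_mismatches` and the `LOG` excerpt are hypothetical conveniences, not part of LLaMA-Factory, DeepSpeed, or PyTorch. In the log every recomputed shape is `torch.Size([0])`. One possible reading (an assumption, not confirmed here) is that the recomputation saw ZeRO-3 partitioned parameters as empty placeholders.

```python
import re

# Excerpt copied from the traceback above (two of the seven reported tensors).
LOG = """\
[rank0]: tensor at position 4:
[rank0]: saved metadata: {'shape': torch.Size([4096]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 6:
[rank0]: saved metadata: {'shape': torch.Size([4608, 4096]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
"""

POSITION = re.compile(r"tensor at position (\d+):")
METADATA = re.compile(
    r"(saved|recomputed) metadata: \{'shape': torch\.Size\(\[([\d, ]*)\]\)"
)


def summarize_mismatches(log_text):
    """Map each tensor position to its saved and recomputed shapes.

    Hypothetical helper: it only parses the text of the error message.
    """
    result = {}
    current = None
    for line in log_text.splitlines():
        pos = POSITION.search(line)
        if pos:
            current = int(pos.group(1))
            result[current] = {}
            continue
        meta = METADATA.search(line)
        if meta and current is not None:
            kind, dims = meta.groups()
            # "4608, 4096" -> (4608, 4096); "0" -> (0,)
            shape = tuple(int(d) for d in dims.split(",") if d.strip())
            result[current][kind] = shape
    return result


mismatches = summarize_mismatches(LOG)
for position, shapes in mismatches.items():
    print(position, shapes["saved"], "->", shapes["recomputed"])
```

Pasting the full error message into `LOG` should make it easy to check whether every mismatched tensor was recomputed with an empty shape, which would point at the parameters rather than the activations.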
Expected behavior
CUDA_VISIBLE_DEVICES=0,1 llamafactory-cli train glm4_lora_sft_ds3.yaml
### model
model_name_or_path: THUDM/glm-4-9b-chat
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
deepspeed: ds_z3_config.json

### dataset
dataset: sft_data_multiround_with_CoT_2, sft_data_multiround_with_CoT_1
template: glm4
cutoff_len: 20000
max_samples: 1300
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/gml4/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 100
Others
No response