Training with the DeepSpeed engine on Ascend NPU 910B3. Q1: the NPUs are not used. Q2: does NPU health status affect training? #6428

Open
1 task done
Lexlum opened this issue Dec 24, 2024 · 2 comments
Labels
npu This problem is related to NPU devices pending This problem is yet to be addressed

Comments

Lexlum commented Dec 24, 2024

Reminder

  • I have read the README and searched the existing issues.

System Info

[2024-12-24 14:39:49,908] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to npu (auto detect)

  • llamafactory version: 0.9.2.dev0
  • Platform: Linux-5.15.0-25-generic-aarch64-with-glibc2.27
  • Python version: 3.10.13
  • PyTorch version: 2.2.0 (NPU)
  • Transformers version: 4.41.2
  • Datasets version: 2.19.1
  • Accelerate version: 0.34.0
  • PEFT version: 0.11.1
  • TRL version: 0.8.6
  • NPU type: Ascend910B3
  • CANN version: 8.0.RC2.alpha001
  • DeepSpeed version: 0.13.2

Reproduction

NPU_VISIBLE_DEVICES="0,1,2,3,5,7" deepspeed --num_gpus 6 src/train.py \
    --deepspeed examples/deepspeed/ds_z3_config.json \
    --stage sft \
    --model_name_or_path /home/yunwei/LLaMA-Factory/Qwen2.5-1.5B-Instruct \
    --do_train \
    --dataset_dir /home/yunwei/LLaMA-Factory/sft_data \
    --dataset "aisp_dm_llm_dialogue_rectification" \
    --template qwen \
    --finetuning_type full \
    --output_dir saves/qwen2.5-1.5b/test/ \
    --overwrite_cache \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --save_steps 5000 \
    --learning_rate 1e-4 \
    --num_train_epochs 2.0 \
    --plot_loss \
    --bf16

The above is my training command.
Problem description: when the number of NPUs is set to 8, the run hangs at the step Converting format of dataset: 100%|██████████| 4535/4535 [00:00<00:00, 5891.58 examples/s] and never advances. If I cancel at this point, the process exits with:

Traceback (most recent call last):
  File "/usr/local/python3.10.13/lib/python3.10/subprocess.py", line 1209, in wait
    return self._wait(timeout=timeout)
  File "/usr/local/python3.10.13/lib/python3.10/subprocess.py", line 1959, in _wait
    (pid, sts) = self._try_wait(0)
  File "/usr/local/python3.10.13/lib/python3.10/subprocess.py", line 1917, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/python3.10.13/bin/deepspeed", line 6, in <module>
    main()
  File "/usr/local/python3.10.13/lib/python3.10/site-packages/deepspeed/launcher/runner.py", line 584, in main
    result.wait()
  File "/usr/local/python3.10.13/lib/python3.10/subprocess.py", line 1222, in wait
    self._wait(timeout=sigint_timeout)
  File "/usr/local/python3.10.13/lib/python3.10/subprocess.py", line 1953, in _wait
    time.sleep(delay)
KeyboardInterrupt

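Questions:
1. npu-smi info reports a health status of warning for cards 4 and 6. Will this affect training?

2. Does NPU_VISIBLE_DEVICES actually take effect? Even though I skipped cards 4 and 6, the log still shows WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5]}, and the run hangs at Converting format of dataset.
Uploading 微信截图_20241224144859.png…

3. I am running on an Ascend machine, but when DeepSpeed starts it logs [2024-12-24 14:35:17,414] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5. Inside the container, npu-smi info also shows that no process has been loaded onto the NPUs. What needs to be changed to use DeepSpeed on NPUs?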
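
One way to narrow down questions 2 and 3 is to print, from inside each launched process, which device-visibility variables are set and how many NPUs PyTorch can see. The sketch below is only a diagnostic under assumptions: `ASCEND_RT_VISIBLE_DEVICES` is the variable the Ascend runtime is generally documented to read, whether the installed DeepSpeed and torch_npu versions honour `NPU_VISIBLE_DEVICES` is not confirmed by this report, and the helper names `visibility_env` and `npu_device_count` are hypothetical.

```python
import os

# Variables that may restrict device visibility. Which of them the installed
# DeepSpeed / torch_npu versions actually honour is an assumption to verify.
VISIBILITY_VARS = (
    "ASCEND_RT_VISIBLE_DEVICES",
    "NPU_VISIBLE_DEVICES",
    "CUDA_VISIBLE_DEVICES",
)


def visibility_env(env=None):
    """Return the value (or None) of each visibility variable in `env`."""
    env = os.environ if env is None else env
    return {name: env.get(name) for name in VISIBILITY_VARS}


def npu_device_count():
    """Return the NPU count seen by this process, or None without torch_npu."""
    try:
        import torch
        import torch_npu  # noqa: F401  (registers the torch.npu namespace)
    except ImportError:
        return None
    return torch.npu.device_count() if torch.npu.is_available() else 0


if __name__ == "__main__":
    # LOCAL_RANK is set for each worker process by the launcher.
    rank = os.environ.get("LOCAL_RANK", "?")
    print(f"local_rank={rank} env={visibility_env()} npu_count={npu_device_count()}")
```

Launching this file in place of src/train.py with the same deepspeed command would show whether the skipped cards are really hidden from the workers.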

Expected behavior

No response

Others

No response

@github-actions github-actions bot added pending This problem is yet to be addressed npu This problem is related to NPU devices labels Dec 24, 2024

Lexlum commented Dec 24, 2024

Additional note: specifying the NPUs with the --include parameter produces train.py: error: ambiguous option: --include=localhost:0,1,2,3,5,7 could match --include_inputs_for_metrics, --include_tokens_per_second, --include_num_input_tokens_seen, --include_effective_tokens_per_second
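
The error is reported by train.py rather than by the launcher, which suggests the flag reached the script's own argument parser. Python's argparse accepts unambiguous prefixes of long options, so an unknown `--include` is matched against every option that begins with it. The sketch below reproduces that behaviour with a stand-in parser: only the four option names quoted in the message are registered, and `RaisingParser`, `build_script_parser` and `try_parse` are hypothetical names, not LLaMA-Factory code.

```python
import argparse


class RaisingParser(argparse.ArgumentParser):
    """ArgumentParser that raises instead of exiting, so the message can be inspected."""

    def error(self, message):
        raise ValueError(message)


def build_script_parser():
    # Stand-in for the training script's parser; it registers only the four
    # option names that appear in the quoted error message.
    parser = RaisingParser(prog="train.py")
    for name in (
        "--include_inputs_for_metrics",
        "--include_tokens_per_second",
        "--include_num_input_tokens_seen",
        "--include_effective_tokens_per_second",
    ):
        parser.add_argument(name)
    return parser


def try_parse(argv):
    """Return the parser's error message for `argv`, or None if parsing succeeds."""
    try:
        build_script_parser().parse_args(argv)
    except ValueError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    # "--include" is a prefix of all four options, so argparse cannot pick one.
    print(try_parse(["--include=localhost:0,1,2,3,5,7"]))
```

Placing `--include` before the script path, so that the launcher consumes it, avoids this parser entirely.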


Lexlum commented Dec 24, 2024

With deepspeed --include localhost:0,1,2,3,5,7 src/train.py the run starts and logs Setting CUDA_VISIBLE_DEVICES=0,1,2,3,5,7, which suggests the device selection took effect, but it still hangs after Converting format of dataset reaches 100%. Strangely, deepspeed --num_gpus 4 src/train.py does train.
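
The logs quoted in this thread are consistent with a simple model of the two selection styles: `--num_gpus N` yields device indices 0 to N-1, while `--include` yields exactly the listed indices. The sketch below encodes only that model for a single host. It is not DeepSpeed's implementation, the function names are hypothetical, and the total of 8 NPUs is an assumption taken from the problem description.

```python
def slots_from_num_gpus(num_gpus):
    """`--num_gpus N` style: the first N device indices, starting at 0."""
    return list(range(num_gpus))


def slots_from_include(spec):
    """`--include host:i,j,k` style for one host: exactly the listed indices."""
    host, _, ids = spec.partition(":")
    return {host: [int(i) for i in ids.split(",")]}


def cards_in_use(slots, flagged):
    """Return the flagged card indices that appear in the selected slots."""
    return sorted(set(slots) & set(flagged))


if __name__ == "__main__":
    flagged = [4, 6]  # cards reported with a warning by npu-smi info

    six = slots_from_num_gpus(6)
    print(six, cards_in_use(six, flagged))        # includes card 4

    inc = slots_from_include("localhost:0,1,2,3,5,7")["localhost"]
    print(inc, cards_in_use(inc, flagged))        # skips both flagged cards

    four = slots_from_num_gpus(4)
    print(four, cards_in_use(four, flagged))      # skips both flagged cards
```

Under this model the `--num_gpus 6` run still selects card 4, whereas the `--include` run and the `--num_gpus 4` run both avoid cards 4 and 6. Since the `--include` run hangs anyway, the model does not by itself explain the hang.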
