Training with the DeepSpeed engine on Ascend NPU 910B3. Q1: the NPUs are not being used. Q2: does NPU health status affect training? #6428

Lexlum · 2024-12-24T06:49:22Z

Reminder

I have read the README and searched the existing issues.

System Info

[2024-12-24 14:39:49,908] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to npu (auto detect)

llamafactory version: 0.9.2.dev0
Platform: Linux-5.15.0-25-generic-aarch64-with-glibc2.27
Python version: 3.10.13
PyTorch version: 2.2.0 (NPU)
Transformers version: 4.41.2
Datasets version: 2.19.1
Accelerate version: 0.34.0
PEFT version: 0.11.1
TRL version: 0.8.6
NPU type: Ascend910B3
CANN version: 8.0.RC2.alpha001
DeepSpeed version: 0.13.2

Reproduction

NPU_VISIBLE_DEVICES="0,1,2,3,5,7" deepspeed --num_gpus 6 src/train.py \
--deepspeed examples/deepspeed/ds_z3_config.json \
--stage sft \
--model_name_or_path /home/yunwei/LLaMA-Factory/Qwen2.5-1.5B-Instruct \
--do_train \
--dataset_dir /home/yunwei/LLaMA-Factory/sft_data \
--dataset "aisp_dm_llm_dialogue_rectification" \
--template qwen \
--finetuning_type full \
--output_dir saves/qwen2.5-1.5b/test/ \
--overwrite_cache \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 1 \
--lr_scheduler_type cosine \
--logging_steps 1 \
--save_steps 5000 \
--learning_rate 1e-4 \
--num_train_epochs 2.0 \
--plot_loss \
--bf16

The above is my training command.
Problem description: when the number of NPUs is set to 8, the run hangs indefinitely at this step and never advances:
Converting format of dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4535/4535 [00:00<00:00, 5891.58 examples/s]
When I interrupt the run, it exits with:
Traceback (most recent call last):
  File "/usr/local/python3.10.13/lib/python3.10/subprocess.py", line 1209, in wait
    return self._wait(timeout=timeout)
  File "/usr/local/python3.10.13/lib/python3.10/subprocess.py", line 1959, in _wait
    (pid, sts) = self._try_wait(0)
  File "/usr/local/python3.10.13/lib/python3.10/subprocess.py", line 1917, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/python3.10.13/bin/deepspeed", line 6, in <module>
    main()
  File "/usr/local/python3.10.13/lib/python3.10/site-packages/deepspeed/launcher/runner.py", line 584, in main
    result.wait()
  File "/usr/local/python3.10.13/lib/python3.10/subprocess.py", line 1222, in wait
    self._wait(timeout=sigint_timeout)
  File "/usr/local/python3.10.13/lib/python3.10/subprocess.py", line 1953, in _wait
    time.sleep(delay)
KeyboardInterrupt
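
Note that this traceback comes from the deepspeed launcher process, interrupted while it waits on its child processes (result.wait() in runner.py), so it does not show where the training ranks themselves are stuck. One possible way to find out, assuming a few lines can be added near the top of the training entry point, is Python's standard faulthandler module, which can periodically dump every thread's stack. The helper names arm_hang_dump and disarm_hang_dump and the 300-second default are illustrative choices, not part of LLaMA-Factory or DeepSpeed.

```python
import faulthandler
import sys


def arm_hang_dump(timeout_s: float = 300.0, stream=None) -> None:
    """Dump all thread stacks every `timeout_s` seconds until cancelled.

    `stream` must be a real file object backed by a file descriptor;
    it defaults to sys.stderr. (Hypothetical helper for illustration.)
    """
    faulthandler.dump_traceback_later(
        timeout_s,
        repeat=True,
        file=stream if stream is not None else sys.stderr,
    )


def disarm_hang_dump() -> None:
    """Stop the periodic dumps, e.g. once the first training step has run."""
    faulthandler.cancel_dump_traceback_later()
```

If each rank calls arm_hang_dump() at startup, a hung rank will keep printing the frame it is blocked in, which should help tell a stalled device initialisation apart from a rank waiting on its peers.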

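Questions:
1. npu-smi info reports a health status of Warning for cards 4 and 6. Will this affect training?

2. Does NPU_VISIBLE_DEVICES actually take effect? Even though I skipped cards 4 and 6, the log still shows WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5]}, and the run hangs at Converting format of dataset.

3. I am running on an Ascend machine, but when DeepSpeed starts it logs [2024-12-24 14:35:17,414] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5. In addition, npu-smi info inside the container shows that no process has been loaded onto the NPUs. What needs to be changed to use DeepSpeed on NPUs?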
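
For questions 2 and 3, it may help to log what each worker process actually sees. The sketch below only reads environment variables and makes no claim about which of them DeepSpeed 0.13.2 or torch_npu honours. ASCEND_RT_VISIBLE_DEVICES is included on the assumption that it is the CANN runtime's device-visibility variable, and the function name visibility_report is hypothetical.

```python
import os
from typing import Dict, Mapping, Optional

# Device-visibility variables worth comparing (the second is an assumption
# about the CANN runtime; the other two appear in this issue).
DEVICE_ENV_VARS = (
    "NPU_VISIBLE_DEVICES",
    "ASCEND_RT_VISIBLE_DEVICES",
    "CUDA_VISIBLE_DEVICES",
)
# Rank variables that distributed launchers commonly export to workers.
RANK_ENV_VARS = ("RANK", "LOCAL_RANK", "WORLD_SIZE")


def visibility_report(
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, Optional[str]]:
    """Return the value (or None if unset) of each variable of interest."""
    env = os.environ if env is None else env
    return {name: env.get(name) for name in DEVICE_ENV_VARS + RANK_ENV_VARS}


if __name__ == "__main__":
    for name, value in visibility_report().items():
        print(f"{name}={value!r}")
```

Calling visibility_report() at the start of the training script on every rank shows whether the variable set on the command line survives into the workers, and which variable the launcher overwrote.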

Expected behavior

No response

Others

No response


Lexlum · 2024-12-24T06:52:04Z

Addendum: if I add the --include argument to select the NPUs, I get: train.py: error: ambiguous option: --include=localhost:0,1,2,3,5,7 could match --include_inputs_for_metrics, --include_tokens_per_second, --include_num_input_tokens_seen, --include_effective_tokens_per_second
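
The error is reported by train.py rather than by deepspeed, which suggests that --include reached the training script's argument parser, most likely because it was placed after the script path. Python's argparse accepts unique prefixes of long options by default, so --include is treated as an abbreviation and is ambiguous among the options that start with it. A minimal sketch reproducing the behaviour follows; the parser is a hypothetical stand-in that borrows only two option names from the error message.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Stand-in for the training script's parser, not the real one.
    parser = argparse.ArgumentParser(prog="train.py")
    parser.add_argument("--include_inputs_for_metrics", action="store_true")
    parser.add_argument("--include_tokens_per_second", action="store_true")
    return parser


def is_rejected(argv) -> bool:
    """Return True if argparse exits with an error for `argv`."""
    try:
        build_parser().parse_args(argv)
    except SystemExit:  # argparse reports errors by calling sys.exit(2)
        return True
    return False


if __name__ == "__main__":
    # Ambiguous prefix: matches both --include_* options.
    print(is_rejected(["--include=localhost:0,1,2,3,5,7"]))
    # Exact option name: accepted.
    print(is_rejected(["--include_tokens_per_second"]))
```

Launcher options therefore need to appear before src/train.py on the command line, so that the launcher consumes them instead of forwarding them to the script.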

Lexlum · 2024-12-24T07:04:06Z

With deepspeed --include localhost:0,1,2,3,5,7 src/train.py the job launches and logs Setting CUDA_VISIBLE_DEVICES=0,1,2,3,5,7, which suggests the device selection took effect, but the run still hangs once Converting format of dataset reaches 100%. Oddly, deepspeed --num_gpus 4 src/train.py trains normally.
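
The launcher invocations in this thread differ in which device indices they select. The sketch below models only what the logs in this issue show (--num_gpus 6 yielding [0, 1, 2, 3, 4, 5], and --include localhost:0,1,2,3,5,7 yielding 0,1,2,3,5,7). It is not DeepSpeed's implementation, and both function names are made up for illustration.

```python
from typing import Dict, List


def slots_from_num_gpus(num_gpus: int, host: str = "localhost") -> Dict[str, List[int]]:
    """Model of --num_gpus N as observed here: the first N indices."""
    return {host: list(range(num_gpus))}


def slots_from_include(include: str) -> Dict[str, List[int]]:
    """Parse a single-host 'host:0,1,2' string into {host: [0, 1, 2]}."""
    host, _, slots = include.partition(":")
    if not host or not slots:
        raise ValueError(f"expected 'host:i,j,k', got {include!r}")
    return {host: [int(s) for s in slots.split(",")]}


if __name__ == "__main__":
    print(slots_from_num_gpus(6))
    print(slots_from_include("localhost:0,1,2,3,5,7"))
```

Under this model, --num_gpus 6 always includes index 4, which matches the WORLD INFO DICT reported earlier despite NPU_VISIBLE_DEVICES being set, whereas the --include form skips cards 4 and 6 explicitly.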

github-actions bot added the pending (This problem is yet to be addressed) and npu (This problem is related to NPU devices) labels on Dec 24, 2024
