Resnet device #410

Draft
wants to merge 8 commits into base: main

Conversation

ShawnXuan
Contributor

No description provided.

ShawnXuan marked this pull request as draft May 28, 2024 14:36
xiaohoua commented Jun 28, 2024

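Comparing the training output of 4x 3090 GPUs with that of 4x 910B NPUs shows that the output logic differs: with 4 NPU cards the same data is printed 4 times, whereas on the 3090 the output is printed separately.
910B output:
image
3090 output:
image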
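
To tell whether the four identical lines on the 910B come from four processes each printing the same values, it may help to tag every log line with the process rank. A minimal sketch, assuming the launcher exports `RANK` and `WORLD_SIZE` environment variables (an assumption, not confirmed in this thread); `rank_prefix` and `tagged` are hypothetical helper names, not part of train.py:

```python
import os


def rank_prefix(environ=None):
    """Build a '[rank r/w]' tag from RANK and WORLD_SIZE.

    Assumption: the distributed launcher sets these variables.
    Falls back to rank 0 of 1 when they are missing.
    """
    env = os.environ if environ is None else environ
    rank = int(env.get("RANK", "0"))
    world_size = int(env.get("WORLD_SIZE", "1"))
    return f"[rank {rank}/{world_size}]"


def tagged(message, environ=None):
    """Prefix a log message with the rank tag of the current process."""
    return f"{rank_prefix(environ)} {message}"


if __name__ == "__main__":
    # Each process prints its own tag, so duplicated lines can be
    # attributed to a specific rank.
    print(tagged("train step output"))
```

If the four lines carry four different rank tags but identical values, the duplication would point at how per-rank data is gathered or printed rather than at a single process logging repeatedly.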

0x404 commented Jul 16, 2024

Graph test script: train_graph_distributed_fp32.sh

Running it fails with the following error:

[ERROR](GRAPH:TrainGraph_0:TrainGraph) building plan got error.
Traceback (most recent call last):
  File "/data1/home/zengqunhong/models/Vision/classification/image/resnet50/train.py", line 390, in <module>
    trainer()
  File "/data1/home/zengqunhong/models/Vision/classification/image/resnet50/train.py", line 223, in __call__
    self.train()
  File "/data1/home/zengqunhong/models/Vision/classification/image/resnet50/train.py", line 228, in train
    self.train_one_epoch()
  File "/data1/home/zengqunhong/models/Vision/classification/image/resnet50/train.py", line 248, in train_one_epoch
    loss, pred, label = self.train_graph()
  File "/data1/home/zengqunhong/miniconda3/envs/torchnpu/lib/python3.9/site-packages/oneflow/nn/graph/graph.py", line 284, in __call__
    self._compile(*args, **kwargs)
  File "/data1/home/zengqunhong/miniconda3/envs/torchnpu/lib/python3.9/site-packages/oneflow/nn/graph/graph.py", line 852, in _compile
    return self._compile_new(*args, **kwargs)
  File "/data1/home/zengqunhong/miniconda3/envs/torchnpu/lib/python3.9/site-packages/oneflow/nn/graph/graph.py", line 876, in _compile_new
    self.finish_compile_and_init_runtime()
  File "/data1/home/zengqunhong/miniconda3/envs/torchnpu/lib/python3.9/site-packages/oneflow/nn/graph/graph.py", line 1427, in finish_compile_and_init_runtime
    self._c_nn_graph.compile_plan_for_runtime()
oneflow._oneflow_internal.exception.RuntimeError: Error: TaskType: 1, DeviceType: 6 has not been registered

With the oneflow-npu PR that adds graph support merged (https://github.com/Oneflow-Inc/oneflow-npu/pull/217), the error is as follows:

Stack trace (most recent call last) in thread 3030764:
   Object "/data1/home/zengqunhong/miniconda3/envs/torchnpu/lib/python3.9/site-packages/oneflow/../oneflow.libs/liboneflow-1561515c.so", at 0xfffed5bbca17, in 
   Object "/data1/home/zengqunhong/miniconda3/envs/torchnpu/lib/python3.9/site-packages/oneflow/../oneflow.libs/liboneflow-1561515c.so", at 0xfffed5bbac2b, in Thread::PollMsgChannel()
   Object "/data1/home/zengqunhong/miniconda3/envs/torchnpu/lib/python3.9/site-packages/oneflow/../oneflow.libs/liboneflow-1561515c.so", at 0xfffed57eb8d7, in 
   Object "/data1/home/zengqunhong/miniconda3/envs/torchnpu/lib/python3.9/site-packages/oneflow/../oneflow.libs/liboneflow-1561515c.so", at 0xfffed549b0c7, in Kernel::Launch(KernelContext*) const
   Object "/data1/home/zengqunhong/miniconda3/envs/torchnpu/lib/python3.9/site-packages/oneflow/../oneflow.libs/liboneflow-1561515c.so", at 0xfffed549aa63, in Kernel::Forward(KernelContext*) const
   Object "/data1/home/zengqunhong/miniconda3/envs/torchnpu/lib/python3.9/site-packages/oneflow/../oneflow.libs/liboneflow-1561515c.so", at 0xfffed54e08b3, in UserKernel::ForwardDataContent(KernelContext*) const
   Object "/data1/home/zengqunhong/oneflow-npu/build/temp.linux-aarch64-cpython-39/oneflow_npu/liboneflow_npu.so", at 0xfffe6c5d0678, in 


3 participants