Segfault in scatter_min with ROCm #420

Open
jychoi-hpc opened this issue Feb 12, 2024 · 17 comments

Comments

@jychoi-hpc

jychoi-hpc commented Feb 12, 2024

I am trying to run pytorch_scatter with ROCm but keep getting a segfault. I installed the ROCm build of PyTorch (stable 2.2) with pip and then built pytorch_scatter from the source code on the master branch (last commit c095c62). However, I get a segfault with the following case:

import torch
from torch_scatter import scatter_min

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

src = torch.Tensor([[-2, 0, -1, -4, -3], [0, -2, -1, -3, -4]]).to(device)
index = torch.tensor([[ 4, 5,  4,  2,  3], [0,  0,  2,  2,  1]]).to(device)
out = src.new_zeros((2, 6)).to(device)

out, argmin = scatter_min(src, index, out=out)

print(out)
print(argmin)

It produced a core file, which shows the following backtrace:

#0  0x00007f8f0c02a22c in std::map<std::string, ReductionType, std::less<std::string>, std::allocator<std::pair<std::string const, ReductionType> > >::at(std::string const&) const ()
   from /lustre/orion/cph161/world-shared/jyc/frontier/sw/anaconda3/2022.10/envs/py38-rocm571/lib/python3.8/site-packages/torch_scatter-2.1.2-py3.8-linux-x86_64.egg/torch_scatter/_scatter_cuda.so
#1  0x00007f8f0c01768b in scatter_cuda(at::Tensor, at::Tensor, long, std::optional<at::Tensor>, std::optional<long>, std::string) ()
   from /lustre/orion/cph161/world-shared/jyc/frontier/sw/anaconda3/2022.10/envs/py38-rocm571/lib/python3.8/site-packages/torch_scatter-2.1.2-py3.8-linux-x86_64.egg/torch_scatter/_scatter_cuda.so
#2  0x00007f8f0c02f951 in scatter_fw (src=..., index=..., dim=1, optional_out=..., dim_size=..., reduce=...) at csrc/scatter_hip.cpp:42
#3  0x00007f8f0c044039 in ScatterMin::forward (ctx=ctx@entry=0x6ab3b08, src=..., index=..., dim=<optimized out>, optional_out=..., dim_size=...) at csrc/scatter_hip.cpp:175
#4  0x00007f8f0c044dca in torch::autograd::Function<ScatterMin>::apply<ScatterMin, at::Tensor&, at::Tensor&, long&, std::optional<at::Tensor>&, std::optional<long>&> ()
    at /lustre/orion/world-shared/cph161/jyc/frontier/sw/anaconda3/2022.10/envs/py38-rocm571/lib/python3.8/site-packages/torch/include/torch/csrc/autograd/custom_function.h:305
#5  0x00007f8f0c0311b5 in scatter_min (src=..., index=..., dim=<optimized out>, optional_out=..., dim_size=...) at csrc/scatter_hip.cpp:261
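
For reference, a minimal sketch of how the failing path could be isolated (the plain CPU call should go through the regular CPU dispatch, while the .cuda() variant mirrors the call that segfaults above; this is only an outline, not a verified result):

import torch
from torch_scatter import scatter_min

src = torch.tensor([[-2., 0., -1., -4., -3.], [0., -2., -1., -3., -4.]])
index = torch.tensor([[4, 5, 4, 2, 3], [0, 0, 2, 2, 1]])

# CPU dispatch: if this prints sensible values, the inputs are fine and the
# problem is narrowed down to the ROCm/HIP build of the extension.
out_cpu, argmin_cpu = scatter_min(src, index, out=src.new_zeros((2, 6)))
print(out_cpu, argmin_cpu)

# GPU dispatch: this mirrors the failing call from the snippet above.
if torch.cuda.is_available():
    out_gpu, argmin_gpu = scatter_min(src.cuda(), index.cuda(),
                                      out=src.new_zeros((2, 6)).cuda())
    print(out_gpu.cpu(), argmin_gpu.cpu())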

I appreciate any advice in advance.

@rusty1s
Owner

rusty1s commented Feb 12, 2024

@Looong01 Do you see similar issues when installing torch-scatter on ROCM? Do you know what might cause this?

@Looong01

Looong01 commented Feb 12, 2024

> I am trying to run pytorch_scatter with ROCm but keep getting a segfault. [...]

Sorry, I receive no errors when I run this code.
My devices are a Radeon RX 7900 XTX and an RX 6700 XT, and the code runs smoothly on both of them.
My environment is Python 3.10 and ROCm 6.0.2.
This is my screenshot:
[screenshot]

@Looong01

  1. What is the error shown in bash?
  2. What is the type of your GPU?
  3. Maybe you could update your ROCm version and then test it again.

P.S. I also tested it on Python 3.8 and encountered no errors.

@jychoi-hpc
Author

  1. It is just a segfault:

$ python test.py 
scatter_min: torch.Size([2, 5])
Segmentation fault (core dumped)

  2. AMD MI250X (gfx90a)
  3. Will try. Thank you for the advice.

@jychoi-hpc
Author

I have one question. I am trying to do some simple debugging. Based on the traces from the core dump, could you advise which source file I should look at and where to add some debugging output?

@Looong01

> I have one question. I am trying to do some simple debugging. Based on the traces from the core dump, could you advise which source file I should look at and where to add some debugging output?

Actually, I cannot really help debug this kind of problem, because torch_scatter consists of CUDA & C++ code.
In my experience, a core dump is the kind of error where almost anything could be the cause, but I think it is definitely due to the CUDA/C++ module.
So the only suggestions I can give you are: 1. try reinstalling a brand new OS, or 2. use Docker to get a brand new OS environment.

@jychoi-hpc
Author

Thank you for the advice. Unfortunately, I cannot install a new OS. If I find any clue, I will post it here.

@ashwinma
Contributor

ashwinma commented Feb 19, 2024

I am trying something similar with ROCm 6. I am getting errors for scatter_min and scatter_max -- but scatter_mean and scatter_sum work fine!

I installed PyTorch for ROCm 6 as below:

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.0

>>> from torch_scatter import scatter_min
>>> scatter_min(src, index)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/lustre/orion/proj-shared/ven114/ashwinaji/miniconda3/envs/pyg_rocm60/lib/python3.10/site-packages/torch_scatter/scatter.py", line 65, in scatter_min
    return torch.ops.torch_scatter.scatter_min(src, index, dim, out, dim_size)
  File "/lustre/orion/proj-shared/ven114/ashwinaji/miniconda3/envs/pyg_rocm60/lib/python3.10/site-packages/torch/_ops.py", line 825, in __call__
    return self_._op(*args, **(kwargs or {}))
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 699051.85 GiB. GPU
>>> from torch_scatter import scatter_max
>>> scatter_max(src, index)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/lustre/orion/proj-shared/ven114/ashwinaji/miniconda3/envs/pyg_rocm60/lib/python3.10/site-packages/torch_scatter/scatter.py", line 72, in scatter_max
    return torch.ops.torch_scatter.scatter_max(src, index, dim, out, dim_size)
  File "/lustre/orion/proj-shared/ven114/ashwinaji/miniconda3/envs/pyg_rocm60/lib/python3.10/site-packages/torch/_ops.py", line 825, in __call__
    return self_._op(*args, **(kwargs or {}))
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 699051.66 GiB. GPU
>>> from torch_scatter import scatter_sum
>>> out, argmin = scatter_sum(src, index)
>>> scatter_sum(src, index)
tensor([[ 0.,  0., -4., -3., -3.,  0.],
        [-2., -4., -4.,  0.,  0.,  0.]], device='cuda:0')
>>> from torch_scatter import scatter_mean
>>> scatter_mean(src, index)
tensor([[ 0.0000,  0.0000, -4.0000, -3.0000, -1.5000,  0.0000],
        [-1.0000, -4.0000, -2.0000,  0.0000,  0.0000,  0.0000]],
       device='cuda:0')
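
As a rough sanity check on that number (a back-of-the-envelope sketch, assuming float32 outputs and only restating the figure from the traceback above):

# The failed HIP allocation reported above for scatter_min.
req_bytes = 699051.85 * 1024**3           # 699051.85 GiB in bytes
elems = req_bytes / 4                     # assuming 4-byte float32 elements
print(f"{elems:.3e} elements requested")  # ~1.9e14
print(f"{2 * 6} elements needed for a (2, 6) output")

The requested size is wildly out of proportion to the tensors involved, which suggests the size being derived for the min/max path on HIP is garbage rather than a genuine memory shortage.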

@ashwinma
Contributor

I must note that this issue is not present when I use PT 2.0.1+rocm5.3. It only manifests when we try PT 2.2+ROCm 5.7 or 2.3.0.dev20240219+rocm6.0.

@Looong01

> I must note that this issue is not present when I use PT 2.0.1+rocm5.3. It only manifests when we try PT 2.2+ROCm 5.7 or 2.3.0.dev20240219+rocm6.0.

Did you try the wheels I compiled?

@ashwinma
Contributor

ashwinma commented Feb 19, 2024

> I must note that this issue is not present when I use PT 2.0.1+rocm5.3. It only manifests when we try PT 2.2+ROCm 5.7 or 2.3.0.dev20240219+rocm6.0.

> Did you try the wheels I compiled?

Yes, I did. I tried just importing torch_scatter, but it gave me the GLIBC error below:

>>> import torch
>>> from torch_scatter import scatter_min
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/lustre/orion/proj-shared/ven114/ashwinaji/miniconda3/envs/pyg_rocm57/lib/python3.10/site-packages/torch_scatter/__init__.py", line 16, in <module>
    torch.ops.load_library(spec.origin)
  File "/lustre/orion/proj-shared/ven114/ashwinaji/miniconda3/envs/pyg_rocm57/lib/python3.10/site-packages/torch/_ops.py", line 933, in load_library
    ctypes.CDLL(path)
  File "/lustre/orion/proj-shared/ven114/ashwinaji/miniconda3/envs/pyg_rocm57/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by /lustre/orion/ven114/proj-shared/ashwinaji/miniconda3/envs/pyg_rocm57/lib/python3.10/site-packages/torch_scatter/_version_cuda.so)
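
A sketch for comparing what the wheel requires against what the system libc provides, without importing the package (the import itself is what fails); it assumes objdump from binutils is available on the node:

import importlib.util, os, re, subprocess

# Locate the installed extension without running torch_scatter/__init__.py,
# since importing the package is exactly what raises the OSError above.
spec = importlib.util.find_spec("torch_scatter")
so_path = os.path.join(spec.submodule_search_locations[0], "_version_cuda.so")

# GLIBC symbol versions the prebuilt .so was linked against.
dump = subprocess.run(["objdump", "-T", so_path], capture_output=True, text=True).stdout
print("required:", sorted(set(re.findall(r"GLIBC_[0-9.]+", dump))))

# GLIBC version the system actually provides.
ldd = subprocess.run(["ldd", "--version"], capture_output=True, text=True).stdout
print("provided:", ldd.splitlines()[0])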

@Looong01

Well, you need to update your glibc version, and maybe your g++ to g++-12. You can see another discussion in this repo where I answered how to deal with this problem.

@ashwinma
Contributor

The GCC/G++ version is indeed 12

> gcc --version
gcc (GCC) 12.2.0 20220819 (HPE)

The OS is SUSE Linux Enterprise Server 15 SP4

On which OS have you built your wheels?

@Looong01

> The OS is SUSE Linux Enterprise Server 15 SP4. On which OS have you built your wheels?

Looong01/pyg-rocm-build#3

@ashwinma
Contributor

@Looong01 unfortunately, I do not have root access and do not have the privileges to upgrade the OS on this cluster. Can you suggest any alternatives?


This issue had no activity for 6 months. It will be closed in 2 weeks unless there is some new activity. Is this issue already resolved?

@github-actions github-actions bot added the stale label Aug 19, 2024
@Looong01

> This issue had no activity for 6 months. It will be closed in 2 weeks unless there is some new activity. Is this issue already resolved?

https://github.com/Looong01/pyg-rocm-build/issues/3

I think so.

@github-actions github-actions bot removed the stale label Aug 20, 2024