Segfault in scatter_min with ROCm #420

Open
jychoi-hpc opened this issue Feb 12, 2024 · 17 comments

Comments

@jychoi-hpc

jychoi-hpc commented Feb 12, 2024

I am trying to run pytorch_scatter with ROCm but keep getting a segfault. I installed the ROCm build of PyTorch (stable 2.2) with pip and then built pytorch_scatter from the source code on the master branch (last commit c095c62). However, I get a segfault with the following case:

import torch
from torch_scatter import scatter_min

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

src = torch.Tensor([[-2, 0, -1, -4, -3], [0, -2, -1, -3, -4]]).to(device)
index = torch.tensor([[ 4, 5,  4,  2,  3], [0,  0,  2,  2,  1]]).to(device)
out = src.new_zeros((2, 6)).to(device)

out, argmin = scatter_min(src, index, out=out)

print(out)
print(argmin)

It produced a core file, which shows the following backtrace:

#0  0x00007f8f0c02a22c in std::map<std::string, ReductionType, std::less<std::string>, std::allocator<std::pair<std::string const, ReductionType> > >::at(std::string const&) const ()
   from /lustre/orion/cph161/world-shared/jyc/frontier/sw/anaconda3/2022.10/envs/py38-rocm571/lib/python3.8/site-packages/torch_scatter-2.1.2-py3.8-linux-x86_64.egg/torch_scatter/_scatter_cuda.so
#1  0x00007f8f0c01768b in scatter_cuda(at::Tensor, at::Tensor, long, std::optional<at::Tensor>, std::optional<long>, std::string) ()
   from /lustre/orion/cph161/world-shared/jyc/frontier/sw/anaconda3/2022.10/envs/py38-rocm571/lib/python3.8/site-packages/torch_scatter-2.1.2-py3.8-linux-x86_64.egg/torch_scatter/_scatter_cuda.so
#2  0x00007f8f0c02f951 in scatter_fw (src=..., index=..., dim=1, optional_out=..., dim_size=..., reduce=...) at csrc/scatter_hip.cpp:42
#3  0x00007f8f0c044039 in ScatterMin::forward (ctx=ctx@entry=0x6ab3b08, src=..., index=..., dim=<optimized out>, optional_out=..., dim_size=...) at csrc/scatter_hip.cpp:175
#4  0x00007f8f0c044dca in torch::autograd::Function<ScatterMin>::apply<ScatterMin, at::Tensor&, at::Tensor&, long&, std::optional<at::Tensor>&, std::optional<long>&> ()
    at /lustre/orion/world-shared/cph161/jyc/frontier/sw/anaconda3/2022.10/envs/py38-rocm571/lib/python3.8/site-packages/torch/include/torch/csrc/autograd/custom_function.h:305
#5  0x00007f8f0c0311b5 in scatter_min (src=..., index=..., dim=<optimized out>, optional_out=..., dim_size=...) at csrc/scatter_hip.cpp:261
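
For reference, a minimal sketch of how the failing path could be isolated (the plain CPU call should go through the regular CPU dispatch, while the .cuda() variant mirrors the call that segfaults above; this is only an outline, not a verified result):

import torch
from torch_scatter import scatter_min

src = torch.tensor([[-2., 0., -1., -4., -3.], [0., -2., -1., -3., -4.]])
index = torch.tensor([[4, 5, 4, 2, 3], [0, 0, 2, 2, 1]])

# CPU dispatch: if this prints sensible values, the inputs are fine and the
# problem is narrowed down to the ROCm/HIP build of the extension.
out_cpu, argmin_cpu = scatter_min(src, index, out=src.new_zeros((2, 6)))
print(out_cpu, argmin_cpu)

# GPU dispatch: this mirrors the failing call from the snippet above.
if torch.cuda.is_available():
    out_gpu, argmin_gpu = scatter_min(src.cuda(), index.cuda(),
                                      out=src.new_zeros((2, 6)).cuda())
    print(out_gpu.cpu(), argmin_gpu.cpu())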

I appreciate any advice in advance.

@rusty1s
Owner

rusty1s commented Feb 12, 2024

@Looong01 Do you see similar issues when installing torch-scatter on ROCM? Do you know what might cause this?

@Looong01

Looong01 commented Feb 12, 2024

> I am trying to run pytorch_scatter with ROCm but keep getting a segfault. [...]

Sorry, I receive no errors when I run this code.
My devices are a Radeon RX 7900 XTX and an RX 6700 XT, and the code runs smoothly on both of them.
My environment is Python 3.10 and ROCm 6.0.2.
This is my screenshot:
[screenshot]

@Looong01

  1. What is the error shown in bash?
  2. What is the type of your GPU?
  3. Maybe you could update your ROCm version and then test it again.

P.S. I also tested it on Python 3.8 and encountered no errors.

@jychoi-hpc
Author

  1. It is just a segfault:

$ python test.py 
scatter_min: torch.Size([2, 5])
Segmentation fault (core dumped)

  2. AMD MI250X (gfx90a)
  3. Will try. Thank you for the advice.

@jychoi-hpc
Author

I have one question. I am trying to do some simple debugging. Based on the traces from the core dump, could you advise which source file I should look at and where to add some debugging output?

@Looong01

> I have one question. I am trying to do some simple debugging. Based on the traces from the core dump, could you advise which source file I should look at and where to add some debugging output?

Actually, I cannot really help debug this kind of problem, because torch_scatter consists of CUDA & C++ code.
In my experience, a core dump is the kind of error where almost anything could be the cause, but I think it is definitely due to the CUDA/C++ module.
So the only suggestions I can give you are: 1. try reinstalling a brand new OS, or 2. use Docker to get a brand new OS environment.

@jychoi-hpc
Author

Thank you for the advice. Unfortunately, I cannot install a new OS. If I find any clue, I will post it here.

@ashwinma
Contributor

ashwinma commented Feb 19, 2024

I am trying something similar with ROCm 6. I am getting errors for scatter_min and scatter_max -- but scatter_mean and scatter_sum work fine!

I installed PyTorch for ROCm 6 as below:

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.0

>>> from torch_scatter import scatter_min
>>> scatter_min(src, index)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/lustre/orion/proj-shared/ven114/ashwinaji/miniconda3/envs/pyg_rocm60/lib/python3.10/site-packages/torch_scatter/scatter.py", line 65, in scatter_min
    return torch.ops.torch_scatter.scatter_min(src, index, dim, out, dim_size)
  File "/lustre/orion/proj-shared/ven114/ashwinaji/miniconda3/envs/pyg_rocm60/lib/python3.10/site-packages/torch/_ops.py", line 825, in __call__
    return self_._op(*args, **(kwargs or {}))
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 699051.85 GiB. GPU
>>> from torch_scatter import scatter_max
>>> scatter_max(src, index)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/lustre/orion/proj-shared/ven114/ashwinaji/miniconda3/envs/pyg_rocm60/lib/python3.10/site-packages/torch_scatter/scatter.py", line 72, in scatter_max
    return torch.ops.torch_scatter.scatter_max(src, index, dim, out, dim_size)
  File "/lustre/orion/proj-shared/ven114/ashwinaji/miniconda3/envs/pyg_rocm60/lib/python3.10/site-packages/torch/_ops.py", line 825, in __call__
    return self_._op(*args, **(kwargs or {}))
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 699051.66 GiB. GPU
>>> from torch_scatter import scatter_sum
>>> out, argmin = scatter_sum(src, index)
>>> scatter_sum(src, index)
tensor([[ 0.,  0., -4., -3., -3.,  0.],
        [-2., -4., -4.,  0.,  0.,  0.]], device='cuda:0')
>>> from torch_scatter import scatter_mean
>>> scatter_mean(src, index)
tensor([[ 0.0000,  0.0000, -4.0000, -3.0000, -1.5000,  0.0000],
        [-1.0000, -4.0000, -2.0000,  0.0000,  0.0000,  0.0000]],
       device='cuda:0')
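
As a rough sanity check on that number (a back-of-the-envelope sketch, assuming float32 outputs and only restating the figure from the traceback above):

# The failed HIP allocation reported above for scatter_min.
req_bytes = 699051.85 * 1024**3           # 699051.85 GiB in bytes
elems = req_bytes / 4                     # assuming 4-byte float32 elements
print(f"{elems:.3e} elements requested")  # ~1.9e14
print(f"{2 * 6} elements needed for a (2, 6) output")

The requested size is wildly out of proportion to the tensors involved, which suggests the size being derived for the min/max path on HIP is garbage rather than a genuine memory shortage.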

@ashwinma
Contributor

I must note that this issue is not present when I use PT 2.0.1+rocm5.3. It only manifests when we try PT 2.2+ROCm 5.7 or 2.3.0.dev20240219+rocm6.0.

@Looong01

> I must note that this issue is not present when I use PT 2.0.1+rocm5.3. It only manifests when we try PT 2.2+ROCm 5.7 or 2.3.0.dev20240219+rocm6.0.

Did you try the wheels I compiled?

@ashwinma
Contributor

ashwinma commented Feb 19, 2024

> I must note that this issue is not present when I use PT 2.0.1+rocm5.3. It only manifests when we try PT 2.2+ROCm 5.7 or 2.3.0.dev20240219+rocm6.0.

> Did you try the wheels I compiled?

Yes, I did. I tried just importing torch_scatter, but it gave me the GLIBC error below:

>>> import torch
>>> from torch_scatter import scatter_min
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/lustre/orion/proj-shared/ven114/ashwinaji/miniconda3/envs/pyg_rocm57/lib/python3.10/site-packages/torch_scatter/__init__.py", line 16, in <module>
    torch.ops.load_library(spec.origin)
  File "/lustre/orion/proj-shared/ven114/ashwinaji/miniconda3/envs/pyg_rocm57/lib/python3.10/site-packages/torch/_ops.py", line 933, in load_library
    ctypes.CDLL(path)
  File "/lustre/orion/proj-shared/ven114/ashwinaji/miniconda3/envs/pyg_rocm57/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by /lustre/orion/ven114/proj-shared/ashwinaji/miniconda3/envs/pyg_rocm57/lib/python3.10/site-packages/torch_scatter/_version_cuda.so)
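
A sketch for comparing what the wheel requires against what the system libc provides, without importing the package (the import itself is what fails); it assumes objdump from binutils is available on the node:

import importlib.util, os, re, subprocess

# Locate the installed extension without running torch_scatter/__init__.py,
# since importing the package is exactly what raises the OSError above.
spec = importlib.util.find_spec("torch_scatter")
so_path = os.path.join(spec.submodule_search_locations[0], "_version_cuda.so")

# GLIBC symbol versions the prebuilt .so was linked against.
dump = subprocess.run(["objdump", "-T", so_path], capture_output=True, text=True).stdout
print("required:", sorted(set(re.findall(r"GLIBC_[0-9.]+", dump))))

# GLIBC version the system actually provides.
ldd = subprocess.run(["ldd", "--version"], capture_output=True, text=True).stdout
print("provided:", ldd.splitlines()[0])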

@Looong01

Well, you need to update your glibc version, and maybe your g++ to g++-12. You can see another discussion in this repo where I answered how to deal with this problem.

@ashwinma
Contributor

The GCC/G++ version is indeed 12

> gcc --version
gcc (GCC) 12.2.0 20220819 (HPE)

The OS is SUSE Linux Enterprise Server 15 SP4

On which OS have you built your wheels?

@Looong01

> The OS is SUSE Linux Enterprise Server 15 SP4. On which OS have you built your wheels?

Looong01/pyg-rocm-build#3

@ashwinma
Contributor

@Looong01 unfortunately, I do not have root access and do not have the privileges to upgrade the OS on this cluster. Can you suggest any alternatives?


This issue had no activity for 6 months. It will be closed in 2 weeks unless there is some new activity. Is this issue already resolved?

@github-actions github-actions bot added the stale label Aug 19, 2024
@Looong01

> This issue had no activity for 6 months. It will be closed in 2 weeks unless there is some new activity. Is this issue already resolved?

https://github.com/Looong01/pyg-rocm-build/issues/3

I think so.

@github-actions github-actions bot removed the stale label Aug 20, 2024