
[Installation]: AMD MI60 (gfx906) installation errors with ROCm 6.1 and 6.2 #774

Open
Said-Akbar opened this issue Oct 12, 2024 · 10 comments



Said-Akbar commented Oct 12, 2024

Your current environment

python env.py
Collecting environment information...
PyTorch version: 2.6.0.dev20241011+rocm6.2
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 6.2.41133-dd7f95766

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.4
Libc version: glibc-2.35

Python version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.8.0-45-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 11.5.119
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: AMD Radeon Graphics (gfx906:sramecc+:xnack-)
Nvidia driver version: 550.90.07
cuDNN version: Could not collect
HIP runtime version: 6.2.41133
MIOpen runtime version: 3.2.0
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        48 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               32
On-line CPU(s) list:                  0-31
Vendor ID:                            AuthenticAMD
Model name:                           AMD Ryzen 9 5950X 16-Core Processor
CPU family:                           25
Model:                                33
Thread(s) per core:                   2
Core(s) per socket:                   16
Socket(s):                            1
Stepping:                             0
Frequency boost:                      enabled
CPU max MHz:                          5083.3979
CPU min MHz:                          2200.0000
BogoMIPS:                             6800.12
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap
L1d cache:                            512 KiB (16 instances)
L1i cache:                            512 KiB (16 instances)
L2 cache:                             8 MiB (16 instances)
L3 cache:                             64 MiB (2 instances)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-31
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] pytorch-triton-rocm==3.1.0+cf34004b8a
[pip3] pyzmq==26.2.0
[pip3] torch==2.6.0.dev20241011+rocm6.2
[pip3] torchaudio==2.5.0.dev20241011+rocm6.2
[pip3] torchvision==0.20.0.dev20241011+rocm6.2
[pip3] transformers==4.44.1
[conda] Could not collect
ROCM Version: 6.2.41134-65d174c3e
Neuron SDK Version: N/A
Aphrodite Version: N/A
Aphrodite Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	0-31	0		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

How did you install Aphrodite?

python3 -m venv myenv && source myenv/bin/activate
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.2/
git clone https://github.com/PygmalionAI/aphrodite-engine.git
cd aphrodite-engine
pip install -U -r requirements-rocm.txt
pip install ninja==1.10.2.4 # for compatibility with the installer
# There is no documentation for this, but I had to change setup.py line 20 to default to 'rocm' instead of 'cuda'
# (see the sketch after this block):
# APHRODITE_TARGET_DEVICE = os.getenv("APHRODITE_TARGET_DEVICE", "rocm")
python3 setup.py develop
# Initially, the command above failed with an error that the Thrust library was not compatible with ROCm. It turned out
# the build was using NVIDIA's Thrust located at /usr/include/thrust. I could not find which env var was responsible,
# so I removed NVIDIA's thrust folder from /usr/include/thrust and copied AMD's thrust folder from rocm-6.2.2/include/thrust/ in its place.
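Note: since setup.py reads the target device with os.getenv("APHRODITE_TARGET_DEVICE", ...), exporting the variable before building should achieve the same thing without editing the file (untested sketch based on the getenv line quoted above):

export APHRODITE_TARGET_DEVICE=rocm
python3 setup.py develop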

I have 2x AMD MI60 and 1x RTX 3060 (for video output). I want to install aphrodite-engine to use with the two AMD GPUs. I installed ROCm and PyTorch with all the dependencies.
I spent a few hours finding out that I needed to change setup.py line 20 to APHRODITE_TARGET_DEVICE = os.getenv("APHRODITE_TARGET_DEVICE", "rocm"). After that, I struggled with the wrong Thrust library being picked up: CMake was using NVIDIA's Thrust from my NVIDIA setup. I then figured out where AMD's Thrust folder was and replaced NVIDIA's copy with AMD's.
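A less destructive way to do the same Thrust swap would be to back up NVIDIA's copy and symlink ROCm's in its place (sketch only; the ROCm include path is assumed from the /opt/rocm-6.2.2 install used above):

sudo mv /usr/include/thrust /usr/include/thrust.nvidia.bak
sudo ln -s /opt/rocm-6.2.2/include/thrust /usr/include/thrust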

At last, the engine was compiling, but at the end it failed with multiple warnings and errors. I tried both ROCm 6.1 and 6.2; both failed with the same error. The error log is around 6k lines, so I am attaching it as a txt file here:
errors6_2_w_thrust.txt
Some of the warnings and error messages from that file are shared below:

[1/21] Building CXX object CMakeFiles/_core_C.dir/kernels/core/torch_bindings.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
...
[7/21] Building HIP object CMakeFiles/_C.dir/kernels/hip_utils_kernels.hip.o
/home/saidp/Downloads/amd_llm/aphrodite-engine/build/temp.linux-x86_64-3.10/kernels/hip_utils_kernels.hip:9:5: warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]
    9 |     hipGetDevice(&device);
      |     ^~~~~~~~~~~~ ~~~~~~~
/home/saidp/Downloads/amd_llm/aphrodite-engine/build/temp.linux-x86_64-3.10/kernels/hip_utils_kernels.hip:13:3: warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]
   13 |   hipDeviceGetAttribute(&value, static_cast<hipDeviceAttribute_t>(attribute),
      |   ^~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   14 |                          device);
      |                          ~~~~~~
2 warnings generated when compiling for gfx906.
/home/saidp/Downloads/amd_llm/aphrodite-engine/build/temp.linux-x86_64-3.10/kernels/hip_utils_kernels.hip:9:5: warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]
    9 |     hipGetDevice(&device);
      |     ^~~~~~~~~~~~ ~~~~~~~
/home/saidp/Downloads/amd_llm/aphrodite-engine/build/temp.linux-x86_64-3.10/kernels/hip_utils_kernels.hip:13:3: warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]
   13 |   hipDeviceGetAttribute(&value, static_cast<hipDeviceAttribute_t>(attribute),
      |   ^~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   14 |                          device);
      |                          ~~~~~~
2 warnings generated when compiling for host.
[8/21] Building HIP object CMakeFiles/_C.dir/kernels/attention/attention_kernels.hip.o
FAILED: CMakeFiles/_C.dir/kernels/attention/attention_kernels.hip.o 
/opt/rocm-6.2.2/lib/llvm/bin/clang++  -DPy_LIMITED_API=3 -DTORCH_EXTENSION_NAME=_C -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_PROF_API=1 -DUSE_RPC -DUSE_TENSORPIPE -D_C_EXPORTS -D__HIP_PLATFORM_AMD__ -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -I/home/saidp/Downloads/amd_llm/aphrodite-engine/kernels -isystem /usr/include/python3.10 -isystem /home/saidp/Downloads/amd_llm/myenv/lib/python3.10/site-packages/torch/include -isystem /home/saidp/Downloads/amd_llm/myenv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/rocm-6.2.2/include/hiprand -O2 -g -DNDEBUG -std=gnu++20 --offload-arch=gfx906 --offload-arch=gfx906 -fPIC -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DUSE_ROCM -DENABLE_FP8 -U__HIP_NO_HALF_CONVERSIONS__ -U__HIP_NO_HALF_OPERATORS__ -fno-gpu-rdc -D_GLIBCXX_USE_CXX11_ABI=0 -DTORCH_HIP_VERSION=602 -Wno-shift-count-negative -Wno-shift-count-overflow -Wno-duplicate-decl-specifier -DCAFFE2_USE_MIOPEN -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP -std=c++17 -DHIP_NEW_TYPE_ENUMS -MD -MT CMakeFiles/_C.dir/kernels/attention/attention_kernels.hip.o -MF CMakeFiles/_C.dir/kernels/attention/attention_kernels.hip.o.d -o CMakeFiles/_C.dir/kernels/attention/attention_kernels.hip.o -x hip -c /home/saidp/Downloads/amd_llm/aphrodite-engine/build/temp.linux-x86_64-3.10/kernels/attention/attention_kernels.hip
/home/saidp/Downloads/amd_llm/aphrodite-engine/build/temp.linux-x86_64-3.10/kernels/attention/attention_kernels.hip:746:7: warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]
  746 |       LAUNCH_PAGED_ATTENTION_V1(64);
      |       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/saidp/Downloads/amd_llm/aphrodite-engine/build/temp.linux-x86_64-3.10/kernels/attention/attention_kernels.hip:676:3: note: expanded from macro 'LAUNCH_PAGED_ATTENTION_V1'
  676 |   APHRODITE_DevFuncAttribute_SET_MaxDynamicSharedMemorySize(                   \
      |   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  677 |       ((void*)aphrodite::paged_attention_v1_kernel<                            \
      |       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  678 |           T, CACHE_T, HEAD_SIZE, BLOCK_SIZE, NUM_THREADS, KV_DTYPE,            \
      |           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  679 |           IS_BLOCK_SPARSE>),                                                   \
      |           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  680 |       shared_mem_size);                                                        \
      |       ~~~~~~~~~~~~~~~~
...
In file included from /opt/rocm-6.2.2/lib/llvm/lib/clang/18/include/__clang_hip_runtime_wrapper.h:143:
/opt/rocm-6.2.2/lib/llvm/lib/clang/18/include/__clang_hip_cmath.h:400:20: error: call to '__test' is ambiguous
  400 |   typedef decltype(__test(declval<_Tp>())) type;
      |                    ^~~~~~
...
332 warnings and 1 error generated when compiling for gfx906.
[9/21] Building HIP object CMakeFiles/_C.dir/kernels/moe/align_block_size_kernel.hip.o
[10/21] Building HIP object CMakeFiles/_C.dir/kernels/quantization/squeezellm/quant_hip_kernel.hip.o
[11/21] Building HIP object CMakeFiles/_C.dir/kernels/quantization/compressed_tensors/int8_quant_kernels.hip.o
[12/21] Building HIP object CMakeFiles/_C.dir/kernels/prepare_inputs/advance_step.hip.o
...
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/home/saidp/Downloads/amd_llm/aphrodite-engine/setup.py", line 461, in <module>
    setup(
  File "/home/saidp/Downloads/amd_llm/myenv/lib/python3.10/site-packages/setuptools/__init__.py", line 153, in setup
    return distutils.core.setup(**attrs)
  File "/usr/lib/python3.10/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/usr/lib/python3.10/distutils/dist.py", line 966, in run_commands
    self.run_command(cmd)
  File "/usr/lib/python3.10/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/home/saidp/Downloads/amd_llm/myenv/lib/python3.10/site-packages/setuptools/command/develop.py", line 34, in run
    self.install_for_development()
  File "/home/saidp/Downloads/amd_llm/myenv/lib/python3.10/site-packages/setuptools/command/develop.py", line 114, in install_for_development
    self.run_command('build_ext')
  File "/usr/lib/python3.10/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/usr/lib/python3.10/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/home/saidp/Downloads/amd_llm/myenv/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 79, in run
    _build_ext.run(self)
  File "/usr/lib/python3.10/distutils/command/build_ext.py", line 340, in run
    self.build_extensions()
  File "/home/saidp/Downloads/amd_llm/aphrodite-engine/setup.py", line 223, in build_extensions
    subprocess.check_call(["cmake", *build_args], cwd=self.build_temp)
  File "/usr/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '--build', '.', '-j=32', '--target=_core_C', '--target=_moe_C', '--target=_C']' returned non-zero exit status 1.

Please let me know if this is a version-mismatch issue or a bug in the engine. Looking forward to a fix.

Thank you!

@Said-Akbar (Author):

From the logs above, it looks like the build is failing to compile paged attention:

FAILED: CMakeFiles/_C.dir/kernels/attention/attention_kernels.hip.o

@Naomiusearch (Contributor):

It's actually an issue with ROCm. There's a fix, though. Also, Aphrodite doesn't work on AMD right now anyway.

@Said-Akbar (Author):

@Naomiusearch I see.

Does your latest pull request #775 fix this issue for AMD GPUs?

Also, I see you have a fork of Aphrodite for AMD - https://github.com/Naomiusearch/aphrodite-engine/tree/amd-fix - were you able to compile the engine and run models successfully with it?

Thanks!

@Naomiusearch (Contributor):

It somewhat fixes the issue; you just have to run ./amdpatch.sh.
FP16 and GPTQ should work, but I didn't manage to run any model bigger than 70m, because profiling peak memory takes a really long time.
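(Rough sketch of trying the fork, based only on the branch and script names mentioned in this thread; exact steps not verified:)

git clone -b amd-fix https://github.com/Naomiusearch/aphrodite-engine.git
cd aphrodite-engine
./amdpatch.sh
APHRODITE_TARGET_DEVICE=rocm python3 setup.py develop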

@Said-Akbar
Copy link
Author

Thanks! What might be the reason profiling takes so long? Also, what GPUs are you using?

@Naomiusearch (Contributor):

I use a 7800 XT + 7900 XTX; no idea why profiling takes so long.

@Said-Akbar (Author):

I see. I was able to install Aphrodite thanks to your fix. However, I stumbled on another issue when loading llama3 8b fp16. About a minute after profiling started, I saw this error:
error: triton_flash_attention.py:211:0: stack frame size (164332) exceeds limit (131056) in function 'attn_fwd_0d1d2d3de45de6d7de8de9de10c11de12de13de14c15de16de17de18c19de20de21de22c23de24de25de26de27d28d29303132de'
So, for me (and possibly for you as well), the issue is Triton.
I found that someone else had a similar issue - vllm-project/vllm#4514 (comment) - and I did something similar.
In aphrodite-engine/aphrodite/attention/ops, I edited triton_flash_attn.py starting from line 211: I commented out the config blocks that use a block size of 256 and also disabled the ones with BLOCK_N=128. This lets models run at FP16, but it is very slow - I am getting around 4 t/s for llama3 8b fp16.
I think I need to compile Triton correctly. There is a ROCm fork of Triton - https://github.com/ROCm/triton - which I installed (version 2.1.0). But Aphrodite defaults to the system-installed pytorch-triton-rocm (version 3.1.0+cf34004b8a), which produces the error above (stack frame size exceeds limit). I am now trying to figure out how to correctly compile Triton and force Aphrodite to use that build. By the way, if I uninstall pytorch-triton-rocm, Aphrodite reports that pytorch-triton-rocm is missing even though I already have a compiled Triton. So Aphrodite is defaulting to pytorch-triton-rocm, which does not target our GPUs (it targets AMD MI200+). Let me know if you figure this out. Thanks!
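A rough sketch of building that fork from source inside the venv, so it shadows the wheel-installed Triton (branch name taken from the fork's setup.py path mentioned below; not verified to fix the issue on gfx906):

git clone -b triton-mlir https://github.com/ROCm/triton.git
cd triton/python
pip install cmake ninja wheel
pip install -e .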

Naomiusearch (Contributor) commented Oct 15, 2024

Works fine on my PC: llama3 8b FP16 loads in about a minute and generates about 70 t/s. gfx906 looks to be deprecated, so that might be why it doesn't work nicely with Triton. Maybe running with APHRODITE_USE_TRITON_FLASH_ATTN=0 would work better for you?
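For example (launch the engine however you normally do):

export APHRODITE_USE_TRITON_FLASH_ATTN=0
# then start the engine / load the model as usual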

@Said-Akbar (Author):

Interesting. For me, flash attention works fine but Triton has some issues. Did you also compile Triton from https://github.com/ROCm/triton/blob/triton-mlir/python/setup.py?

Also, you mentioned you could not load models bigger than 70m in Aphrodite, but llama3 8b is running fast for you. So you were able to figure out how to load 70m+ models, right?
Thanks!

Naomiusearch (Contributor) commented Oct 16, 2024

I just didn't have a model bigger than 70m to test FP16 with before (I was trying to load GPTQ earlier), so I had to download one. Also, I have pytorch-triton 3.1.0+cf34004b8a.
