[amdgpu] If exceeding ulimit file handle limit while constructing HIP devices, memory corruption and process crash. #463

Open
stellaraccident opened this issue Nov 9, 2024 · 0 comments

On a 64-device MI300X CPX system with the default file handle limit of 1024 (verify with ulimit -n), this can be induced after creating 30 devices. Repro:

ROCR_VISIBLE_DEVICES=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30 python examples/python/enumerate_devices.py system_type=amdgpu
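
To see how close a process is to the limit while devices are being created, one option is to count the entries in /proc/self/fd. This is a minimal, Linux-only sketch; `count_open_fds` is a hypothetical helper, not part of shortfin or IREE.

```python
import os


def count_open_fds() -> int:
    """Return the number of file descriptors open in this process (Linux only)."""
    # Each entry in /proc/self/fd is one open descriptor. Listing the directory
    # briefly opens a descriptor of its own, which is included in the count.
    return len(os.listdir("/proc/self/fd"))


if __name__ == "__main__":
    before = count_open_fds()
    with open(os.devnull) as f:
        during = count_open_fds()
    print(f"open fds before: {before}, with one extra file open: {during}")
```

Calling the helper before and after system creation and comparing against ulimit -n gives a rough per-device descriptor cost.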

On a more modest system, this can be repro'd by lowering the ulimit. Example on the system tested:

ulimit -n 128
ROCR_VISIBLE_DEVICES=1,2 python examples/python/enumerate_devices.py system_type=amdgpu
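
The same kind of descriptor exhaustion can be triggered without any GPU by lowering the soft limit in-process. The sketch below is an illustration only: `exhaust_fds` is a hypothetical helper, and it opens /dev/null rather than creating eventfds, so it exercises the limit itself and not the amdgpu code path.

```python
import errno
import os
import resource


def exhaust_fds(soft_limit: int) -> int:
    """Lower the soft RLIMIT_NOFILE, open descriptors until EMFILE, return the count opened."""
    old_soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Equivalent in spirit to `ulimit -n <soft_limit>`, but only for this process.
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard))
    opened = []
    try:
        while True:
            try:
                opened.append(os.open(os.devnull, os.O_RDONLY))
            except OSError as e:
                if e.errno != errno.EMFILE:
                    raise
                # EMFILE: the per-process descriptor limit has been reached.
                return len(opened)
    finally:
        # Close everything we opened and restore the original soft limit.
        for fd in opened:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, hard))


if __name__ == "__main__":
    print("descriptors opened before EMFILE:", exhaust_fds(128))
```

The returned count is lower than the limit by however many descriptors the process already had open (stdin, stdout, stderr, and so on).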

Error displayed:

Creating system with args: {'system_type': 'amdgpu'}
Traceback (most recent call last):
  File "/home/slaurenz/src/SHARK-Platform/shortfin/examples/python/enumerate_devices.py", line 32, in <module>
    main()
  File "/home/slaurenz/src/SHARK-Platform/shortfin/examples/python/enumerate_devices.py", line 26, in main
    with builder.create_system() as ls:
ValueError: iree/runtime/src/iree/base/internal/wait_handle_posix.c:37: RESOURCE_EXHAUSTED; failed to create eventfd (24)
corrupted double-linked list
Aborted (core dumped)
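
Assuming the number in parentheses in "failed to create eventfd (24)" is the raw errno value, it can be decoded with the standard library. On Linux, errno 24 is EMFILE, the per-process open file limit, which matches the ulimit explanation.

```python
import errno
import os

# Assumption: the "(24)" in the IREE status message is the errno from eventfd().
code = 24

# errno.errorcode maps the number to its symbolic name; os.strerror gives the
# human-readable message for the current platform.
name = errno.errorcode[code]
print(f"{code} = {name}: {os.strerror(code)}")
```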

I've seen different stack traces. Here is one that happens on exit after this event:

(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737350242304) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140737350242304) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140737350242304, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff7c8c476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff7c727f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffff7cd3676 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff7e25b77 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#6  0x00007ffff7ceacfc in malloc_printerr (str=str@entry=0x7ffff7e2370e "corrupted double-linked list") at ./malloc/malloc.c:5664
#7  0x00007ffff7ceb7cc in unlink_chunk (p=<optimized out>, av=0x7ffff7e64c80 <main_arena>) at ./malloc/malloc.c:1635
#8  0x00007ffff7ceb969 in malloc_consolidate (av=av@entry=0x7ffff7e64c80 <main_arena>) at ./malloc/malloc.c:4780
#9  0x00007ffff7cecea0 in _int_free (av=0x7ffff7e64c80 <main_arena>, p=0x555556800200, have_lock=<optimized out>) at ./malloc/malloc.c:4674
#10 0x00007ffff7cef453 in __GI___libc_free (mem=<optimized out>) at ./malloc/malloc.c:3391
#11 0x00007fff4adb401d in __gnu_cxx::new_allocator<_HaCacheProperties>::deallocate (__t=<optimized out>, __p=<optimized out>, this=0x555555f3bab8) at /usr/include/c++/11/ext/new_allocator.h:132
#12 std::allocator_traits<std::allocator<_HaCacheProperties> >::deallocate (__n=<optimized out>, __p=<optimized out>, __a=...) at /usr/include/c++/11/bits/alloc_traits.h:496
#13 std::_Vector_base<_HaCacheProperties, std::allocator<_HaCacheProperties> >::_M_deallocate (__n=<optimized out>, __p=<optimized out>, this=0x555555f3bab8) at /usr/include/c++/11/bits/stl_vector.h:354
#14 std::_Vector_base<_HaCacheProperties, std::allocator<_HaCacheProperties> >::~_Vector_base (this=0x555555f3bab8, __in_chrg=<optimized out>) at /usr/include/c++/11/bits/stl_vector.h:335
#15 std::vector<_HaCacheProperties, std::allocator<_HaCacheProperties> >::~vector (this=0x555555f3bab8, __in_chrg=<optimized out>) at /usr/include/c++/11/bits/stl_vector.h:683
#16 rocr::AMD::CpuAgent::~CpuAgent (this=0x555555f3b920, __in_chrg=<optimized out>) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/amd_cpu_agent.cpp:66
#17 0x00007fff4adb404d in rocr::AMD::CpuAgent::~CpuAgent (this=0x555555f3b920, __in_chrg=<optimized out>) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/amd_cpu_agent.cpp:66
#18 0x00007fff4adf2efe in rocr::DeleteObject::operator()<rocr::core::Agent> (ptr=<optimized out>, this=<synthetic pointer>) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/util/utils.h:219
#19 std::for_each<__gnu_cxx::__normal_iterator<rocr::core::Agent**, std::vector<rocr::core::Agent*, std::allocator<rocr::core::Agent*> > >, rocr::DeleteObject> (__f=..., __last=..., __first=0x555555f3b920) at /usr/include/c++/11/bits/stl_algo.h:3820
#20 rocr::core::Runtime::DestroyAgents (this=this@entry=0x5555561fdeb0) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/runtime.cpp:255
#21 0x00007fff4adf3ec5 in rocr::core::Runtime::Unload (this=0x5555561fdeb0) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/runtime.cpp:1976
#22 0x00007fff4adf53dc in rocr::core::Runtime::Release () at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/runtime.cpp:160
#23 0x00007fff4add32f7 in rocr::HSA::hsa_shut_down () at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/hsa.cpp:213
#24 0x00007fff5411929a in amd::Runtime::tearDown () at /long_pathname_so_that_rpms_can_package_the_debug_info/src/external/clr/rocclr/platform/runtime.cpp:96
#25 amd::Runtime::tearDown () at /long_pathname_so_that_rpms_can_package_the_debug_info/src/external/clr/rocclr/platform/runtime.cpp:89
#26 0x00007fff541193be in amd::RuntimeTearDown::~RuntimeTearDown (this=<optimized out>, __in_chrg=<optimized out>) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/external/clr/rocclr/platform/runtime.cpp:119
#27 amd::RuntimeTearDown::~RuntimeTearDown (this=<optimized out>, __in_chrg=<optimized out>) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/external/clr/rocclr/platform/runtime.cpp:111
#28 0x00007ffff7c8f495 in __run_exit_handlers (status=1, listp=0x7ffff7e64838 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at ./stdlib/exit.c:113
#29 0x00007ffff7c8f610 in __GI_exit (status=<optimized out>) at ./stdlib/exit.c:143
#30 0x00007ffff7c73d97 in __libc_start_call_main (main=main@entry=0x55555577e6d0, argc=argc@entry=3, argv=argv@entry=0x7fffffffe028) at ../sysdeps/nptl/libc_start_call_main.h:74
#31 0x00007ffff7c73e40 in __libc_start_main_impl (main=0x55555577e6d0, argc=3, argv=0x7fffffffe028, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe018) at ../csu/libc-start.c:392
#32 0x000055555577e605 in _start ()

Probably need to get ASAN on it and try to find the root cause.

stellaraccident added a commit that referenced this issue Nov 9, 2024
Without this, on very large systems (i.e. 64 GPU / 192 Core), it was not possible to open all devices without manual tweaks to file handle descriptor limits. The results were various forms of RESOURCE_EXHAUSTED errors. This may require more tweaking in the future, and for fully robust setups, production installations should explicitly configure high limits. However, these heuristics remove a significant barrier to entry and provide some feedback in terms of logs.

Progress on #463
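
For reference, the general technique of raising the soft file descriptor limit toward the hard limit can be sketched as below. This is not the code from the commit; `raise_nofile_soft_limit` and its `target` parameter are hypothetical, and the heuristics actually used may differ.

```python
import resource


def raise_nofile_soft_limit(target: int) -> int:
    """Raise the soft RLIMIT_NOFILE to `target` (capped at the hard limit); return the soft limit in effect."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft  # Already unlimited; nothing to do.
    if hard != resource.RLIM_INFINITY:
        # An unprivileged process cannot raise the soft limit above the hard limit.
        target = min(target, hard)
    if soft >= target:
        return soft  # Never lower an existing limit.
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return target


if __name__ == "__main__":
    # 4096 is an arbitrary illustrative value, not one taken from the commit.
    print("soft RLIMIT_NOFILE now:", raise_nofile_soft_limit(4096))
```

This only helps up to the hard limit, which is why production installations should still configure high limits explicitly.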
stellaraccident added a commit that referenced this issue Nov 9, 2024
…ux. (#465)
