[amdgpu] If exceeding ulimit file handle limit while constructing HIP devices, memory corruption and process crash. #463

stellaraccident · 2024-11-09T01:17:03Z

On a 64-device MI300X CPX system with the default file handle limit of 1024 (verify with ulimit -n), this can be induced after creating 30 devices. Repro:

ROCR_VISIBLE_DEVICES=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30 python examples/python/enumerate_devices.py system_type=amdgpu

On a more modest system, this can be repro'd by tweaking the ulimit. Example on the system tested:

ulimit -n 128
ROCR_VISIBLE_DEVICES=1,2 python examples/python/enumerate_devices.py system_type=amdgpu
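
For scripted repros, the same low-limit setup can be applied from Python instead of the shell. This is a minimal sketch, assuming Linux and the standard `resource` and `subprocess` modules; the helper name `run_with_nofile_limit` is hypothetical and not part of shortfin. Like `ulimit -n` in bash, it lowers both the soft and hard limit, and only in the child process.

```python
import resource
import subprocess
import sys


def run_with_nofile_limit(argv, limit, env=None):
    """Run argv in a child whose RLIMIT_NOFILE is capped at `limit`.

    Hypothetical helper for illustration; the parent's limits are untouched.
    """

    def _lower_limit():
        # Runs in the child between fork() and exec().
        _, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        capped = limit if hard == resource.RLIM_INFINITY else min(limit, hard)
        resource.setrlimit(resource.RLIMIT_NOFILE, (capped, capped))

    return subprocess.run(
        argv, env=env, preexec_fn=_lower_limit, capture_output=True, text=True
    )


# Example invocation matching the repro above (requires the amdgpu system):
#   import os
#   env = {**os.environ, "ROCR_VISIBLE_DEVICES": "1,2"}
#   result = run_with_nofile_limit(
#       [sys.executable, "examples/python/enumerate_devices.py",
#        "system_type=amdgpu"], 128, env=env)
#   print(result.returncode, result.stderr)
```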

Error displayed:

Creating system with args: {'system_type': 'amdgpu'}
Traceback (most recent call last):
  File "/home/slaurenz/src/SHARK-Platform/shortfin/examples/python/enumerate_devices.py", line 32, in <module>
    main()
  File "/home/slaurenz/src/SHARK-Platform/shortfin/examples/python/enumerate_devices.py", line 26, in main
    with builder.create_system() as ls:
ValueError: iree/runtime/src/iree/base/internal/wait_handle_posix.c:37: RESOURCE_EXHAUSTED; failed to create eventfd (24)
corrupted double-linked list
Aborted (core dumped)
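
For context, the (24) in the eventfd failure is an errno value, and on Linux errno 24 is EMFILE ("Too many open files"). The sketch below shows how any descriptor-creating call fails that way once the soft limit is reached. It is independent of IREE and HIP: the function name `exhaust_fds` is hypothetical, and it uses plain `os.open` on /dev/null rather than eventfd.

```python
import errno
import os
import resource


def exhaust_fds(limit=64):
    """Open descriptors until the kernel refuses, then return the errno.

    Temporarily lowers the soft RLIMIT_NOFILE and restores it afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        limit = min(limit, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (limit, hard))
    held = []
    try:
        while True:
            held.append(os.open(os.devnull, os.O_RDONLY))
    except OSError as exc:
        return exc.errno
    finally:
        # Release everything we opened and put the original soft limit back.
        for fd in held:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))


if __name__ == "__main__":
    code = exhaust_fds()
    print(code, errno.errorcode.get(code), os.strerror(code))
```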

I've seen different stack traces. Here is one that happens on exit after this event:

(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737350242304) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140737350242304) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140737350242304, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff7c8c476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff7c727f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffff7cd3676 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff7e25b77 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#6  0x00007ffff7ceacfc in malloc_printerr (str=str@entry=0x7ffff7e2370e "corrupted double-linked list") at ./malloc/malloc.c:5664
#7  0x00007ffff7ceb7cc in unlink_chunk (p=<optimized out>, av=0x7ffff7e64c80 <main_arena>) at ./malloc/malloc.c:1635
#8  0x00007ffff7ceb969 in malloc_consolidate (av=av@entry=0x7ffff7e64c80 <main_arena>) at ./malloc/malloc.c:4780
#9  0x00007ffff7cecea0 in _int_free (av=0x7ffff7e64c80 <main_arena>, p=0x555556800200, have_lock=<optimized out>) at ./malloc/malloc.c:4674
#10 0x00007ffff7cef453 in __GI___libc_free (mem=<optimized out>) at ./malloc/malloc.c:3391
#11 0x00007fff4adb401d in __gnu_cxx::new_allocator<_HaCacheProperties>::deallocate (__t=<optimized out>, __p=<optimized out>, this=0x555555f3bab8) at /usr/include/c++/11/ext/new_allocator.h:132
#12 std::allocator_traits<std::allocator<_HaCacheProperties> >::deallocate (__n=<optimized out>, __p=<optimized out>, __a=...) at /usr/include/c++/11/bits/alloc_traits.h:496
#13 std::_Vector_base<_HaCacheProperties, std::allocator<_HaCacheProperties> >::_M_deallocate (__n=<optimized out>, __p=<optimized out>, this=0x555555f3bab8) at /usr/include/c++/11/bits/stl_vector.h:354
#14 std::_Vector_base<_HaCacheProperties, std::allocator<_HaCacheProperties> >::~_Vector_base (this=0x555555f3bab8, __in_chrg=<optimized out>) at /usr/include/c++/11/bits/stl_vector.h:335
#15 std::vector<_HaCacheProperties, std::allocator<_HaCacheProperties> >::~vector (this=0x555555f3bab8, __in_chrg=<optimized out>) at /usr/include/c++/11/bits/stl_vector.h:683
#16 rocr::AMD::CpuAgent::~CpuAgent (this=0x555555f3b920, __in_chrg=<optimized out>) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/amd_cpu_agent.cpp:66
#17 0x00007fff4adb404d in rocr::AMD::CpuAgent::~CpuAgent (this=0x555555f3b920, __in_chrg=<optimized out>) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/amd_cpu_agent.cpp:66
#18 0x00007fff4adf2efe in rocr::DeleteObject::operator()<rocr::core::Agent> (ptr=<optimized out>, this=<synthetic pointer>) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/util/utils.h:219
#19 std::for_each<__gnu_cxx::__normal_iterator<rocr::core::Agent**, std::vector<rocr::core::Agent*, std::allocator<rocr::core::Agent*> > >, rocr::DeleteObject> (__f=..., __last=..., __first=0x555555f3b920) at /usr/include/c++/11/bits/stl_algo.h:3820
#20 rocr::core::Runtime::DestroyAgents (this=this@entry=0x5555561fdeb0) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/runtime.cpp:255
#21 0x00007fff4adf3ec5 in rocr::core::Runtime::Unload (this=0x5555561fdeb0) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/runtime.cpp:1976
#22 0x00007fff4adf53dc in rocr::core::Runtime::Release () at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/runtime.cpp:160
#23 0x00007fff4add32f7 in rocr::HSA::hsa_shut_down () at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/hsa.cpp:213
#24 0x00007fff5411929a in amd::Runtime::tearDown () at /long_pathname_so_that_rpms_can_package_the_debug_info/src/external/clr/rocclr/platform/runtime.cpp:96
#25 amd::Runtime::tearDown () at /long_pathname_so_that_rpms_can_package_the_debug_info/src/external/clr/rocclr/platform/runtime.cpp:89
#26 0x00007fff541193be in amd::RuntimeTearDown::~RuntimeTearDown (this=<optimized out>, __in_chrg=<optimized out>) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/external/clr/rocclr/platform/runtime.cpp:119
#27 amd::RuntimeTearDown::~RuntimeTearDown (this=<optimized out>, __in_chrg=<optimized out>) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/external/clr/rocclr/platform/runtime.cpp:111
#28 0x00007ffff7c8f495 in __run_exit_handlers (status=1, listp=0x7ffff7e64838 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at ./stdlib/exit.c:113
#29 0x00007ffff7c8f610 in __GI_exit (status=<optimized out>) at ./stdlib/exit.c:143
#30 0x00007ffff7c73d97 in __libc_start_call_main (main=main@entry=0x55555577e6d0, argc=argc@entry=3, argv=argv@entry=0x7fffffffe028) at ../sysdeps/nptl/libc_start_call_main.h:74
#31 0x00007ffff7c73e40 in __libc_start_main_impl (main=0x55555577e6d0, argc=3, argv=0x7fffffffe028, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe018) at ../csu/libc-start.c:392
#32 0x000055555577e605 in _start ()

Probably need to get ASAN on it and try to find the root cause.


Without this, on very large systems (i.e. 64 GPU / 192 Core), it was not possible to open all devices without manual tweaks to file descriptor limits. The results were various forms of RESOURCE_EXHAUSTED errors. This may require more tweaking in the future, and for fully robust setups, production installations should explicitly configure high limits. However, these heuristics remove a significant barrier to entry and provide some feedback in terms of logs. Progress on #463
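
To illustrate the general idea of such a heuristic, here is a minimal Python sketch that raises the soft file descriptor limit toward the hard limit and logs what it did. It is not the shortfin implementation from #465: the function name `raise_nofile_limit` and the default target of 65536 are assumptions chosen for illustration only.

```python
import resource
import sys


def raise_nofile_limit(target=65536):
    """Raise the soft RLIMIT_NOFILE up to min(target, hard); never lower it.

    Returns the soft limit in effect after the attempt. The target value is
    an arbitrary illustrative default, not a value taken from shortfin.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft  # Already unlimited; nothing to do.
    wanted = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if wanted <= soft:
        return soft  # Current limit already meets the target.
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    except (ValueError, OSError) as exc:
        print(f"could not raise file descriptor limit: {exc}", file=sys.stderr)
        return soft
    print(f"raised file descriptor limit {soft} -> {wanted}", file=sys.stderr)
    return wanted


if __name__ == "__main__":
    raise_nofile_limit()
```

A process can only raise its soft limit as far as the hard limit without extra privileges, which is why production installations still need the hard limit configured explicitly.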


stellaraccident assigned AWoloszyn Nov 9, 2024

stellaraccident mentioned this issue Nov 9, 2024

[shortfin] Add heuristics for adjusting file descriptor limits on Linux. #465

Merged
