Creating system with args: {'system_type': 'amdgpu'}
Traceback (most recent call last):
  File "/home/slaurenz/src/SHARK-Platform/shortfin/examples/python/enumerate_devices.py", line 32, in <module>
    main()
  File "/home/slaurenz/src/SHARK-Platform/shortfin/examples/python/enumerate_devices.py", line 26, in main
    with builder.create_system() as ls:
ValueError: iree/runtime/src/iree/base/internal/wait_handle_posix.c:37: RESOURCE_EXHAUSTED; failed to create eventfd (24)
corrupted double-linked list
Aborted (core dumped)
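The trailing (24) in the eventfd message appears to be the errno value; on Linux, errno 24 is EMFILE ("Too many open files"), which matches a per-process descriptor limit being hit. As a minimal sketch of that failure mode (not the original repro, and not shortfin code), the following lowers the soft RLIMIT_NOFILE in-process and opens descriptors until the kernel refuses. The helper name `open_until_exhausted` and the limit of 64 are illustrative assumptions, and plain files on the null device stand in for eventfds, on the assumption that both count against the same limit.

```python
import errno
import os
import resource


def open_until_exhausted(soft_limit=64):
    """Lower the soft RLIMIT_NOFILE, then open descriptors until the kernel refuses.

    Hypothetical helper for illustration only. Returns (errno, descriptors_opened).
    """
    old_soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # The soft limit may never exceed the hard limit.
    if hard != resource.RLIM_INFINITY:
        soft_limit = min(soft_limit, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard))

    fds = []
    err = None
    try:
        while True:
            # Each open consumes one descriptor, as an eventfd would.
            fds.append(os.open(os.devnull, os.O_RDONLY))
    except OSError as exc:
        err = exc.errno
    finally:
        # Release everything and restore the original soft limit.
        for fd in fds:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, hard))
    return err, len(fds)


if __name__ == "__main__":
    err, count = open_until_exhausted(64)
    print(f"stopped after {count} descriptors: errno={err} ({errno.errorcode.get(err)})")
```

The number of descriptors opened before failure is lower than the limit because stdin, stdout, stderr and anything else the process already holds count against it.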
I've seen different stack traces. Here is one that happens on exit after this event:
(gdb) bt
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737350242304) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=6, threadid=140737350242304) at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=140737350242304, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3 0x00007ffff7c8c476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4 0x00007ffff7c727f3 in __GI_abort () at ./stdlib/abort.c:79
#5 0x00007ffff7cd3676 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff7e25b77 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#6 0x00007ffff7ceacfc in malloc_printerr (str=str@entry=0x7ffff7e2370e "corrupted double-linked list") at ./malloc/malloc.c:5664
#7 0x00007ffff7ceb7cc in unlink_chunk (p=<optimized out>, av=0x7ffff7e64c80 <main_arena>) at ./malloc/malloc.c:1635
#8 0x00007ffff7ceb969 in malloc_consolidate (av=av@entry=0x7ffff7e64c80 <main_arena>) at ./malloc/malloc.c:4780
#9 0x00007ffff7cecea0 in _int_free (av=0x7ffff7e64c80 <main_arena>, p=0x555556800200, have_lock=<optimized out>) at ./malloc/malloc.c:4674
#10 0x00007ffff7cef453 in __GI___libc_free (mem=<optimized out>) at ./malloc/malloc.c:3391
#11 0x00007fff4adb401d in __gnu_cxx::new_allocator<_HaCacheProperties>::deallocate (__t=<optimized out>, __p=<optimized out>, this=0x555555f3bab8) at /usr/include/c++/11/ext/new_allocator.h:132
#12 std::allocator_traits<std::allocator<_HaCacheProperties> >::deallocate (__n=<optimized out>, __p=<optimized out>, __a=...) at /usr/include/c++/11/bits/alloc_traits.h:496
#13 std::_Vector_base<_HaCacheProperties, std::allocator<_HaCacheProperties> >::_M_deallocate (__n=<optimized out>, __p=<optimized out>, this=0x555555f3bab8) at /usr/include/c++/11/bits/stl_vector.h:354
#14 std::_Vector_base<_HaCacheProperties, std::allocator<_HaCacheProperties> >::~_Vector_base (this=0x555555f3bab8, __in_chrg=<optimized out>) at /usr/include/c++/11/bits/stl_vector.h:335
#15 std::vector<_HaCacheProperties, std::allocator<_HaCacheProperties> >::~vector (this=0x555555f3bab8, __in_chrg=<optimized out>) at /usr/include/c++/11/bits/stl_vector.h:683
#16 rocr::AMD::CpuAgent::~CpuAgent (this=0x555555f3b920, __in_chrg=<optimized out>) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/amd_cpu_agent.cpp:66
#17 0x00007fff4adb404d in rocr::AMD::CpuAgent::~CpuAgent (this=0x555555f3b920, __in_chrg=<optimized out>) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/amd_cpu_agent.cpp:66
#18 0x00007fff4adf2efe in rocr::DeleteObject::operator()<rocr::core::Agent> (ptr=<optimized out>, this=<synthetic pointer>) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/util/utils.h:219
#19 std::for_each<__gnu_cxx::__normal_iterator<rocr::core::Agent**, std::vector<rocr::core::Agent*, std::allocator<rocr::core::Agent*> > >, rocr::DeleteObject> (__f=..., __last=..., __first=0x555555f3b920) at /usr/include/c++/11/bits/stl_algo.h:3820
#20 rocr::core::Runtime::DestroyAgents (this=this@entry=0x5555561fdeb0) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/runtime.cpp:255
#21 0x00007fff4adf3ec5 in rocr::core::Runtime::Unload (this=0x5555561fdeb0) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/runtime.cpp:1976
#22 0x00007fff4adf53dc in rocr::core::Runtime::Release () at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/runtime.cpp:160
#23 0x00007fff4add32f7 in rocr::HSA::hsa_shut_down () at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/hsa.cpp:213
#24 0x00007fff5411929a in amd::Runtime::tearDown () at /long_pathname_so_that_rpms_can_package_the_debug_info/src/external/clr/rocclr/platform/runtime.cpp:96
#25 amd::Runtime::tearDown () at /long_pathname_so_that_rpms_can_package_the_debug_info/src/external/clr/rocclr/platform/runtime.cpp:89
#26 0x00007fff541193be in amd::RuntimeTearDown::~RuntimeTearDown (this=<optimized out>, __in_chrg=<optimized out>) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/external/clr/rocclr/platform/runtime.cpp:119
#27 amd::RuntimeTearDown::~RuntimeTearDown (this=<optimized out>, __in_chrg=<optimized out>) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/external/clr/rocclr/platform/runtime.cpp:111
#28 0x00007ffff7c8f495 in __run_exit_handlers (status=1, listp=0x7ffff7e64838 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at ./stdlib/exit.c:113
#29 0x00007ffff7c8f610 in __GI_exit (status=<optimized out>) at ./stdlib/exit.c:143
#30 0x00007ffff7c73d97 in __libc_start_call_main (main=main@entry=0x55555577e6d0, argc=argc@entry=3, argv=argv@entry=0x7fffffffe028) at ../sysdeps/nptl/libc_start_call_main.h:74
#31 0x00007ffff7c73e40 in __libc_start_main_impl (main=0x55555577e6d0, argc=3, argv=0x7fffffffe028, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe018) at ../csu/libc-start.c:392
#32 0x000055555577e605 in _start ()
Probably need to get ASAN on it and try to find the root cause.
Without this, on very large systems (i.e. 64 GPU / 192 Core), it was not possible to open all devices without manual tweaks to file descriptor limits. The result was various forms of RESOURCE_EXHAUSTED errors. This may require more tweaking in the future, and for fully robust setups, production installations should explicitly configure high limits. However, these heuristics remove a significant barrier to entry and provide some feedback in terms of logs.
Progress on #463
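For readers who want to experiment from the Python side, a heuristic of this general kind can be sketched with the standard `resource` module: raise the soft RLIMIT_NOFILE toward the hard limit and log what happened. This is only an illustration under stated assumptions, not the change made in the referenced commit; the function name `raise_nofile_limit` and the target of 65536 are hypothetical.

```python
import resource


def raise_nofile_limit(target=65536):
    """Try to raise the soft RLIMIT_NOFILE to `target`, capped at the hard limit.

    Hypothetical helper for illustration only. Returns the soft limit in effect
    afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # An unprivileged process can raise its soft limit only up to the hard limit.
    wanted = target if hard == resource.RLIM_INFINITY else min(target, hard)

    if soft != resource.RLIM_INFINITY and soft < wanted:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
            print(f"raised open file limit from {soft} to {wanted}")
        except (ValueError, OSError) as exc:
            # Leave the limit as it was and report, rather than failing startup.
            print(f"could not raise open file limit from {soft}: {exc}")

    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    print("soft limit now:", raise_nofile_limit())
```

A helper like this cannot exceed the hard limit, so it does not replace explicitly configured limits on production installations.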
…ux. (#465)
On a 64-device MI300X CPX system with the default file handle limit of 1024 (verify with ulimit -n), this can be induced after creating 30 devices. Repro:
On a more modest system, this can be reproduced by lowering the ulimit. Example on the system tested:
Error displayed: