Process not responsive dump indicates garbage collection #110350
Comments
Tagging subscribers to this area: @dotnet/gc |
Does this happen during startup? This feels similar to #105780. Are you able to try disabling the new GC mode with DOTNET_GCDynamicAdaptationMode=0? The fix for this issue should be included in the Jan servicing release for 9.0. |
It is not during startup. Usually the processes run for days before this happens. I have added this to the csproj of the exe of the process where we have seen this most often
|
If it's not during startup it could be a different issue. If it has been reproing frequently then yeah disabling DOTNET_GCDynamicAdaptationMode would be worth a try. If you are able to share a dump privately that would help in confirming if it's the same issue. |
I hope you got an e-mail with a link to the dump file |
Thanks for sharing the dump. This isn't related to the DATAS issue I pointed to earlier. In fact the app is using WKS GC. The dump shows something similar to #107800, but in this case it's a thread shutdown racing with GC trying to create a BGC thread, so it looks to be a deadlock between DetachThread + CreateThread? @VSadov @kouvel @jkotas have you seen something similar before?
|
I believe that this is one of the reasons why native AOT uses |
But from the stacks it looks like both threads wait on CLR events/locks. So maybe this is not related to OS lock? |
Here is the stack of the BGC Thread. It's started but doesn't get to
|
Ah, I see - CreateNonSuspendableThread waits on an event that will be set by the spawned thread. It just needs to see progress from the spawned thread to declare that thread creation was successful. Thus it is indeed possible that the spawned thread creation/progress needs the same loader lock that the thread that is shutting down is holding. |
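To illustrate the handshake described in the comment above, here is a minimal Python sketch. It is not runtime code: `loader_lock`, `started`, and `create_thread_and_wait` are hypothetical stand-ins for the OS loader lock, the event the creator waits on, and the creation helper. A timeout is used so the stall can be observed instead of hanging forever.

```python
import threading

# Hypothetical stand-ins (assumptions, not CoreCLR names):
# `loader_lock` models the OS loader lock, `started` models the event
# that the creating thread waits on.
loader_lock = threading.Lock()
started = threading.Event()

def spawned_thread():
    # Thread startup needs the loader lock before it can signal progress.
    with loader_lock:
        started.set()

def create_thread_and_wait(timeout):
    """Spawn a thread and wait for its 'started' signal."""
    started.clear()
    t = threading.Thread(target=spawned_thread)
    t.start()
    ok = started.wait(timeout)  # the real wait has no timeout
    return ok, t

# Case 1: another thread (here the main thread) holds the loader lock,
# so the spawned thread cannot make progress and the wait times out.
loader_lock.acquire()
stalled_ok, worker = create_thread_and_wait(timeout=0.2)
loader_lock.release()  # break the stall so the worker can finish
worker.join()

# Case 2: nobody holds the loader lock, so the handshake completes.
free_ok, worker = create_thread_and_wait(timeout=2.0)
worker.join()

print(stalled_ok, free_ok)
```

In the reported hang there is no timeout, and the lock holder is itself waiting for the GC that is trying to create the thread, so neither side can proceed.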
|
@tornie2, as a temporary workaround you can disable background GC to ensure that avoids the issue. |
This method is new in .NET 9 (I have introduced it in https://github.com/dotnet/runtime/pull/103877/files#diff-f5835c4b5fd134e52b4127bb4ffb7e5ad439673a429dc7ea46d53e7a5bca0529R2734). There was code on thread shutdown path that switched to cooperative mode before .NET 9 as well, so the fundamental problem is not new.
This is a classic A-B B-A deadlock. The two locks in question are our thread store lock that's taken when the runtime is suspended for the GC, and the Windows OS loader lock that's taken when threads are created/destroyed by the Windows OS. We would either need to get these locks ordered (that's pretty hard) or avoid taking them in conflicting order. The FiberDetachCallback that's used for thread shutdown notifications in native AOT does the latter. |
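The A-B B-A pattern named above can be sketched in a few lines of Python. This is a simplified model under stated assumptions: `thread_store_lock` and `loader_lock` are plain locks standing in for the two locks from the comment, and the two workers stand in for the GC thread and the exiting thread. Timeouts and barriers are used only so the example terminates and is deterministic.

```python
import threading

# Hypothetical stand-ins for the two locks named in the comment above.
thread_store_lock = threading.Lock()
loader_lock = threading.Lock()

both_hold_first = threading.Barrier(2)
both_tried_second = threading.Barrier(2)
results = {}

def worker(name, first, second):
    with first:
        both_hold_first.wait()             # each thread now owns its first lock
        got = second.acquire(timeout=0.2)  # a real deadlock would block here forever
        results[name] = got
        if got:
            second.release()
        both_tried_second.wait()           # keep `first` held until both attempts end

# "gc" takes thread store lock then loader lock; "exit" takes them in reverse.
gc_thread = threading.Thread(
    target=worker, args=("gc", thread_store_lock, loader_lock))
exiting_thread = threading.Thread(
    target=worker, args=("exit", loader_lock, thread_store_lock))
gc_thread.start()
exiting_thread.start()
gc_thread.join()
exiting_thread.join()

print(results)
```

Neither thread obtains its second lock, because each one is held by the other for the whole attempt. Acquiring both locks in one agreed order, or never holding one while waiting for the other, removes the cycle.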
Tagging subscribers to this area: @mangod9 |
We are doing more work in cooperative mode during thread shutdown in .NET 9 that makes the deadlock more likely to be hit. I think that this issue is a .NET 9 servicing candidate. @VSadov Is this something that you can take a look at? |
Another way to look at it - When a thread terminates, it needs to make its TLAB (Thread-Local Allocation Buffer) parseable, to not leave a portion of the GC heap in an unparseable state. That is done in the FixAllocContext call and cannot be done while GC is in progress, so at least that part of thread termination needs COOP mode. It would be hard to get around. At the same time, if the terminating thread waits for GC to complete (while holding the loader lock) and GC needs to launch a thread to make progress, there could be a deadlock. This is indeed not new at all. I wonder a bit why we did not see this earlier. Perhaps threads terminating while GC is in progress was not common for some reason. |
Yes, I can take a look. |
Thanks @VSadov for thinking about a fix.
yeah certainly would be good to fix in the next servicing release. |
I have added this to the project where we have seen the problem the most
|
This has been closed |
It's been fixed in main and will be backported to 9. It should be available in the Feb servicing release. |
@tornie2 since you are able to repro this frequently, would you be able to try a private build to ensure the fix resolves the issue for you? Thx |
Can you deliver the build as an SDK? This is in our pipeline. It takes the SDK from our local JFrog Artifactory
|
The fix was reverted #110801 since it broke WinForms tests |
We don't have a dump thus are not completely sure but two of our micro-services which experience high load seem to be suffering from this under debian based images, we had to rollback to p.s. to get a mem-dump, it would have to be done somehow in ECS (aws) maybe based on a script running in parallel and doing the same health-check calls?.. this part is also not clear at the moment. p.s.s. this comment is basically a |
Description
After upgrading to .NET 9, random processes freeze and become completely unresponsive
The processes run as Windows services on a Windows VM
I have a dump file, which I could send to you
I would just rather not make that public as it probably has passwords within
Analyzing the dump indicates a possible problem in the garbage collector
0:000> !analyze -v
KEY_VALUES_STRING: 1
FILE_IN_CAB: SmfHaircuts.Service-2024-12-03-YB6213.DMP
NTGLOBALFLAG: 0
APPLICATION_VERIFIER_FLAGS: 0
EXCEPTION_RECORD: (.exr -1)
ExceptionAddress: 0000000000000000
ExceptionCode: 80000003 (Break instruction exception)
ExceptionFlags: 00000000
NumberParameters: 0
FAULTING_THREAD: 00000f6c
PROCESS_NAME: SmfHaircuts.Service.dll
ERROR_CODE: (NTSTATUS) 0x80000003 - {EXCEPTION} Breakpoint A breakpoint has been reached.
EXCEPTION_CODE_STR: 80000003
STACK_TEXT:
0000008c61d7e028 00007ffcbdba0f33 : 0000000000000000 000002be90a8dab0 000002be90a8d9f0 0000008c61d7e170 : ntdll!NtWaitForSingleObject+0x14
0000008c61d7e030 00007ffb364f1c30 : 0000000000000000 00004612d1730f35 0000000000000000 0000000000000284 : KERNELBASE!WaitForSingleObjectEx+0x93
0000008c61d7e0d0 00007ffb36416915 : 0000000000000000 0000008c61d7e2d0 0000000000000804 0000008c61d7e1b0 : coreclr!WKS::GCHeap::WaitUntilGCComplete+0x30
0000008c61d7e100 00007ffb364e8328 : 00007ffad68130c0 0000000000000000 0000000000000000 0000027df97fc570 : coreclr!Thread::RareDisablePreemptiveGC+0x9d
0000008c61d7e190 00007ffb3659ea2d : 00007ffad68130c0 0000000000000000 000002be906cedf0 0000000100000000 : coreclr!JIT_ReversePInvokeEnterRare2+0x18
0000008c61d7e1c0 00007ffad7e5b718 : 0000000000000004 0000008c61d7e260 0000000000000000 00007ffcc166598d : coreclr!JIT_ReversePInvokeEnterTrackTransitions+0x9d13d
0000008c61d7e1f0 0000000000000004 : 0000008c61d7e260 0000000000000000 00007ffcc166598d 0000000000000000 : 0x00007ffad7e5b718
0000008c61d7e1f8 0000008c61d7e260 : 0000000000000000 00007ffcc166598d 0000000000000000 0000008c61d7e1f0 : 0x4
0000008c61d7e200 0000000000000000 : 0000000000000000 0000000000000000 0000000000000000 0000000000000000 : 0x0000008c61d7e260
STACK_COMMAND: ~0s; .ecxr ; kb
FAULTING_SOURCE_LINE: D:\a_work\1\s\src\coreclr\gc\gcee.cpp
FAULTING_SOURCE_FILE: D:\a_work\1\s\src\coreclr\gc\gcee.cpp
FAULTING_SOURCE_LINE_NUMBER: 265
FAULTING_SOURCE_SRV_COMMAND: https://raw.githubusercontent.com/dotnet/runtime/9d5a6a9aa463d6d10b0b0ba6d5982cc82f363dc3/src/coreclr/gc/gcee.cpp
FAULTING_SOURCE_CODE:
No source found for 'D:\a_work\1\s\src\coreclr\gc\windows\gcenv.windows.cpp'
SYMBOL_NAME: coreclr!WKS::GCHeap::WaitUntilGCComplete+30
MODULE_NAME: coreclr
IMAGE_NAME: coreclr.dll
FAILURE_BUCKET_ID: BREAKPOINT_80000003_coreclr.dll!WKS::GCHeap::WaitUntilGCComplete
OS_VERSION: 10.0.17763.1
BUILDLAB_STR: rs5_release
OSPLATFORM_TYPE: x64
OSNAME: Windows 10
IMAGE_VERSION: 9.0.24.52809
FAILURE_ID_HASH: {54e9a6da-d4d0-d004-574b-4219b46bdb8d}
Followup: MachineOwner
Reproduction Steps
Not possible. Happens randomly
Expected behavior
Not freezing
Actual behavior
Process completely unresponsive
Regression?
No response
Known Workarounds
No response
Configuration
No response
Other information
No response