Process not responsive dump indicates garbage collection #110350

Open
tornie2 opened this issue Dec 3, 2024 · 25 comments · Fixed by #110589
Labels: area-VM-coreclr, in-pr (there is an active PR which will close this issue when it is merged), os-windows

Comments

@tornie2

tornie2 commented Dec 3, 2024

Description

After upgrading to .NET 9, random processes freeze and become completely unresponsive.
The processes run as Windows services on a Windows VM.

I have a dump file, which I could send to you.
I would rather not make it public, as it probably contains passwords.

Analyzing the dump indicates a possible problem in the garbage collector

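0:000> !analyze -v
*******************************************************************************
*                                                                             *
*                        Exception Analysis                                   *
*                                                                             *
*******************************************************************************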

KEY_VALUES_STRING: 1

Key  : Analysis.CPU.mSec
Value: 1484

Key  : Analysis.Elapsed.mSec
Value: 5300

Key  : Analysis.IO.Other.Mb
Value: 0

Key  : Analysis.IO.Read.Mb
Value: 1

Key  : Analysis.IO.Write.Mb
Value: 1

Key  : Analysis.Init.CPU.mSec
Value: 781

Key  : Analysis.Init.Elapsed.mSec
Value: 120611

Key  : Analysis.Memory.CommitPeak.Mb
Value: 223

Key  : Analysis.Version.DbgEng
Value: 10.0.27725.1000

Key  : Analysis.Version.Description
Value: 10.2408.27.01 amd64fre

Key  : Analysis.Version.Ext
Value: 1.2408.27.1

Key  : CLR.Engine
Value: CORECLR

Key  : CLR.Version
Value: 9.0.24.52809

Key  : Failure.Bucket
Value: BREAKPOINT_80000003_coreclr.dll!WKS::GCHeap::WaitUntilGCComplete

Key  : Failure.Hash
Value: {54e9a6da-d4d0-d004-574b-4219b46bdb8d}

Key  : Failure.Source.FileLine
Value: 265

Key  : Failure.Source.FilePath
Value: D:\a\_work\1\s\src\coreclr\gc\gcee.cpp

Key  : Failure.Source.SourceServerCommand
Value: raw.githubusercontent.com/dotnet/runtime/9d5a6a9aa463d6d10b0b0ba6d5982cc82f363dc3/src/coreclr/gc/gcee.cpp

Key  : Timeline.OS.Boot.DeltaSec
Value: 896327

Key  : Timeline.Process.Start.DeltaSec
Value: 17922

Key  : WER.OS.Branch
Value: rs5_release

Key  : WER.OS.Version
Value: 10.0.17763.1

Key  : WER.Process.Version
Value: 1.0.0.0

FILE_IN_CAB: SmfHaircuts.Service-2024-12-03-YB6213.DMP

NTGLOBALFLAG: 0

APPLICATION_VERIFIER_FLAGS: 0

EXCEPTION_RECORD: (.exr -1)
ExceptionAddress: 0000000000000000
ExceptionCode: 80000003 (Break instruction exception)
ExceptionFlags: 00000000
NumberParameters: 0

FAULTING_THREAD: 00000f6c

PROCESS_NAME: SmfHaircuts.Service.dll

ERROR_CODE: (NTSTATUS) 0x80000003 - {EXCEPTION} Breakpoint A breakpoint has been reached.

EXCEPTION_CODE_STR: 80000003

STACK_TEXT:
0000008c61d7e028 00007ffcbdba0f33 : 0000000000000000 000002be90a8dab0 000002be90a8d9f0 0000008c61d7e170 : ntdll!NtWaitForSingleObject+0x14
0000008c61d7e030 00007ffb364f1c30 : 0000000000000000 00004612d1730f35 0000000000000000 0000000000000284 : KERNELBASE!WaitForSingleObjectEx+0x93
0000008c61d7e0d0 00007ffb36416915 : 0000000000000000 0000008c61d7e2d0 0000000000000804 0000008c61d7e1b0 : coreclr!WKS::GCHeap::WaitUntilGCComplete+0x30
0000008c61d7e100 00007ffb364e8328 : 00007ffad68130c0 0000000000000000 0000000000000000 0000027df97fc570 : coreclr!Thread::RareDisablePreemptiveGC+0x9d
0000008c61d7e190 00007ffb3659ea2d : 00007ffad68130c0 0000000000000000 000002be906cedf0 0000000100000000 : coreclr!JIT_ReversePInvokeEnterRare2+0x18
0000008c61d7e1c0 00007ffad7e5b718 : 0000000000000004 0000008c61d7e260 0000000000000000 00007ffcc166598d : coreclr!JIT_ReversePInvokeEnterTrackTransitions+0x9d13d
0000008c61d7e1f0 0000000000000004 : 0000008c61d7e260 0000000000000000 00007ffcc166598d 0000000000000000 : 0x00007ffad7e5b718
0000008c61d7e1f8 0000008c61d7e260 : 0000000000000000 00007ffcc166598d 0000000000000000 0000008c61d7e1f0 : 0x4
0000008c61d7e200 0000000000000000 : 0000000000000000 0000000000000000 0000000000000000 0000000000000000 : 0x0000008c61d7e260

STACK_COMMAND: ~0s; .ecxr ; kb

FAULTING_SOURCE_LINE: D:\a\_work\1\s\src\coreclr\gc\gcee.cpp

FAULTING_SOURCE_FILE: D:\a\_work\1\s\src\coreclr\gc\gcee.cpp

FAULTING_SOURCE_LINE_NUMBER: 265

FAULTING_SOURCE_SRV_COMMAND: https://raw.githubusercontent.com/dotnet/runtime/9d5a6a9aa463d6d10b0b0ba6d5982cc82f363dc3/src/coreclr/gc/gcee.cpp

FAULTING_SOURCE_CODE:
No source found for 'D:\a\_work\1\s\src\coreclr\gc\windows\gcenv.windows.cpp'

SYMBOL_NAME: coreclr!WKS::GCHeap::WaitUntilGCComplete+30

MODULE_NAME: coreclr

IMAGE_NAME: coreclr.dll

FAILURE_BUCKET_ID: BREAKPOINT_80000003_coreclr.dll!WKS::GCHeap::WaitUntilGCComplete

OS_VERSION: 10.0.17763.1

BUILDLAB_STR: rs5_release

OSPLATFORM_TYPE: x64

OSNAME: Windows 10

IMAGE_VERSION: 9.0.24.52809

FAILURE_ID_HASH: {54e9a6da-d4d0-d004-574b-4219b46bdb8d}

Followup: MachineOwner

Reproduction Steps

Not possible. Happens randomly.

Expected behavior

Not freezing

Actual behavior

Process completely unresponsive

Regression?

No response

Known Workarounds

No response

Configuration

No response

Other information

No response

dotnet-issue-labeler bot added the needs-area-label label (an area label is needed to ensure this gets routed to the appropriate area owners) Dec 3, 2024
dotnet-policy-service bot added the untriaged label (new issue has not been triaged by the area owner) Dec 3, 2024
Contributor

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

vcsjones removed the needs-area-label label (an area label is needed to ensure this gets routed to the appropriate area owners) Dec 3, 2024
@mangod9
Member

mangod9 commented Dec 3, 2024

Does this happen during startup?

This feels similar to #105780. Are you able to try disabling the new GC mode with DOTNET_GCDynamicAdaptationMode=0?

The fix for this issue should be included in the Jan servicing release for 9.0.

@tornie2
Author

tornie2 commented Dec 3, 2024

It is not during startup. Usually the processes run for days before this happens.

I have added this to the csproj of the exe of the process where we have seen this most often.
Will this be a temporary fix until the fix is released?

<ItemGroup>
	<RuntimeHostConfigurationOption Include="DOTNET_GCDynamicAdaptationMode" Value="0" />
</ItemGroup>

@mangod9
Member

mangod9 commented Dec 3, 2024

If it's not during startup it could be a different issue. If it has been reproing frequently then yeah disabling DOTNET_GCDynamicAdaptationMode would be worth a try. If you are able to share a dump privately that would help in confirming if it's the same issue.

@tornie2
Author

tornie2 commented Dec 4, 2024

I hope you got an e-mail with a link to the dump file

@mangod9
Member

mangod9 commented Dec 5, 2024

Thanks for sharing the dump. This isn't related to the DATAS issue I pointed to earlier. In fact, the app is using WKS GC. The dump shows something similar to #107800, but in this case it's a thread shutdown racing with the GC trying to create a BGC thread, so it looks to be a deadlock between DetachThread + CreateThread? @VSadov @kouvel @jkotas have you seen something similar before?

 # Child-SP          RetAddr               Call Site
00 000000de`8877f298 00007ff8`dce10f33     ntdll!ZwWaitForSingleObject+0x14 [minkernel\ntdll\daytona\objfre\amd64\usrstubs.asm @ 211] 
01 000000de`8877f2a0 00007ff8`5a3f1c30     KERNELBASE!WaitForSingleObjectEx+0x93 [minkernel\kernelbase\synch.c @ 1328] 
02 (Inline Function) --------`--------     coreclr!GCEvent::Impl::Wait+0xf [D:\a\_work\1\s\src\coreclr\gc\windows\gcenv.windows.cpp @ 1381] 
03 (Inline Function) --------`--------     coreclr!GCEvent::Wait+0x16 [D:\a\_work\1\s\src\coreclr\gc\windows\gcenv.windows.cpp @ 1431] 
04 000000de`8877f340 00007ff8`5a316915     coreclr!WKS::GCHeap::WaitUntilGCComplete+0x30 [D:\a\_work\1\s\src\coreclr\gc\gcee.cpp @ 265] 
05 000000de`8877f370 00007ff8`5a2cc828     coreclr!Thread::RareDisablePreemptiveGC+0x9d [D:\a\_work\1\s\src\coreclr\vm\threadsuspend.cpp @ 2212] 
06 (Inline Function) --------`--------     coreclr!Thread::DisablePreemptiveGC+0x1f [D:\a\_work\1\s\src\coreclr\vm\threads.h @ 1297] 
07 (Inline Function) --------`--------     coreclr!GCHolderBase::EnterInternalCoop+0x37 [D:\a\_work\1\s\src\coreclr\vm\threads.h @ 4712] 
08 000000de`8877f400 00007ff8`5a3ac7b8     coreclr!GCCoop::GCCoop+0x54 [D:\a\_work\1\s\src\coreclr\vm\threads.h @ 4832] 
09 000000de`8877f430 00007ff8`5a3ac6ea     coreclr!Thread::CooperativeCleanup+0x24 [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 2737] 
0a 000000de`8877f480 00007ff8`5a3ac60e     coreclr!Thread::DetachThread+0x9a [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 936] 
0b 000000de`8877f4b0 00007ff8`5a408c83     coreclr!TlsDestructionMonitor::~TlsDestructionMonitor+0x62 [D:\a\_work\1\s\src\coreclr\vm\ceemain.cpp @ 1744] 
0c 000000de`8877f4f0 00007ff8`dfe75d37     coreclr!__dyn_tls_dtor+0x63 [D:\a\_work\1\s\src\vctools\crt\vcstartup\src\tls\tlsdtor.cpp @ 119] 
0d 000000de`8877f520 00007ff8`dfe75e6b     ntdll!LdrpCallInitRoutine+0x6f [minkernel\ntdll\ldr.c @ 212] 
0e 000000de`8877f590 00007ff8`dfe733e1     ntdll!LdrpCallTlsInitializers+0x87 [minkernel\ntdll\ldrtls.c @ 1067] 
0f 000000de`8877f610 00007ff8`dfeaa92e     ntdll!LdrShutdownThread+0x141 [minkernel\ntdll\ldrinit.c @ 6354] 
10 000000de`8877f710 00007ff8`dfe66f06     ntdll!RtlExitUserThread+0x3e [minkernel\ntdll\rtlstrt.c @ 2110] 
11 000000de`8877f750 00007ff8`df897ac4     ntdll!TppWorkerThread+0xbe6 [minkernel\threadpool\ntdll\worker.c @ 1286] 
12 000000de`8877fa40 00007ff8`dfeaa8c1     kernel32!BaseThreadInitThunk+0x14 [base\win32\client\thread.c @ 64] 
13 000000de`8877fa70 00000000`00000000     ntdll!RtlUserThreadStart+0x21 [minkernel\ntdll\rtlstrt.c @ 1163] 
00 000000de`8727f128 00007ff8`dce10f33     ntdll!ZwWaitForSingleObject+0x14 [minkernel\ntdll\daytona\objfre\amd64\usrstubs.asm @ 211] 
01 000000de`8727f130 00007ff8`5a3895d4     KERNELBASE!WaitForSingleObjectEx+0x93 [minkernel\kernelbase\synch.c @ 1328] 
02 (Inline Function) --------`--------     coreclr!CLREventWaitHelper2+0x6 [D:\a\_work\1\s\src\coreclr\vm\synch.cpp @ 372] 
03 000000de`8727f1d0 00007ff8`5a3ab8f2     coreclr!CLREventWaitHelper+0x20 [D:\a\_work\1\s\src\coreclr\vm\synch.cpp @ 397] 
04 (Inline Function) --------`--------     coreclr!CLREventBase::WaitEx+0x10 [D:\a\_work\1\s\src\coreclr\vm\synch.cpp @ 466] 
05 (Inline Function) --------`--------     coreclr!CLREventBase::Wait+0x10 [D:\a\_work\1\s\src\coreclr\vm\synch.cpp @ 412] 
06 000000de`8727f230 00007ff8`5a3abb98     coreclr!`anonymous namespace'::CreateSuspendableThread+0x10e [D:\a\_work\1\s\src\coreclr\vm\gcenv.ee.cpp @ 1481] 
07 000000de`8727f300 00007ff8`5a400ac8     coreclr!GCToEEInterface::CreateThread+0x154 [D:\a\_work\1\s\src\coreclr\vm\gcenv.ee.cpp @ 1568] 
08 (Inline Function) --------`--------     coreclr!WKS::gc_heap::create_bgc_thread+0x18 [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 39484] 
09 000000de`8727f4e0 00007ff8`5a314860     coreclr!WKS::gc_heap::prepare_bgc_thread+0x4c [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 39443] 
0a 000000de`8727f510 00007ff8`5a31756e     coreclr!WKS::gc_heap::garbage_collect+0x2e4 [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 24384] 
0b 000000de`8727f560 00007ff8`5a5ab2a0     coreclr!WKS::GCHeap::GarbageCollectGeneration+0x13e [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 51065] 
0c 000000de`8727f5c0 00007ff8`5a4751b3     coreclr!WKS::GCHeap::GarbageCollect+0x110 [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 50191] 
0d (Inline Function) --------`--------     coreclr!ThreadStore::TriggerGCForDeadThreadsIfNecessary+0xec2cc [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 5483] 
0e 000000de`8727f600 00007ff8`5a38942a     coreclr!Thread::DoExtraWorkForFinalizer+0xec3af [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 7051] 
0f 000000de`8727f670 00007ff8`5a385131     coreclr!FinalizerThread::FinalizerThreadWorker+0xca [D:\a\_work\1\s\src\coreclr\vm\finalizerthread.cpp @ 407] 
10 (Inline Function) --------`--------     coreclr!ManagedThreadBase_DispatchInner+0xd [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 7110] 
11 000000de`8727f8c0 00007ff8`5a38504b     coreclr!ManagedThreadBase_DispatchMiddle+0x81 [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 7154] 
12 000000de`8727f970 00007ff8`5a3c4201     coreclr!ManagedThreadBase_DispatchOuter+0xab [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 7313] 
13 (Inline Function) --------`--------     coreclr!ManagedThreadBase_NoADTransition+0x28 [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 7382] 
14 (Inline Function) --------`--------     coreclr!ManagedThreadBase::FinalizerBase+0x28 [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 7401] 
15 000000de`8727fa10 00007ff8`df897ac4     coreclr!FinalizerThread::FinalizerThreadStart+0x91 [D:\a\_work\1\s\src\coreclr\vm\finalizerthread.cpp @ 464] 
16 000000de`8727fb20 00007ff8`dfeaa8c1     kernel32!BaseThreadInitThunk+0x14 [base\win32\client\thread.c @ 64] 
17 000000de`8727fb50 00000000`00000000     ntdll!RtlUserThreadStart+0x21 [minkernel\ntdll\rtlstrt.c @ 1163] 

@jkotas
Member

jkotas commented Dec 5, 2024

> have you seen something similar before?

I believe that this is one of the reasons why native AOT uses FiberDetachCallback for thread shutdown notifications. (FiberDetachCallback does not run under loader lock.)

@VSadov
Member

VSadov commented Dec 5, 2024

But from the stacks it looks like both threads wait on CLR events/locks. So maybe this is not related to OS lock?

@mangod9
Member

mangod9 commented Dec 5, 2024

Here is the stack of the BGC thread. It's started but doesn't get to gc_heap::bgc_thread_stub due to the loader lock. There are other threads in a similar state.

  22  Id: 55ec.2e90 Suspend: 0 Teb: 000000de`86728000 Unfrozen ".NET BGC"
 # RetAddr               : Args to Child                                                           : Call Site
00 00007ff8`dfe783f5     : 00007ff8`dffb52b0 000000de`8664d000 000000de`8757f3c0 00000000`00002000 : ntdll!ZwWaitForSingleObject+0x14 [minkernel\ntdll\daytona\objfre\amd64\usrstubs.asm @ 211] 
01 00007ff8`dfe735f7     : 00000000`00000000 00000000`00000000 000000de`8664d000 00000000`0000000f : ntdll!LdrpDrainWorkQueue+0x15d [minkernel\ntdll\ldrmap.c @ 3142] 
02 00007ff8`dfec8b25     : 00000000`00000000 00000000`00000000 00000000`00000001 00000000`00000000 : ntdll!LdrpInitializeThread+0x8b [minkernel\ntdll\ldrinit.c @ 6528] 
03 00007ff8`dfec8703     : 00000000`00000000 00007ff8`dfe50000 00000000`00000000 000000de`86728000 : ntdll!_LdrpInitialize+0x409 [minkernel\ntdll\ldrinit.c @ 1838] 
04 00007ff8`dfec86ae     : 000000de`8757f3c0 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!LdrpInitialize+0x3b [minkernel\ntdll\ldrinit.c @ 1435] 
05 00000000`00000000     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!LdrInitializeThunk+0xe [minkernel\ntdll\ldrstart.c @ 91] 

  23  Id: 55ec.6370 Suspend: 0 Teb: 000000de`8672a000 Unfrozen
 # RetAddr               : Args to Child                                                           : Call Site
00 00007ff8`dfe783f5     : 00007ff8`dffb52b0 000000de`8664d000 000000de`87b7f460 00000000`00002000 : ntdll!ZwWaitForSingleObject+0x14 [minkernel\ntdll\daytona\objfre\amd64\usrstubs.asm @ 211] 
01 00007ff8`dfe735f7     : 00000000`00000000 00000000`00000000 000000de`8664d000 00000000`0000000f : ntdll!LdrpDrainWorkQueue+0x15d [minkernel\ntdll\ldrmap.c @ 3142] 
02 00007ff8`dfec8b25     : 00000000`00000000 00000000`00000000 00000000`00000001 00000000`00000000 : ntdll!LdrpInitializeThread+0x8b [minkernel\ntdll\ldrinit.c @ 6528] 
03 00007ff8`dfec8703     : 00000000`00000000 00007ff8`dfe50000 00000000`00000000 000000de`8672a000 : ntdll!_LdrpInitialize+0x409 [minkernel\ntdll\ldrinit.c @ 1838] 
04 00007ff8`dfec86ae     : 000000de`87b7f460 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!LdrpInitialize+0x3b [minkernel\ntdll\ldrinit.c @ 1435] 
05 00000000`00000000     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!LdrInitializeThunk+0xe [minkernel\ntdll\ldrstart.c @ 91] 

@VSadov
Member

VSadov commented Dec 5, 2024

Ah, I see - CreateNonSuspendableThread waits on an event that will be set by the spawned thread. It just needs to see progress from the spawned thread to declare that thread creation was successful.

Thus it is indeed possible that the spawned thread creation/progress needs the same loader lock that the thread that is shutting down is holding.
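
The wait cycle described in the comments above can be modeled in a few lines. This is only a sketch in Python, not runtime code: `loader_lock`, `gc_done`, and `thread_started` are hypothetical stand-ins for the Windows loader lock, the GC-complete event, and the startup event that the creating thread waits on. Timeouts are used so the demo terminates instead of hanging like the real process.

```python
import threading

def simulate_detach_vs_bgc_create(timeout=0.2):
    """Model the three-way wait seen in the dump; returns what each party observed."""
    loader_lock = threading.Lock()        # stand-in for the OS loader lock
    gc_done = threading.Event()           # stand-in for "GC complete"
    thread_started = threading.Event()    # set by the spawned thread once it runs
    detach_holds_lock = threading.Event()
    release_all = threading.Event()
    result = {}

    def detaching_thread():
        # Thread shutdown callbacks run while the loader lock is held.
        with loader_lock:
            detach_holds_lock.set()
            # Switching to cooperative mode blocks until the GC finishes.
            result["detach_saw_gc_done"] = gc_done.wait(timeout)
            release_all.wait()  # keep holding the lock until the demo ends

    def spawned_bgc_thread():
        # A new thread must get through loader initialization before it runs.
        got = loader_lock.acquire(timeout=timeout)
        result["spawned_got_loader_lock"] = got
        if got:
            thread_started.set()
            loader_lock.release()

    def gc_thread():
        detach_holds_lock.wait()
        worker = threading.Thread(target=spawned_bgc_thread)
        worker.start()
        # The creator waits for the spawned thread to report progress.
        result["creator_saw_start"] = thread_started.wait(timeout)
        worker.join()
        if result["creator_saw_start"]:
            gc_done.set()

    detach = threading.Thread(target=detaching_thread)
    gc = threading.Thread(target=gc_thread)
    detach.start()
    gc.start()
    gc.join()
    release_all.set()
    detach.join()
    return result
```

In this model none of the three parties makes progress: the detaching thread never sees the GC finish, the spawned thread never gets the loader lock, and the creator never sees the startup event.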

@mangod9
Member

mangod9 commented Dec 5, 2024

Thread::CooperativeCleanup is new in 9, perhaps it should synchronize on whether a GC is in progress (well, mainly if it's trying to spawn a BGC thread)?

@mangod9
Member

mangod9 commented Dec 5, 2024

@tornie2, as a temporary workaround you can disable background GC and confirm that it avoids the issue.

mangod9 removed the untriaged label (new issue has not been triaged by the area owner) Dec 6, 2024
mangod9 added this to the 10.0.0 milestone Dec 6, 2024
@jkotas
Member

jkotas commented Dec 6, 2024

> Thread::CooperativeCleanup is new in 9

This method is new in .NET 9 (I introduced it in https://github.com/dotnet/runtime/pull/103877/files#diff-f5835c4b5fd134e52b4127bb4ffb7e5ad439673a429dc7ea46d53e7a5bca0529R2734). There was code on the thread shutdown path that switched to cooperative mode before .NET 9 as well, so the fundamental problem is not new.

> perhaps it should synchronize whether GC is in progress

This is a classic A-B B-A deadlock. The two locks in question are our thread store lock, which is taken when the runtime is suspended for the GC, and the Windows OS loader lock, which is taken when threads are created or destroyed by the Windows OS. We would either need to get these locks ordered (that's pretty hard) or avoid taking them in conflicting order. The FiberDetachCallback that's used for thread shutdown notifications in native AOT does the latter.
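
To illustrate the lock-ordering point, here is a small sketch in Python rather than runtime code. The `thread_store` and `loader` locks are hypothetical stand-ins for the two locks named above, and `run_pair` is an illustrative helper, not anything from the runtime. It shows only the general property that conflicting acquisition orders can block both threads while a consistent order cannot; it is not the fix that was made.

```python
import threading

def run_pair(order_one, order_two, timeout=0.2):
    """Run two threads that each take two locks in the given order.

    Returns [bool, bool]: whether each thread managed to take its second lock.
    """
    # Hypothetical stand-ins for the thread store lock and the OS loader lock.
    locks = {"thread_store": threading.Lock(), "loader": threading.Lock()}
    conflicting = order_one[0] != order_two[0]
    # With conflicting orders, force the bad interleaving so the demo is deterministic.
    barrier = threading.Barrier(2) if conflicting else None
    results = [None, None]

    def worker(index, order):
        first, second = locks[order[0]], locks[order[1]]
        with first:
            if barrier is not None:
                barrier.wait()  # both threads now hold their first lock
            got_second = second.acquire(timeout=timeout)
            if got_second:
                second.release()
            results[index] = got_second
            if barrier is not None:
                barrier.wait()  # keep holding `first` until both attempts are done

    threads = [
        threading.Thread(target=worker, args=(0, order_one)),
        threading.Thread(target=worker, args=(1, order_two)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling `run_pair(("thread_store", "loader"), ("loader", "thread_store"))` returns `[False, False]`, while giving both threads the same order returns `[True, True]`.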

Contributor

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

@jkotas
Member

jkotas commented Dec 6, 2024

We are doing more work in cooperative mode during thread shutdown in .NET 9, which makes the deadlock more likely to be hit. I think that this issue is a .NET 9 servicing candidate. @VSadov Is this something that you can take a look at?

@VSadov
Member

VSadov commented Dec 6, 2024

Another way to look at it: when a thread terminates, it needs to make its TLAB (Thread-Local Allocation Buffer) parseable so that it does not leave a portion of the GC heap in an unparseable state. That is done in the FixAllocContext call and cannot be done while a GC is in progress, so at least that part of thread termination needs COOP mode. It would be hard to get around.

At the same time, if the terminating thread waits for the GC to complete (while holding the loader lock) and the GC needs to launch a thread to make progress, there could be a deadlock.

This is indeed not new at all. I wonder a bit why we did not see this earlier. Perhaps threads terminating while a GC is in progress was not common for some reason.

@VSadov
Member

VSadov commented Dec 6, 2024

> @VSadov Is this something that you can take a look at?

Yes, I can take a look.
Using FiberDetachCallback like in NativeAOT should not have this issue as it does not run user code while holding loader lock.

VSadov self-assigned this Dec 6, 2024
@mangod9
Member

mangod9 commented Dec 6, 2024

Thanks @VSadov for thinking about a fix.

> I think that this issue is .NET 9 servicing candidate

yeah certainly would be good to fix in the next servicing release.

@tornie2
Author

tornie2 commented Dec 6, 2024

I have added this to the project where we have seen the problem the most
If this service can run for a week without failing, then I would be optimistic this is the issue

<PropertyGroup>
	<ConcurrentGarbageCollection>false</ConcurrentGarbageCollection>
</PropertyGroup>

dotnet-policy-service bot added the in-pr label (there is an active PR which will close this issue when it is merged) Dec 10, 2024
@tornie2
Author

tornie2 commented Dec 12, 2024

This has been closed
Does that mean a fix has been produced?

@mangod9
Member

mangod9 commented Dec 12, 2024

It's been fixed in main and will be backported to 9; it should be available in the Feb servicing release.

@mangod9
Member

mangod9 commented Dec 12, 2024

@tornie2 since you are able to repro this frequently, would you be able to try a private build to ensure the fix resolves the issue for you? Thx

@tornie2
Author

tornie2 commented Dec 13, 2024

Can you deliver the build as an SDK?

This is in our pipeline. It takes the SDK from our local JFrog Artifactory
I could add a private build SDK to our JFrog Artifactory

steps:
- task: DotNetCoreInstaller@0
  displayName: 'Use latest .NET Core sdk'
  inputs:
    version: 9.0.100

@jkotas
Member

jkotas commented Dec 18, 2024

The fix was reverted in #110801 since it broke WinForms tests.

jkotas reopened this Dec 18, 2024
@mdonatas-trafi

We don't have a dump, so we are not completely sure, but two of our microservices that experience high load seem to be suffering from this on Debian-based images; we had to roll back to net8.0.
This turns out to be more disruptive than a sudden crash: because of our health-check configuration, the process is terminated only after 5 minutes, leaving the service unable to serve requests for ~6 minutes.

P.S. To get a memory dump, it would have to be done somehow in ECS (AWS), maybe with a script running in parallel and making the same health-check calls? This part is also not clear at the moment.
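
One way such a parallel script could look is sketched below in Python. It is a sketch under assumptions, not a tested ECS recipe: it assumes the `dotnet-dump` tool is installed in the container, that the watchdog runs alongside the service and knows its PID, and that the health endpoint URL is known. `is_healthy`, `collect_dump`, and `watch` are hypothetical helper names. Whether `dotnet-dump` can still capture a process hung in this particular state has not been verified here.

```python
import subprocess
import time
import urllib.request

def is_healthy(url, timeout_s=5.0):
    """Return True if the health endpoint answers with a 2xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except Exception:
        # Timeouts, refused connections, and HTTP errors all count as unhealthy.
        return False

def collect_dump(pid, output_path):
    """Ask dotnet-dump to write a dump of the given process (tool assumed installed)."""
    subprocess.run(
        ["dotnet-dump", "collect", "-p", str(pid), "-o", output_path],
        check=True,
    )

def watch(check, dump, max_failures=3, interval_s=10.0, sleep=time.sleep):
    """Poll check(); call dump() once after max_failures consecutive failures."""
    failures = 0
    while True:
        if check():
            failures = 0
        else:
            failures += 1
            if failures >= max_failures:
                dump()
                return failures
        sleep(interval_s)
```

A call might look like `watch(lambda: is_healthy("http://localhost:8080/health"), lambda: collect_dump(1, "/tmp/hang.dmp"))`, where the URL, PID, and output path are placeholders. The failure threshold would need to trigger before the 5-minute health-check termination mentioned above.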

P.P.S. This comment is basically a +1 so you can gauge the impact.

7 participants