Windows guest bluescreen with hypervisor-fw #153

Open
weltling opened this issue Sep 22, 2021 · 5 comments
Comments

@weltling
Member

A Windows guest running on CH with hypervisor-fw instead of OVMF doesn't shut down correctly and hits a bluescreen:

SAC>                                                                            
The SAC will become unavailable soon.  The computer is shutting down.           
                                                                                
SAC><?xml><BP>                                                                  
<INSTANCE CLASSNAME="BLUESCREEN">                                               
<PROPERTY NAME="STOPCODE" TYPE="string"><VALUE>"0x7E"</VALUE></PROPERTY><machine-info>
<name>WIN-L3C8M6IQS0Q</name>                                                    
<guid>00000000-0000-0000-0000-000000000000</guid>                               
<processor-architecture>AMD64</processor-architecture>                          
<os-version>10.0</os-version>                                                   
<os-build-number>17763</os-build-number>                                        
<os-product>Windows Server 2019</os-product>                                    
<os-service-pack>None</os-service-pack>                                         
</machine-info>                                                                 

</INSTANCE>
</BP>
!SAC>
SYSTEM_THREAD_EXCEPTION_NOT_HANDLED


0xFFFFFFFF80000003
0xFFFFF80137EB3B7B
0xFFFFDB8C5996F388
0xFFFFDB8C5996EBD0
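
For reference, the four values above are the bugcheck parameters. The following is a minimal sketch of decoding them, assuming the layout Microsoft documents for bug check 0x7E (exception code, address where the exception occurred, exception record address, context record address). The helper name `decode_0x7e` and the small status table are made up for illustration.

```python
# Documented parameter order for bug check 0x7E (SYSTEM_THREAD_EXCEPTION_NOT_HANDLED):
#   1: exception code, 2: address where the exception occurred,
#   3: exception record address, 4: context record address.
KNOWN_STATUS = {
    0x80000003: "STATUS_BREAKPOINT",
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}


def decode_0x7e(params):
    """Hypothetical helper: label the four bugcheck parameters."""
    # NTSTATUS is a 32-bit value; the dump shows it sign-extended to 64 bits.
    code = params[0] & 0xFFFFFFFF
    return {
        "exception_code": code,
        "exception_name": KNOWN_STATUS.get(code, "unknown"),
        "fault_address": params[1],
        "exception_record": params[2],
        "context_record": params[3],
    }


# Values copied from the KVM bluescreen output above.
kvm = [
    0xFFFFFFFF80000003,
    0xFFFFF80137EB3B7B,
    0xFFFFDB8C5996F388,
    0xFFFFDB8C5996EBD0,
]
info = decode_0x7e(kvm)
print(hex(info["exception_code"]), info["exception_name"])
```

The first parameter decodes to a breakpoint exception, which is what a kernel-mode breakpoint produces when no debugger is attached to handle it.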

The Cloud Hypervisor process then hangs and doesn't terminate. To reproduce, boot the guest and hit the shutdown button. This issue doesn't happen with OVMF.

As OVMF is currently used for the tests and seems to be the most stable option, we should first clarify the priority of switching to hypervisor-fw.

The guest will need to be debugged the usual way, first of all to identify the issue. Any hints on debugging from the firmware side would be helpful, too.

@rbradford
Member

@weltling MSHV or KVM? I know we test RFW against Windows on its CI.

@rbradford
Member

(But we might not test shutdown.)

@rbradford transferred this issue from cloud-hypervisor/cloud-hypervisor Sep 23, 2021

@weltling
Member Author

The description is about KVM. With MSHV it looks the same, with stop code 0x7E and this exception:

0xFFFFFFFF80000003
0xFFFFF8015DC4FB7B
0xFFFFEF03AB0BD388
0xFFFFEF03AB0BCBD0
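
To check how similar the two crashes are despite the different absolute addresses, one option is to compare the low 12 bits (the offset within a 4 KiB page) of each address and the distance between the last two parameters. This is only a rough sketch: it assumes the two boots differ mainly in where the kernel and stacks were placed, and the function names are made up for illustration.

```python
PAGE_MASK = 0xFFF  # low 12 bits: offset within a 4 KiB page

# Parameters 2..4 from the KVM report and from the MSHV run above.
kvm = (0xFFFFF80137EB3B7B, 0xFFFFDB8C5996F388, 0xFFFFDB8C5996EBD0)
mshv = (0xFFFFF8015DC4FB7B, 0xFFFFEF03AB0BD388, 0xFFFFEF03AB0BCBD0)


def page_offsets(params):
    """Offset of each address within its page."""
    return [p & PAGE_MASK for p in params]


def record_gap(params):
    """Distance between the third and fourth bugcheck parameters."""
    return params[1] - params[2]


print([hex(x) for x in page_offsets(kvm)])
print([hex(x) for x in page_offsets(mshv)])
print(hex(record_gap(kvm)), hex(record_gap(mshv)))
```

Both runs give the same page offsets and the same gap, which is consistent with the same code location being hit under both hypervisors, although it does not prove it.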

We indeed don't explicitly test shutdown in the CH integration tests; each test just waits 1 minute and then kills the guest. I have at least one similar shutdown issue to report to CH (not hypervisor-fw related), but I'm still digging.

I'll also try running the integration tests with the latest hypervisor-fw swapped in.

Thanks

@weltling
Member Author

I patched the script locally to pick hypervisor-fw instead of OVMF and ran the integration test suite under KVM. It doesn't show any firmware-specific issues. As expected, this issue is not caught by the tests. test_windows_guest_netdev_hotplug might be a bit unstable, but that's not relevant to this particular report.

Given that OVMF is currently used, we need to clarify the priority of switching to hypervisor-fw. A work item to split off from here could be adding an explicit shutdown test to the CH integration suite. While shutdown crashes are probably not that bad, they would still be nice to fix.
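
The core of such a shutdown test is checking that the VMM process exits on its own within a deadline after the guest is told to shut down. The sketch below shows only that wait-with-timeout part and is not taken from the CH test suite: the helper name `exits_within` is hypothetical, and two short Python child processes stand in for a VMM that terminates cleanly and one that hangs.

```python
import subprocess
import sys


def exits_within(cmd, timeout_s):
    """Start cmd and report whether it terminates by itself within timeout_s seconds."""
    proc = subprocess.Popen(cmd)
    try:
        proc.wait(timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        # A hung VMM would end up here; clean it up so the test run can continue.
        proc.kill()
        proc.wait()
        return False


# Stand-ins for the VMM process (assumptions for illustration only).
clean = [sys.executable, "-c", "pass"]
hung = [sys.executable, "-c", "import time; time.sleep(30)"]

print(exits_within(clean, 10.0))
print(exits_within(hung, 1.0))
```

A real test would additionally have to trigger the shutdown from inside the guest and pick a timeout that leaves room for Windows to finish shutting down.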

Thanks

@weltling
Member Author

With a debugger attached, I can see two crashes.

  1. Happens at boot, most likely a timing issue:
0: kd> k
 # Child-SP          RetAddr               Call Site
00 fffff803`18855b78 fffff803`165ec8e8     nt!DbgBreakPointWithStatus
01 fffff803`18855b80 fffff803`1662ed06     nt!KdCheckForDebugBreak+0x928c0
02 fffff803`18855bb0 fffff803`164cb3f4     nt!KeAccumulateTicks+0x1607d6
03 (Inline Function) --------`--------     nt!KiUpdateRunTime+0x43
04 (Inline Function) --------`--------     nt!KiUpdateTime+0x42a
05 fffff803`18855c10 fffff803`16e88332     nt!KeClockInterruptNotify+0x604
06 (Inline Function) --------`--------     hal!HalpTimerClockInterruptEpilogCommon+0xe
07 (Inline Function) --------`--------     hal!HalpTimerClockInterruptCommon+0xdc
08 fffff803`18855f30 fffff803`16425c65     hal!HalpTimerClockInterrupt+0xf2
09 fffff803`18855f60 fffff803`165d03ca     nt!KiCallInterruptServiceRoutine+0xa5
0a fffff803`18855fb0 fffff803`165d0917     nt!KiInterruptSubDispatchNoLockNoEtw+0xfa
0b fffff803`18846590 fffff803`16ea09cf     nt!KiInterruptDispatchNoLockNoEtw+0x37
0c fffff803`18846728 fffff803`1659c816     hal!HalProcessorIdle+0xf
0d fffff803`18846730 fffff803`164cd1bb     nt!PpmIdleDefaultExecute+0x16
0e fffff803`18846760 fffff803`164cc96f     nt!PpmIdleExecuteTransition+0x6bb
0f fffff803`18846a80 fffff803`165d23bc     nt!PoIdle+0x33f
10 fffff803`18846be0 00000000`00000000     nt!KiIdleLoop+0x2c

This one seems to happen because the boot is too slow and the ticks expire too fast. Continuing past it seems to get the system going, though.

  2. This one happens at shutdown:
00 fffff803`18859c10 fffff803`16eb5da4     hal!HalpPowerWriteResetCommand+0x10f
01 fffff803`18859c50 fffff803`16eb7381     hal!HalpInterruptResetThisProcessor+0x164
02 fffff803`18859c80 fffff803`16ebef4a     hal!HalpInterruptRebootService+0x41
03 fffff803`18859cb0 fffff803`166a21d0     hal!HalpPreprocessNmi+0x2a
04 fffff803`18859ce0 fffff803`165d9c02     nt!KiProcessNMI+0x30
05 fffff803`18859d30 fffff803`165d99c6     nt!KxNmiInterrupt+0x82
06 fffff803`18859e70 fffff803`16eb7bba     nt!KiNmiInterrupt+0x206
07 ffff998d`e405f720 fffff803`16eb7863     hal!HalpShutdown+0x2a
08 ffff998d`e405f780 fffff803`16eb7a5e     hal!HalReturnToFirmware
09 ffff998d`e405f7b0 fffff803`169805ce     hal!HalpLegacyShutdown+0xe
0a ffff998d`e405f7e0 fffff803`1698033a     nt!PopHandleNextState+0x1ee
0b ffff998d`e405f830 fffff803`16980030     nt!PopIssueNextState+0x1a
0c ffff998d`e405f860 fffff803`16995010     nt!PopInvokeSystemStateHandler+0x29c
0d ffff998d`e405fa70 fffff803`16993c1a     nt!PopShutdownSystem+0x8c
0e ffff998d`e405fab0 fffff803`1650320a     nt!PopGracefulShutdown+0x2ea
0f ffff998d`e405faf0 fffff803`164709d5     nt!ExpWorkerThread+0x16a
10 ffff998d`e405fb90 fffff803`165d5e3c     nt!PspSystemThreadStartup+0x55
11 ffff998d`e405fbe0 00000000`00000000     nt!KiStartSystemThread+0x1c

Both cases seem to land in a code path that invokes DbgBreakPoint(), and the second one is conditional on the firmware being EFI. Also, I don't seem to hit the second code path at all with OVMF. Comparing what exactly is provided vs. used with hypervisor-fw and OVMF could help too, as the issue still looks firmware dependent.
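
To make the shutdown path easier to read or to diff against a run with other firmware, the `k` output can be reduced to its call sites. This is a small sketch that assumes the frame layout shown in the second stack above; the function name `call_sites` is made up for illustration, and only the innermost eleven frames are included.

```python
# Innermost frames of the shutdown stack, copied from the debugger output above.
STACK = """\
00 fffff803`18859c10 fffff803`16eb5da4     hal!HalpPowerWriteResetCommand+0x10f
01 fffff803`18859c50 fffff803`16eb7381     hal!HalpInterruptResetThisProcessor+0x164
02 fffff803`18859c80 fffff803`16ebef4a     hal!HalpInterruptRebootService+0x41
03 fffff803`18859cb0 fffff803`166a21d0     hal!HalpPreprocessNmi+0x2a
04 fffff803`18859ce0 fffff803`165d9c02     nt!KiProcessNMI+0x30
05 fffff803`18859d30 fffff803`165d99c6     nt!KxNmiInterrupt+0x82
06 fffff803`18859e70 fffff803`16eb7bba     nt!KiNmiInterrupt+0x206
07 ffff998d`e405f720 fffff803`16eb7863     hal!HalpShutdown+0x2a
08 ffff998d`e405f780 fffff803`16eb7a5e     hal!HalReturnToFirmware
09 ffff998d`e405f7b0 fffff803`169805ce     hal!HalpLegacyShutdown+0xe
0a ffff998d`e405f7e0 fffff803`1698033a     nt!PopHandleNextState+0x1ee
"""


def call_sites(kd_stack):
    """Return (module, function) pairs, innermost frame first."""
    sites = []
    for line in kd_stack.splitlines():
        fields = line.split()
        # Skip blank lines and the column header, which has no module!function token.
        if len(fields) < 4 or "!" not in fields[-1]:
            continue
        module, _, rest = fields[-1].partition("!")
        function = rest.split("+")[0]  # drop the +0x... offset if present
        sites.append((module, function))
    return sites


sites = call_sites(STACK)
# Print from the outermost listed frame down to the one that stopped.
for module, function in reversed(sites):
    print(f"{module}!{function}")
```

Read top to bottom, the output goes from the power state handling in nt through hal!HalReturnToFirmware and the NMI handling down to hal!HalpPowerWriteResetCommand.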

Thanks
