VMware Cloud Community
pscott2
Contributor

Blue Screens when suspending/resuming Windows Hyper-V VMs in ESXi 5.5

We use ESXi to virtualize ESXi and Windows Hyper-V machines for training scenarios. We have dozens (even hundreds) of these VMs in use across a large farm of ESXi hosts at any given time. Students can suspend their VMs at any time and return to them later. This worked flawlessly on ESXi 5.1. Our troubles began when we upgraded to ESXi 5.5: after the upgrade, the Windows Hyper-V guests began crashing with a blue screen at some point during the suspend/resume process. We reverted to 5.1 on two of our boxes and the problem stopped. We may be forced to revert all servers to 5.1, but we are really hoping to find a fix or workaround. We are opening a case with VMware, but thought we’d ask here as well.


Observations

  • We are running the latest version of ESXi 5.5, build 1892794.
  • Guest BSODs have occurred on all ESX hosts we have. We have two server hardware configurations that are very similar (details below) and the BSODs occur with equal frequency on both.
  • BSODs happen in both Windows 2012 R2 and Windows 2008 R2 VMs.
  • EDIT: BSODs occur only when both “hypervisor.cpuid.v0 = FALSE” is configured and the Hyper-V role is installed. No BSODs occur if either the Hyper-V role or the “hypervisor.cpuid.v0 = FALSE” setting is removed.
  • The “hypervisor.cpuid.v0 = FALSE” setting makes Hyper-V think it is running on native hardware. Without it, Windows knows it is running in a VM and the BSODs go away, but Hyper-V then refuses to start nested VMs. I wish we didn't need the setting, but, alas, we do.
  • We've never seen a BSOD that wasn't triggered by a suspend/resume.
  • Not every suspend/resume causes a BSOD. During tests, we see a BSOD about 20% of the time that a suspend/resume is performed.
  • This occurs on ESXi hosts whether they are busy or not. We took a server out of rotation for isolated testing and saw the exact same behavior.
  • There are various BSOD messages. The most common is “IRQL_GT_ZERO_AT_SYSTEM_SERVICE” with a stop code of 0x0000004A. But there are several others.
  • We thought this might have something to do with the Intel E5 processor bug (http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=207379...). However, we are using E5 V1 processors, not V2 processors. We updated to the latest Dell BIOS (2.2.3) just in case. It made no difference.
  • We tried disabling all power management settings in the server BIOS, as well as within ESXi. This included setting everything to maximum performance and disabling C-states. No difference.
  • We disabled every feature available in the BIOS of the VMs, including all the caching options. No difference.
  • We use the following settings to enable nested virtualization (see the VMX excerpt just after this list):
    • CPU/MMU virtualization set to hardware (“Hardware CPU and MMU”)
    • vhv.enable = TRUE
    • hypervisor.cpuid.v0 = FALSE
  • We tried using “windowsHyperVGuest” as the guest OS identifier instead of the above settings. Nested virtualization worked fine, but the BSODs still occurred at the same rate.
  • EDIT: We tried upgrading the VMs from hardware version 9 to hardware version 10. This didn't help.
  • EDIT: We tried upgrading VMware Tools from version 9.0.0.782409 to 9.4.6.1770165. This didn't help.
  • We tried enabling CPU performance counter virtualization in the VMs. No help.
  • We reverted two of our servers to 5.1 and the BSODs went away completely. No other changes were made to the servers or the VMs. Just reverting to 5.1 fixed the problem.
  • We noticed a dramatic difference in the amount of time it takes the two versions of ESXi to suspend these VMs. ESXi 5.1 suspends these VMs in 2-3 seconds, while ESXi 5.5 takes 30 seconds to a minute. Something very different is definitely occurring during the suspend process. Suspend times in 5.5 are longer whether or not a BSOD occurs.
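
For completeness, here is roughly what those options look like in the VM's .vmx file. This is a trimmed excerpt, not our full config, and the two monitor.* lines are, as far as we can tell, how the hardware CPU/MMU selection is stored; when we tested the “windowsHyperVGuest” identifier instead, the guestOS line was set to that value.

    vhv.enable = "TRUE"
    hypervisor.cpuid.v0 = "FALSE"
    monitor.virtual_mmu = "hardware"
    monitor.virtual_exec = "hardware"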

Our Hardware

  • Dell PowerEdge R720xd OR Dell PowerEdge R620
  • BIOS Version: 2.2.3
  • CPUs: 2 - Intel Xeon E5-2670 0 @ 2.60GHz
  • RAM: 384GB (24 matched Dell 16GB DDR3 Synchronous Registered (Buffered) DIMMs)
  • Controller: PERC H710P Mini 1GB NVRAM
  • OS Drive: 2 - 240GB Intel S3500 SSDs in a RAID 1 mirror (Slots 00 - 01)
  • VM Storage Drive: 7 - 480GB Intel S3700 SSDs in RAID 5 with a dedicated hot spare (Slots 02 - 09)
  • 1 - Intel 2P X540/2P I350 rNDC
  • 1 - Intel Gigabit 4P I350-t Adapter

The obvious fix is to roll back to 5.1. However, we have completely unrelated issues with 5.1 (a topic for a different post) that we don’t have with 5.5 and would rather stay with 5.5. But these BSODs are a deal-breaker. Any help or direction in resolving the BSODs will be greatly appreciated!

1 Solution

Accepted Solutions
admin
Immortal

In some cases, one byte stored in the checkpoint file is stale.  This only affects nested guests.  When resuming a Hyper-V management OS from a corrupted checkpoint, any of a variety of BSODs may result.

You should file a support request for an express patch with a fix for PR 1289485.

19 Replies
admin
Immortal

Can you post a vmware.log file for one of the VMs?

pscott2
Contributor

Log files for a VM that BSOD'd are now attached to the original post. There are 3 files (vmware.log, vmware-1.log, and vmware-2.log), zipped into vmware.log.zip. The BSOD occurred on the second suspend/resume cycle. I powered off the machine after the BSOD, but other than that, there was no further activity on the machine. Happy to provide any other evidence that might be helpful.

admin
Immortal

It looks like you have VMware Tools installed in the VM.  The BSOD seems indicative of an in-guest driver bug, and it could be related to VMware Tools.

Can you try a VM without VMware Tools installed?

If you would be willing to share a memory.dmp, that may also help shed some light on the issue.  I'll send you a PM with instructions.

pscott2
Contributor

Dump file uploaded. Thanks so much for taking a look!! The dump file isn't from the same VM that the vmware.log file came from, but it occurred under the exact same conditions with the same BSOD message.

I removed VMware Tools entirely and can still BSOD the VMs by suspending/resuming.

admin
Immortal

The memory.dmp that you uploaded implicates vmtoolsd.exe (or more specifically, a supporting kernel extension).  Can you upload a memory.dmp from a BSOD with VMware Tools removed?

admin
Immortal

Actually, I think this may be an issue with the backwards-compatibility checkpointing of virtual hardware version 9.  Have you tried upgrading your VMs to virtual hardware version 10?

pscott2
Contributor

Yes, I'm afraid we tried upgrading to hardware version 10. It didn't help. Sorry, I should have mentioned that in the original post. Would it help to see a memory dump or log file from a v10 VM?

pscott2
Contributor

I've uploaded "BSOD-NoVMwareTools-HardwareV9.zip" to the location you provided. It contains the vmware.log files and memory.dmp file for the VM after VMware Tools were completely removed. The VM was using hardware version 9.

I upgraded the VM to hardware version 10 and still got BSODs. I can provide log and dump files for that as well if you think they'll be helpful.

Thanks again for your assistance!

admin
Immortal

pscott2 wrote:

  • Guest BSODs occur whether or not nested VMs are present in the guest. But only VMs configured with “hypervisor.cpuid.v0 = FALSE” in their VMX BSOD.

I suspect that this observation is misleading.  Is it the case that only VMs configured with “hypervisor.cpuid.v0 = FALSE”  and with the Hyper-V role installed BSOD?  When the Hyper-V role is active, there will always be at least one nested VM present in the guest: the management OS.

The management OS runs as a privileged guest under Hyper-V, and I believe there may be some issues regarding resumption from a checkpoint taken while a privileged nested VM is running.  Still investigating...

pscott2
Contributor

Good point! I removed the Hyper-V role and can no longer trigger a BSOD. I did this for both Windows 2008 R2 and Windows 2012 R2 VMs. When “hypervisor.cpuid.v0 = FALSE” is present and the Hyper-V role is installed, BSODs occur. Removing either “hypervisor.cpuid.v0 = FALSE” or the Hyper-V role stops BSODs from occurring.

admin
Immortal

pscott2 wrote:

I've uploaded "BSOD-NoVMwareTools-HardwareV9.zip" to the location you provided. It contains the vmware.log files and memory.dmp file for the VM after VMware Tools were completely removed. The VM was using hardware version 9.

This is a slightly different BSOD (stop code 0xB8), and seems to implicate the E1000 driver.  There's no obvious correlation between the two.

pscott2
Contributor

Yes, the BSODs are not always the same. The first one I uploaded (0x4A) is the most common, but there are several others. The only constant is that they happen on 5.5 but never on 5.1 😞

Something changed in the way ESXi suspends and resumes VMs between 5.1 and 5.5. Even when we don't get blue screens, the process takes noticeably longer in 5.5.

admin
Immortal

Do you get similar BSODs if you take a snapshot of a running VM and revert to the snapshot?

pscott2
Contributor

I cannot trigger a BSOD by taking a snapshot of a running machine and reverting to that snapshot. I did 50 reversions, just to be safe. No BSODs. After that test passed, I did a suspend/resume and got a BSOD on the second attempt. So I think we can safely say this affects suspend/resume but NOT snapshot/revert.
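
In case it helps anyone reproduce this, here is a rough sketch of how a stress loop like that can be scripted with pyvmomi. It is not exactly what we run: the vCenter address, credentials, and VM name are placeholders, and certificate checking is disabled for lab use only.

    # Suspend/resume stress loop for a single VM (pyvmomi sketch).
    import ssl
    import time

    from pyVim.connect import SmartConnect, Disconnect
    from pyVim.task import WaitForTask
    from pyVmomi import vim

    ctx = ssl._create_unverified_context()        # lab only: skip certificate checks
    si = SmartConnect(host="vcenter.example.com",
                      user="administrator@vsphere.local",
                      pwd="password", sslContext=ctx)

    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    vm = next(v for v in view.view if v.name == "HyperV-Student-01")

    for cycle in range(1, 51):
        start = time.time()
        WaitForTask(vm.SuspendVM_Task())          # suspend writes the checkpoint
        suspend_secs = time.time() - start
        WaitForTask(vm.PowerOnVM_Task())          # powering on a suspended VM resumes it
        print("cycle %d: suspend took %.1f s" % (cycle, suspend_secs))
        time.sleep(120)                           # give the guest time to finish resuming (or BSOD)

    Disconnect(si)

Swapping the two task calls for CreateSnapshot_Task / RevertToCurrentSnapshot_Task covers the snapshot/revert case, and timing the suspend call also shows the 5.1 vs 5.5 slowdown I mentioned earlier.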

admin
Immortal

50 successful reversions of the same snapshot may simply be indicative of a problem on the save side rather than the restore side.

admin
Immortal

I believe I have discovered the problem.  Unfortunately, I do not think there is a workaround.

admin
Immortal

In some cases, one byte stored in the checkpoint file is stale.  This only affects nested guests.  When resuming a Hyper-V management OS from a corrupted checkpoint, any of a variety of BSODs may result.

You should file a support request for an express patch with a fix for PR 1289485.
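
Once the patch is applied, you can verify that each host is running the expected build before returning it to rotation. A minimal sketch using pyvmomi (the vCenter address and credentials are placeholders):

    # Print the ESXi version and build number of every host in the inventory.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    ctx = ssl._create_unverified_context()        # lab only: skip certificate checks
    si = SmartConnect(host="vcenter.example.com",
                      user="administrator@vsphere.local",
                      pwd="password", sslContext=ctx)

    content = si.RetrieveContent()
    hosts = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in hosts.view:
        product = host.summary.config.product     # vim.AboutInfo: version, build, fullName
        print("%s: ESXi %s build %s" % (host.name, product.version, product.build))

    Disconnect(si)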

pscott2
Contributor

Thanks so much for spending time on this! I've requested an express patch as you recommended.

admin
Immortal

Thanks for bringing this issue to my attention.  I'm sorry I was unable to provide a workaround.
