We use ESXi to run nested ESXi and Windows Hyper-V machines for training scenarios. We have dozens (even hundreds) of these VMs in use across a large farm of ESXi hosts at any given time. Students are able to suspend their VMs at any time and return to them later.

This worked flawlessly on ESXi 5.1. Our troubles began when we upgraded to ESXi 5.5: the Windows Hyper-V guests started crashing with a blue screen at some point during the suspend/resume process. We reverted two of our hosts to 5.1 and the problem stopped.

We may be forced to revert all servers to 5.1, but we are really hoping to find a fix or workaround. We are opening a case with VMware, but thought we'd ask here as well.
Observations
- Guest BSODs occur whether or not nested VMs are present in the guest. But only VMs configured with “hypervisor.cpuid.v0 = FALSE” in their VMX BSOD.
Our Hardware
The obvious fix is to roll back to 5.1. However, we have completely unrelated issues on 5.1 (a topic for a different post) that we don't see on 5.5, so we would much rather stay on 5.5. But these BSODs are a deal-breaker. Any help or direction in resolving the BSODs will be greatly appreciated!
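In case it helps, the repro is nothing more exotic than repeatedly suspending and resuming an affected VM. Here is a rough sketch of how such a loop could be driven with pyVmomi; the host name, credentials, VM name, and cycle count are placeholders, not our actual tooling.

```python
# Rough repro sketch (illustrative): suspend/resume an affected VM in a loop.
# pyVmomi is assumed; host, credentials, VM name, and counts are placeholders.
import ssl
import time

from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

si = SmartConnect(host="esxi55-host.example.com", user="root", pwd="secret",
                  sslContext=ssl._create_unverified_context())
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    vm = next(v for v in view.view if v.name == "hyperv-training-01")
    view.DestroyView()

    for cycle in range(1, 6):
        WaitForTask(vm.SuspendVM_Task())   # suspend writes the checkpoint
        WaitForTask(vm.PowerOnVM_Task())   # resume restores it
        time.sleep(120)                    # give the nested Hyper-V management OS time to run (or BSOD)
        print("cycle %d: power state %s" % (cycle, vm.runtime.powerState))
finally:
    Disconnect(si)
```

In our environment the guest usually blue-screens within the first few cycles.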
Can you post a vmware.log file for one of the VMs?
Log files for a VM that BSOD'd are now attached to the original post. There are 3 files (vmware.log, vmware-1.log, and vmware-2.log), zipped into vmware.log.zip. The BSOD occurred on the second suspend/resume cycle. I powered off the machine after the BSOD, but other than that, there was no further activity on the machine. Happy to provide any other evidence that might be helpful.
It looks like you have VMware Tools installed in the VM. The BSOD seems indicative of an in-guest driver bug, and it could be related to VMware Tools.
Can you try a VM without VMware Tools installed?
If you would be willing to share a memory.dmp, that may also help shed some light on the issue. I'll send you a PM with instructions.
Dump file uploaded. Thanks so much for taking a look!! The dump file isn't from the same VM that the vmware.log file came from, but it occurred under the exact same conditions with the same BSOD message.
I removed VMware Tools entirely and can still BSOD the VMs by suspending/resuming.
The memory.dmp that you uploaded implicates vmtoolsd.exe (or more specifically, a supporting kernel extension). Can you upload a memory.dmp from a BSOD with VMware Tools removed?
Actually, I think this may be an issue with the backwards-compatibility checkpointing of virtual hardware version 9. Have you tried upgrading your VMs to virtual hardware version 10?
Yes, I'm afraid we tried upgrading to hardware version 10. It didn't help. Sorry, I should have mentioned that in the original post. Would it help to see a memory dump or log file from a v10 VM?
I've uploaded "BSOD-NoVMwareTools-HardwareV9.zip" to the location you provided. It contains the vmware.log files and memory.dmp file for the VM after VMware Tools were completely removed. The VM was using hardware version 9.
I upgraded the VM to hardware version 10 and still got BSODs. I can provide log and dump files for that as well if you think they'll be helpful.
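For reference, the hardware upgrade itself was just the standard virtual hardware upgrade with the VM powered off; a rough pyVmomi sketch of that step (host and VM names are placeholders, not our exact script) would look like this:

```python
# Sketch of the virtual hardware upgrade step (illustrative; pyVmomi assumed,
# host/VM names are placeholders). The VM must be powered off before upgrading.
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

si = SmartConnect(host="esxi55-host.example.com", user="root", pwd="secret",
                  sslContext=ssl._create_unverified_context())
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    vm = next(v for v in view.view if v.name == "hyperv-training-01")
    view.DestroyView()

    if vm.runtime.powerState == vim.VirtualMachinePowerState.poweredOn:
        WaitForTask(vm.PowerOffVM_Task())             # upgrade requires the VM to be off
    WaitForTask(vm.UpgradeVM_Task(version="vmx-10"))  # virtual hardware version 10
finally:
    Disconnect(si)
```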
Thanks again for your assistance!
pscott2 wrote:
- Guest BSODs occur whether or not nested VMs are present in the guest. But only VMs configured with “hypervisor.cpuid.v0 = FALSE” in their VMX BSOD.
I suspect that this observation is misleading. Is it the case that only VMs configured with “hypervisor.cpuid.v0 = FALSE” and with the Hyper-V role installed BSOD? When the Hyper-V role is active, there will always be at least one nested VM present in the guest: the management OS.
The management OS runs as a privileged guest under Hyper-V, and I believe there may be some issues regarding resumption from a checkpoint taken while a privileged nested VM is running. Still investigating...
Good point! I removed the Hyper-V role and can no longer trigger a BSOD. I did this for both Windows 2008 R2 and Windows 2012 R2 VMs. When “hypervisor.cpuid.v0 = FALSE” is present and the Hyper-V role is installed, BSODs occur. Removing either “hypervisor.cpuid.v0 = FALSE” or the Hyper-V role stops BSODs from occurring.
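For context, the relevant lines from the .vmx of an affected guest look roughly like this (an illustrative excerpt, not a verbatim copy of one of our files; apart from hypervisor.cpuid.v0, these are just the typical settings for running Hyper-V nested):

```
# Excerpt of a typical .vmx for a nested Hyper-V guest (illustrative).
virtualHW.version = "9"          # we also reproduced the BSODs after upgrading to "10"
vhv.enable = "TRUE"              # expose hardware-assisted virtualization to the guest
hypervisor.cpuid.v0 = "FALSE"    # hide the hypervisor CPUID bit so the Hyper-V role will install
```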
pscott2 wrote:
I've uploaded "BSOD-NoVMwareTools-HardwareV9.zip" to the location you provided. It contains the vmware.log files and memory.dmp file for the VM after VMware Tools were completely removed. The VM was using hardware version 9.
This is a slightly different BSOD (stop code 0xB8), and seems to implicate the E1000 driver. There's no obvious correlation between the two.
Yes, the BSODs are not always the same. The first one I uploaded (0x4A) is the most common, but there are several others. The only correlation is that they happen in 5.5, but never in 5.1 😞
Something changed in the way ESXi suspends and resumes VMs between 5.1 and 5.5. Even when we don't get blue screens, the process takes noticeably longer on 5.5.
Do you get similar BSODs if you take a snapshot of a running VM and revert to the snapshot?
I cannot create a BSOD by taking a snapshot of a running machine and reverting to the snapshot. I did 50 reversions, just to be safe. No BSODs. After that test passed, I suspend/resumed and got a BSOD on the second try. So, I think we can safely say this affects suspend/resume but NOT snapshot/revert.
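For what it's worth, here is roughly how that revert loop was driven; as with the earlier sketch, pyVmomi is assumed and the host/VM names and iteration count are placeholders rather than our exact script.

```python
# Sketch of the snapshot/revert stress test (illustrative; pyVmomi assumed,
# host/VM names and the iteration count are placeholders).
import ssl
import time

from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

si = SmartConnect(host="esxi55-host.example.com", user="root", pwd="secret",
                  sslContext=ssl._create_unverified_context())
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    vm = next(v for v in view.view if v.name == "hyperv-training-01")
    view.DestroyView()

    # One memory snapshot of the running VM, then revert to it repeatedly.
    WaitForTask(vm.CreateSnapshot_Task(name="bsod-test", description="",
                                       memory=True, quiesce=False))
    for i in range(1, 51):
        WaitForTask(vm.RevertToCurrentSnapshot_Task())
        time.sleep(60)  # let the nested Hyper-V management OS settle
        print("revert %d: power state %s" % (i, vm.runtime.powerState))
finally:
    Disconnect(si)
```

This never produced a BSOD, while the suspend/resume loop from my first post did within a couple of cycles.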
50 successful reversions of the same snapshot may simply indicate that the problem is on the save side rather than the restore side: reverting repeatedly exercises the restore path against a single saved state, whereas every suspend writes a fresh checkpoint.
I believe I have discovered the problem. Unfortunately, I do not think there is a workaround.
In some cases, one byte stored in the checkpoint file is stale. This only affects nested guests. When resuming a Hyper-V management OS from a corrupted checkpoint, any of a variety of BSODs may result.
You should file a support request for an express patch with a fix for PR 1289485.
Thanks so much for spending time on this! I've requested an express patch as you recommended.
Thanks for bringing this issue to my attention. I'm sorry I was unable to provide a workaround.