I've had a problem where all the Windows Server 2003 R2 VMs running on a host at the time blue-screened with:
*** Hardware Malfunction
Call your hardware vendor for support
*** The system has halted ***
Interestingly, the vCenter Appliance that was on the same box and running at the time seemed to be unaffected. The only thing I can attribute it to is exporting system logs from the client at around that time. The events tab for the VMs only shows the reset I did after it occurred; there are no other events from around that time. Is this a known issue?
Only 2003 R2? Did you have any other VM running on that machine or cluster?
Did you have a storage outage? A disk timeout?
Did you check the windows event log?
Frank
What version of vSphere?
Are the hosts on the HCL?
Were all the blue-screened servers on the same node?
Default gateway IP conflict?
Power spike?
Are the hosts blades? Does the enclosure work correctly? Is the backplane OK?
What storage are you using? How many LUNs, and what are their sizes?
Aleph0
Only 2003 R2? Did you have any other VM running on that machine or cluster?
The only other VM that was running was the vSphere Virtual Appliance which is Linux based.
Did you have a storage outage? Disk Timeout!
Not that I'm aware of, and if I did, why wouldn't it affect the vSphere VM?
Did you check the windows event log?
Yes, there was nothing in there. As multiple systems were affected I'm looking at common factors starting with VMware.
What version of vSphere?
5.1
Are the hosts on the HCL?
If you mean 'is the host running on supported hardware' then yes and it has been working fine on it for months.
Were all the blue-screened servers on the same node?
Yes, this all occurred on the same host.
Default gateway IP conflict?
I don't see how a gateway IP address conflict would trigger a blue screen. I'm not aware of any IP conflict.
Power spike?
No, this didn't affect the host or other physical machines using the same power supply.
Are the hosts blades? Do the enclosure works correctly? Is the backplane OK?
What storage are you using? How many LUNs, and what are their sizes?
The host hardware looks fine, I don't think that this is a hardware issue.
I asked about an IP conflict because I've seen the same issue caused by a default gateway IP conflict...
Could you check that host with Memtest86 for at least 72 hours to see if a DIMM in the server has an issue?
Check if this applies to you: http://www.tricksguide.com/blue-screen-error-hardware-malfunction-pci-express-error-hp-proliant-serv...
Were those servers physical and then converted to virtual?
The hardware isn't HP but interestingly one of the VMs does have Kaspersky AV which was mentioned in the article, although there were no errors reported by it from around the time of the problem.
ESXi 5.1 requires the NX/XD bit to be enabled for the CPU in the BIOS.
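As a quick sanity check on the NX/XD point, the flag shows up in `/proc/cpuinfo` on any Linux machine (for instance, the shell of the Linux-based vCenter Appliance mentioned above). This is only a sketch of a guest-side check; the authoritative setting is in the host's BIOS:

```shell
# Look for the "nx" CPU flag in /proc/cpuinfo (Linux only).
# If the BIOS disables NX/XD, the flag is not exposed to the OS.
grep -qw nx /proc/cpuinfo && echo "NX/XD supported" || echo "NX/XD not reported"
```

If this prints "NX/XD not reported" inside a guest while the host CPU is known to support it, the BIOS setting (or the VM's CPU masking) is worth a look.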
None of the affected VMs have been P2Ved.
Are there multiple ESX hosts? If so, were any vMotions occurring when the BSODs happened? Sometimes, if a vMotion doesn't complete properly, a BSOD can result.
Were there any changes to the network either on the host(s) or on the switch side?
Like aleph0 said, an issue with shared storage where the VMs live could certainly cause a BSOD.
- Ben
There weren't any vMotions taking place around that time, and no network changes. As I said, the only thing I did around that time was export system logs.
Take a look at the following KB: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=200571...
Perhaps you selected 'Select the system log manifest group HungVM' when collecting logs.
Regards,
Daniel
