Had an interesting problem last night... After a user complained about a system being offline, I found that none of my VMs were accessible.
I was able to connect to the 2 host systems directly using the vSphere Client; all VMs showed as "running" but would not respond. Trying to open a console would hang the client instance.
I checked the performance tab, which showed some interesting patterns across the various servers, with the vCenter VM pegged at 100% CPU.
Attempts to stop the vCenter server through the vSphere Client failed.
I followed KB 1004340 to try to kill the unresponsive vCenter VM, with no luck, then restarted the management agents per KB 1003490, also without success. From the CLI on both hosts, I tried stopping and then killing my VMs, all to no avail. I finally had to hard power down both hosts.
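For the record, the sequence I ran from the service console was roughly the following (a sketch of the steps from those two KBs as they apply to classic ESX 4.x, not an exact transcript; the guard at the top makes it a no-op on anything that isn't an ESX host):

```shell
#!/bin/sh
# Rough sketch of the service-console steps from KB 1003490 / KB 1004340
# on classic ESX 4.x. Guarded so it does nothing off an ESX host.
if command -v vmware-cmd >/dev/null 2>&1; then
    # Restart the management agents (KB 1003490)
    service mgmt-vmware restart
    service vmware-vpxa restart

    # Ask each registered VM for a hard power-off (KB 1004340)
    vmware-cmd -l | while read -r vmx; do
        vmware-cmd "$vmx" stop hard
    done

    # Last resort: list running VM world IDs, then kill a stuck one by ID
    vm-support -x
    # vm-support -X <wid>   # destructive -- forcibly kills that VM's world
else
    echo "Not on an ESX service console; skipping."
fi
```

In my case even the vm-support kill didn't free the stuck worlds, which is why I ended up hard-powering the hosts.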
It was not a good night.
Anyway, I managed to get everything back online, but now, going through the logs, I'm not sure where to begin tracking down this fault. I'm really looking for some pointers on where to start my hunt...
Windows logs show nothing besides an "Unexpected shutdown" on all VMs, including the vCenter server.
The last entry in the vCenter Tasks and Events was a "Create virtual machine snapshot" task, about 5-10 minutes prior to the failure... In fact, it looks like it may have tried to create 4 VM snapshots within a 1-minute timespan... Red herring, or should I start looking here?
Thanks for any help you can provide!
If your VMs crashed and multiple ESX hosts were affected, I would probably start looking at storage issues.
Is it possible that your storage had a panic of sorts?
Were all VMs hosted on the same storage by any chance?
They are all hosted on the same storage device, so I guess I'll start looking there. The SAN did not exhibit any bizarre behaviour, and nothing was done with it or to it before or after the hosts stopped responding.
Do you know which ESX log I should be looking into to diagnose?
/var/log contains all the relevant log files on the ESX host. I think you need to check the vmkernel logs ( /var/log/vmkernel* ), the service console messages ( /var/log/messages* ), and the per-VM log files ( /vmfs/volumes/<datastore name>/<VM Name>/vmware.log ) to see what really happened to the VMs.
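A quick grep for storage noise is usually the fastest first pass. The log lines below are invented samples of the kind of SCSI/NMP messages worth looking for, not real output; on your host you'd run the same grep against /var/log/vmkernel* instead of the sample file:

```shell
#!/bin/sh
# Demonstration only: write a made-up vmkernel excerpt, then grep it for
# storage-related trouble. On a real host, grep /var/log/vmkernel* instead.
cat > /tmp/vmkernel.sample <<'EOF'
Jan 10 22:41:03 vmkernel: 12:03:44:10.123 cpu2:4099)ScsiDeviceIO: SCSI cmd failed on path vmhba1:C0:T0:L0
Jan 10 22:41:05 vmkernel: 12:03:44:12.456 cpu2:4099)NMP: nmp_DeviceRetryCommand: Retry world failover
Jan 10 22:41:09 vmkernel: 12:03:44:16.789 cpu0:4101)VMNIX: normal, unrelated message
EOF

# Pull out SCSI failures, aborts, and multipath/failover chatter
grep -iE 'scsi|nmp|failover|abort' /tmp/vmkernel.sample
```

If the VMs all stalled at once, you'd expect a burst of messages like the first two around the time of the hang, often on every LUN backed by the same array.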
The problem ended up being a combination of 2 issues...
1. Potential incompatibility between ESX 4.1.0 and the SAN's current patch level.
2. SAN's boot module (a USB key!) was not functioning properly and was causing SAN reliability issues.
Thanks for everyone's input; it turns out VMware is still stable and awesome. As for my SAN...