VMware Cloud Community
branfarm1
Contributor

Repeated VM power on/power off causes ESX cluster crash

Has anyone out there experienced issues with repeatedly powering a VM on and off, followed by an ESX cluster "meltdown"?

Here's the situation: We have 3 clustered ESX 3.0.2 hosts, connected to an iSCSI SAN. A user was experimenting with a Linux virtual machine, "hammering" the box (compiling, building, etc.) and powering the VM off and back on whenever it crashed unexpectedly. He came over to tell me that his machine wouldn't power off anymore. Sure enough, it was stuck at 80% during a power off and hadn't moved for quite some time. I attempted to kill that VM's process, and that seemed to start what I can only describe as a meltdown. After I forcibly killed the VM process, I started seeing really odd VIC behavior and extremely slow response times. Each of the hosts lost connectivity with the iSCSI SAN (logs confirmed that they weren't receiving heartbeats from the target), and attempting to browse to /vmfs/volumes resulted in a complete hang of the SSH session. Eventually all 3 machines became unmanageable, and we were forced to power off each of the ESX hosts and bring them back online one at a time.

Here's some important information:

1. This is the second time this has happened, and coincidentally, both instances occurred when this user was "playing around" with a Linux VM.

2. The Linux VM in question did not have VMware Tools installed.

3. The user was admittedly hammering the VM with high CPU/Mem utilization.

4. Even when the ESX hosts were reporting loss of connectivity to the iSCSI array, some hosts running on the iSCSI LUNs were still functioning fine.

I'm just curious whether anyone has any insight into what might have occurred here.

Thanks for the help!

2 Replies
GBromage
Expert

This might be part of the virtual hardware emulation.

After all, if you have a physical server and you power cycle it several times rapidly, you'll blow the power supply. This might be the VMware implementation of the same thing. :-)

Depending on what he was doing, the high memory utilization might have been related to not having the VM tools installed. I suspect that the "meltdown" was caused by killing the process. If the VMware server was wedged in an odd state, then killing it may have caused an error in the parent process (the VMware kernel) and/or left parts of the VMFS file system locked, which would have affected scanning from the other hosts.

Pure conjecture on my part, but it's a workable theory.

I hope this information helps you. If it does, please consider awarding points with the 'Helpful' or 'Correct' buttons. If it doesn't help you, please ask for clarification!
branfarm1
Contributor

Thanks for the response. I think your theory is quite possible. If that's the case, though, what's the best approach to take with an errant VM?
