'Proper' way to deal with an isolated vSphere host?

hostasaurus · 07-22-2015

This does not happen often, but often enough to be extremely annoying and time consuming. We have a vCenter 5.5 Enterprise Plus setup with ~20 ESXi hosts. Every so often, a host will become isolated for reasons unknown. VMware support believes it to be related to firmware versions on the hosts' virtual NICs, as they're part of a Cisco UCS setup, so that's being worked on independently. Anyway, when this occurs, the VMs will typically still be running fine. If I get on the console of the isolated host, sometimes it will let me log in, sometimes it won't, and sometimes it will but only after a solid 10 or 20 minutes of waiting after hitting enter at the password prompt. If I try to restart the management agents, it will usually show "Stopping..." but I've let it sit for hours after that and it has never progressed any further.

That being the case, our standard process when this occurs is to shut down all the VMs running on that host manually by logging into them directly and stopping the OS, since vCenter, vSphere and VMware Tools are all useless at this point from an automation standpoint. This is of course a pain and labor intensive. We have to do it, though, because if we just bounce the host, sometimes the guest filesystem will need repair when the guests come back up, or the guest won't restart on another node until you get in via SSH and clear a lock.
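For what it's worth, the manual shutdown step could be scripted along these lines. This is only a rough sketch, assuming the guests are Linux, accept key-based SSH as root, and are still reachable on the network; the function names and the guest names in the usage note are made up for illustration.

```python
import subprocess
from typing import Callable, Iterable, List


def build_shutdown_command(guest: str, user: str = "root") -> List[str]:
    """Return the ssh argv that asks one Linux guest to halt cleanly."""
    return [
        "ssh",
        "-o", "BatchMode=yes",       # fail instead of prompting for a password
        "-o", "ConnectTimeout=10",   # do not hang on an unreachable guest
        f"{user}@{guest}",
        "shutdown -h now",           # assumption: Linux guest
    ]


def shutdown_guests(
    guests: Iterable[str],
    run: Callable = subprocess.run,
) -> List[str]:
    """Ask each guest to shut down; return the guests that need a manual look.

    A guest that drops the SSH session while halting can also end up in the
    returned list, so treat it as "check by hand", not "definitely still up".
    """
    needs_attention = []
    for guest in guests:
        try:
            result = run(build_shutdown_command(guest), timeout=60)
            if result.returncode != 0:
                needs_attention.append(guest)
        except (subprocess.TimeoutExpired, OSError):
            needs_attention.append(guest)
    return needs_attention
```

Calling `shutdown_guests(["guest-a", "guest-b"])` with the names of the VMs on the isolated host would return whichever ones still need to be handled by hand. Windows guests would need a different remote command, so this only covers part of a mixed environment.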

Bounce the host, and the next big assortment of issues begins. First question: with the host reset, is there a way to speed up the HA election process among the remaining hosts, so we can more quickly arrive at the state where the VMs can be selected for power on? It seems to take forever if we leave the questionable host down, but it still takes a while even if we let it boot back up and rejoin.

In either case, once we're at the point of powering guests back on, which of course isn't automatic, we often run into each guest raising the "was this VM moved or copied?" question. We have to manually go into each one to answer. Is there any way to optimize that process?
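One way the answering step might be automated is through the vSphere API, where a blocked VM exposes the pending question as `runtime.question` and accepts a reply via `AnswerVM`. The sketch below assumes pyVmomi-style VM objects are passed in, and that "moved" (keep the existing UUID) is the right answer for a VM that simply restarted on another host; both are assumptions, and the helper names are made up for illustration.

```python
from typing import Iterable, List, Optional


def pick_choice_key(question, wanted: str = "moved") -> Optional[str]:
    """Return the key of the first answer whose label or summary mentions `wanted`.

    Matching on text avoids hard-coding a numeric choice; returns None when
    the pending question does not offer such an answer.
    """
    needle = wanted.lower()
    for info in question.choice.choiceInfo:
        text = f"{info.label} {info.summary}".lower()
        if needle in text:
            return info.key
    return None


def answer_moved_or_copied(vms: Iterable, wanted: str = "moved") -> List[str]:
    """Answer the pending question on each VM; return names of VMs answered."""
    answered = []
    for vm in vms:
        question = vm.runtime.question
        if question is None:
            continue  # nothing pending on this VM
        key = pick_choice_key(question, wanted)
        if key is None:
            continue  # some other question; leave it for a human
        vm.AnswerVM(questionId=question.id, answerChoice=key)
        answered.append(vm.name)
    return answered
```

The `vms` iterable would have to come from an existing vCenter session, for example a container view over all VirtualMachine objects, which is left out here. Questions that don't offer a matching answer are skipped on purpose so nothing unexpected gets answered blindly.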

Finally, another issue we see is networking not coming up as 'connected'. If you try to edit the settings and check "Connected", the error "Invalid configuration for device '0'" shows up, as described in this article:

VMware KB: Enabling a virtual NIC for a virtual machine in a vDS portgroup after a storage migration...

My solution to that is typically to power the guest back down gracefully, since Tools is now running and accessible, vMotion it to another host, and boot it back up, and then it seems to find a working port.

I would love any and all suggestions on how to better deal with the above issues. If we didn't need VM monitoring on some of the VMs, it almost seems like HA causes more downtime than it protects from, in an imperfect world where hosts may have issues occasionally.

Thanks!
