Trying to figure out some strange behavior with our ESX 3.0.1 hosts and VCenter 2.0.1 HA.
HA is configured on our cluster, all default settings.
What gets pinged, and by whom, to determine if a host is isolated? Does each VM on a host ping the Service Console gateway, or is the host itself doing the pinging?
Why would a host appear to be isolated during router maintenance? Especially when both the Service Console vSwitch and the VM data network vSwitch are teamed with two separate physical NICs, each NIC connected to a separate router, and only one router is undergoing maintenance?
If a host is deemed isolated and its VMs are powered down by HA, they are then migrated to other hosts, but not powered back up???
From what I understand - VC monitors all of this for HA. VC will initiate the failover.
I think it's supposed to work outside of VC, since it can restart a VirtualCenter VM running on an ESX server that fails.
Here is a list of the functions and how they are affected if the VirtualCenter server is down, per VMware documentation:
- VMotion will be unavailable, but any migrations in progress will continue
- Hosts configured for a VMware HA cluster will continue to function, although specific VM information on priority and isolation response is based on what was cached prior to the loss of the VirtualCenter server
- Hosts configured within a VMware DRS cluster will continue to function, but no resource optimization recommendations or load balancing that requires VMotion will be made
I believe the HA agents that run on each ESX server check in with each other, and if there is no communication back from a node after a certain amount of time (not sure of the exact value, I think 60 sec.), it is considered "Isolated". Since the HA agent runs in the Service Console, if the SC loses network connectivity HA will consider that node isolated. But this can happen for other reasons, such as an error with the HA agent.
There is a bunch of info that can help you see what is going on in /opt/LGTOaam512/log. If you look in there you might find some clues as to why it has failed.
It's more complex than just pinging another server. I'm still trying to figure out all the inner workings. If you look in /opt/LGTOaam512/bin you can see there are quite a few scripts in there that give some clues as to what is going on behind the scenes.
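If anyone else wants to poke at these, here is a minimal sketch of how I look through them. The /opt/LGTOaam512 paths are the ones mentioned above and only exist on an ESX 3.x host with HA configured; the grep pattern is just my guess at a useful keyword:

```shell
#!/bin/sh
# Sketch: inspect the AAM (HA agent) logs for isolation-related entries.
# The path is from this thread; it only exists on an ESX 3.x host with HA enabled.
AAM_LOG=/opt/LGTOaam512/log

if [ -d "$AAM_LOG" ]; then
    ls -lt "$AAM_LOG" | head -5                  # most recently written logs first
    grep -il "isolat" "$AAM_LOG"/* 2>/dev/null   # files that mention isolation
else
    echo "AAM log directory not found (not an ESX 3.x HA host?)"
fi
```

Running the same ls/grep combo in /opt/LGTOaam512/bin shows the scripts themselves.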
This helps a bit, thanks for the reply.
I too wish there was a bit more detailed info on just how HA works and how it determines host isolation. I was able to confirm in some other log files that only one of the two NICs in my Service Console team was down at any time, so I'm still very confused as to why HA thought the host was isolated.
As a side note, I disabled HA on my cluster (as we had more router maintenance last night) because I wanted to nip any VM issues in the bud before they happened. This morning I noticed an event in VirtualCenter on one of my servers: "Host is not responding". Again, I can't figure out why, as the NIC team shows the server should have had connectivity at all times during the router maintenance.
We experienced a similar issue; check out the configuration of your teamed NICs.
We had two NICs teamed for the Service Console, and during HA testing discovered that the default settings caused HA to invoke not when one of the NIC cables was pulled, but when it was replaced....bizarre, I know...
Further testing established that we had to apply specific settings to get HA to behave when you have teamed NICs.
I'll try and post an image or link to an image of the config.
That would be great if you could post any further info.
Also, how can I determine if HA kicked in at nic failure or at nic recovery? What logs will show dates and times of HA events? So far I haven't had much luck in linking events across multiple logs.
I think I may also have another possible cause to this, but am still working out the details. I'll post my findings here shortly....
I'm not sure if the screenshot will work, but here goes anyway....
We are still testing ourselves but found the above solved our issue.
I would recommend devising a number of scenarios and then testing each one to confirm the behavior and monitor both VC and the local HA logs.
I typically run a watch/tail on the log file vmware.log (no access at present to confirm the exact file name).
In addition, we have tested and confirmed that HA does work even if your VirtualCenter server is off the air or down.
Hope this helps...
What version of ESX and VCenter are you running?
We're at 3.0.1 and 2.0.1, respectively.
Here are some other details that I've gathered on our situation...
HA relies on being able to communicate with both the Service Console network gateway AND the other hosts participating in the cluster. The communication with the other hosts relies on DNS resolution of the FQDN of each host.
In our situation, we have a virtual HSRP gateway configured on our redundant routers, and in the event of a router failure (or in our case, a router reboot due to planned maintenance) that HSRP gateway can take up to 15 seconds to become active on the standby router. At that point our ESX hosts cannot get outside their VLAN because the gateway is down, and therefore cannot ping/communicate with the other hosts by DNS name (the ESX host can't talk to the DNS server because of the gateway disruption). And since we don't have the /etc/hosts files on our ESX servers set up with the FQDNs of the other servers in the cluster, HA cannot resolve the DNS names of the other servers. The ESX server is then deemed isolated and begins the VM shutdown process.
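As a quick sanity check for the scenario above, you can test whether each cluster member's name resolves on a given host. A name that only resolves via DNS will stop resolving when the gateway is down, so anything you want to survive an outage should be in /etc/hosts. The esx* names below are made-up placeholders, not real hosts:

```shell
#!/bin/sh
# Sketch: check whether each cluster-member name resolves on this host.
# getent consults /etc/hosts first, then DNS; the esx* names are hypothetical.
for h in localhost esx01.lab.invalid esx02.lab.invalid; do
    if getent hosts "$h" >/dev/null 2>&1; then
        echo "$h: resolvable"
    else
        echo "$h: NOT resolvable -- would fail during a gateway outage"
    fi
done
```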
I too see, based on the log entries, that HA kicks in and isolates our ESX server when the NIC link comes back up, not when it fails (a strange scenario indeed). Can you confirm that in your environment you have your /etc/hosts files set up to include all ESX servers in the cluster? Or do they only include the local host?
We are running :
ESX 3.0.1 Build 37303 (Though VC reports ESX 3.0.1 Build 35804 ? )
VC 2.0.1 Build 33643
At present we just have the FQDN of the ESX server itself, and we have added the VC server as we use an SMB mount point on that server.
How have you configured your network teaming for the Service Console?
Our Service Console network team is using the default setup.
I know you mentioned better results after changing the Service Console NIC teaming setup, but I wonder if we're both suffering from the same symptoms with regard to the DNS server availability I described in my last post.
If you are using HSRP gateways like us and don't have the FQDNs of each server in your /etc/hosts files, then when a router dies and the gateway fails over, the ESX server can't talk to the other ESX servers by DNS name and it thinks an isolation event is taking place.
I'll be updating our /etc/hosts files soon, and I believe our network people are planning more router maintenance next week. I'll leave my NIC team as-is and see if my server goes into isolation mode. If not, then the fix is as simple as adding entries to the /etc/hosts file.
HA works between the members of the HA cluster. That means VC is not required after HA is enabled (however, you need VC to enable it).
The members of the HA cluster each have a list of the cluster; you can find it at /etc/FT_HOSTS.
HA will use the FQDN and then cut off the domain, using only the short name. That is why it's always a good idea to add the short name of each server to your /etc/hosts file.
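So, carrying both forms for every cluster member, each host's /etc/hosts would look something like this (addresses and names here are made-up examples, not from anyone's actual setup):

```
# /etc/hosts on each cluster member (hypothetical addresses and names)
127.0.0.1     localhost.localdomain   localhost
192.168.10.11 esx01.example.com       esx01
192.168.10.12 esx02.example.com       esx02
192.168.10.13 esx03.example.com       esx03
```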
The members of the HA cluster monitor each other's heartbeats. If a host does not receive the heartbeat from the other members, it will try to ping the default gateway. If it cannot ping it (some of you may have ICMP disabled on the gateway, and that is a bad idea for HA), it then enters isolation mode, which means it starts to power down its VMs 15 seconds after the isolation.
A couple of points on this:
1) The previous post is exactly how I'd heard HA isolation is supposed to work.
2) You can specify an alternate address instead of the default gateway (if, for instance, ICMP is disabled on it)
3) There is a known issue where, if network connectivity is lost for between 12 and 14 seconds, HA will fail: the host will begin to power off the VMs, but the other hosts will not bring them online (because the 15-second window has not elapsed). This results in powered-off VMs because of a false isolation.
I never did work out why VMware chose not to attempt VMotion before a cold migration for isolation incidents (i.e., if a NIC in the server fails but the HBAs are still OK), but I'm sure there's a good reason - anyone?
VMotion, however, still uses the Service Console NIC to initialize. Both ESX servers need to see each other on their Service Console NICs; that is the first 10% of the VMotion process. It then switches to the VMotion NIC to transfer the data. If the Service Console loses its connection, VMotion will not happen.
In addition, VMotion needs VC, but HA does not.
That certainly explains it (and why VMotions get to 10% even when the VMkernel ports don't have connectivity). I always thought it was a bogus progress bar - seems I was wrong. Thanks for the info!
HA sometimes fails between 12 and 14 seconds. The reason is that one ESX server enters isolation mode and starts to power down its VMs; however, if the heartbeat is restored between seconds 12 and 14, the other members of the HA cluster will not consider that ESX server to have entered isolation mode, and that is why those VMs will not power back on.
VMware HA is actually the EMC Legato Automated Availability Management 5.1.2 product.
This product has customized scripts (by VMware) to manage the loss of a node in a cluster.
For redundancy, I highly recommend that you team two NICs for the Service Console and, if you can, plug those two network cables into two different network switches.
The ping is actually done on a multicast address, but I need to peek more under the hood to understand it, as I need to change that multicast address (a multicast MAC address is calculated from the multicast IP address).
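For reference, the IP-to-MAC mapping itself is mechanical: the IPv4 multicast MAC is the fixed prefix 01:00:5e followed by the low 23 bits of the multicast IP (the top bit of the second octet is dropped). A quick sketch, using 224.1.2.3 as an arbitrary example group address, not the one AAM actually uses:

```shell
#!/bin/sh
# Sketch: derive the Ethernet multicast MAC from an IPv4 multicast address.
# Rule: MAC = 01:00:5e + low 23 bits of the IP; 224.1.2.3 is just an example.
set -- $(echo 224.1.2.3 | tr '.' ' ')
printf '01:00:5e:%02x:%02x:%02x\n' $(( $2 & 0x7f )) "$3" "$4"
# prints 01:00:5e:01:02:03
```

One consequence: because only 23 bits survive, 32 different multicast IPs map to the same MAC, so picking a new group address doesn't always change the MAC.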
There is also a parameter that can be changed to increase the heartbeat time. This might also be useful for the situations described earlier in this discussion.