Solved: Re: Troubleshooting ESXi/HA Incident

tmancini · ‎01-25-2011

We had an ESXi server in a two-node cluster report a network loss on both teamed management NICs. This caused the VMs to be powered off because the host thought it was isolated.

The physical switches indicate there was no issue with the network but the hostd log shows otherwise.

There was about a 5-6 minute gap in the logs - apparently the server hung during this time period.

We are using vMA as the syslog server, but it does not seem to collect the aam logs for HA.

We are currently at a loss, along with support, as to what may have caused the issue and how the aam logs can be collected, since vilogger only pulls the hostd, messages and vpxa logs.

Just wondering where else to look for evidence of the cause of this issue and why the VMs did not get moved to the other host.

Thanks.

Josh26 · ‎01-25-2011

Hi,

As for why the NICs went offline I don't have any answers.

As for why the VMs didn't restart elsewhere, however, odds are this comes down to how the cluster is configured. Look at the HA configuration in the cluster. What is the "Host Isolation Response"?

depping · ‎01-26-2011

In this case both hosts were isolated. Because you used either "power off" or "shut down" as the isolation response, your VMs were turned off. However, as both hosts were isolated, neither was triggered to power up the VMs. I would recommend setting it to "leave powered on" to avoid issues like these.

Duncan (VCDX)

Available now on Amazon: vSphere 4.1 HA and DRS technical deepdive

tmancini · ‎01-26-2011

Thanks for the response.

I believe only one host was isolated and the isolation response was set to power off, so that is expected. I am just not sure why the guests were not then moved to the available node. This did not behave the same way as when it was tested by disabling the management network, which moved the VMs and did not power them off.

The problem with leaving VMs running when the host is isolated is that they cannot be managed. If the network deteriorates further for the affected host, they will be left running and disconnected with no possibility of being moved to another host - not a good situation.

Upon further investigation of the logs, it appears there is a several-minute gap from when this issue occurred, so nothing conclusive yet.

The isolation message came out of the vSphere client event log. "VM_server was powered off on the isolated host." Surprisingly, these messages are not recorded in a log when they are exported. At least, they could not be found by support. Luckily, they were copied just after the incident.

depping · ‎01-26-2011

I am not sure what you tested, but when HA initiates a failover the VMs are always powered off. HA simply restarts the VM and doesn't use vMotion, so what you mention, "when it was tested by disabling the management network which moved the VM's and not powered them off", sounds a bit odd to me.

I can understand the issues you have with leaving them powered on. However, in a two-host cluster the chances are high that you run into issues like these.

tmancini · ‎01-26-2011

Duncan, I believe you are correct. The VMs were powered off but were restarted on an available host when the management network was manually disabled. But what difference does it make that this is only a two-host cluster if there is no attempt to restart them? What configuration changes can be made to ensure the VMs get restarted on an available host?

This was not the expected response from HA and certainly not the way VMware markets it.

Unfortunately, with the gaps in the logs and the inability to have a syslog server save the HA logs, the cause of this issue may never be determined.

tmancini · ‎01-26-2011

Here's another thought: is it possible that the management network became available so quickly after the VMs were shut down that HA did not restart them because the nodes were communicating again?

This would seem to make sense, since manual intervention was required to get the VMs powered back on, and the hosts had nothing to do because the outage was not long enough for them to notice the change.

depping · ‎01-26-2011

That is indeed a possibility. If host "esx01" powers the VMs down at the 13th second, host "esx02" will power them on at the 16th second if it cannot ping host "esx01". If for whatever reason this ping does succeed, the VMs will not power on.

That is a 3 second gap you could fall into. Although the chances are tiny, I have seen this happen in the past. HA log files can be found in /var/log/aam/, by the way.
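To make the race window concrete, here is a minimal Python sketch of the logic described in this post. The function name `ha_outcome`, its parameters, and the exact boundary handling are illustrative assumptions, not how the HA agent is actually implemented; the 13 and 16 second marks are the ones quoted above.

```python
def ha_outcome(network_back_at, power_off_at=13, restart_check_at=16):
    """Classify what happens to the VMs for an outage of the given length.

    All times are seconds since the host became isolated. The defaults are
    the 13th and 16th second mentioned in the post; the comparisons at the
    boundaries are an assumption made for illustration.
    """
    if network_back_at < power_off_at:
        # Network returned before the isolated host acted.
        return "VMs keep running (isolation response never triggered)"
    if network_back_at < restart_check_at:
        # The isolated host already powered the VMs off, but the peer's
        # ping succeeds, so the peer sees no reason to restart them.
        return "VMs powered off, not restarted (peer ping succeeds)"
    # Peer ping fails, so the peer tries to restart the workload.
    return "VMs powered off, restarted on the other host"


for seconds in (10, 14, 20):
    print(seconds, "->", ha_outcome(seconds))
```

An outage that ends inside the window (14 seconds in the loop above) is the case that would match the symptoms in this thread: VMs off, nothing restarted, manual power-on needed.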

tmancini · ‎01-26-2011

Duncan, thank you for the response. It's good to have my suspicions confirmed.

Thanks for the path to the HA logs. I did check them previously but they had already rolled. This is a problem since vMA does not collect these logs.
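Since the logs roll before anyone can look at them, one workaround is to snapshot the directory right after an incident. Below is a rough Python sketch of that idea. It assumes a Python 3 interpreter is available somewhere that can read a copy of the log directory, which may not be true on the host itself; `snapshot_logs` and the destination path are hypothetical names.

```python
import os
import tarfile
import time


def snapshot_logs(src_dir, dest_dir):
    """Pack src_dir into dest_dir/<name>-<timestamp>.tar.gz and return the path."""
    os.makedirs(dest_dir, exist_ok=True)
    # Use the last path component (e.g. "aam") as the archive prefix.
    name = os.path.basename(os.path.normpath(src_dir))
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive_path = os.path.join(dest_dir, f"{name}-{stamp}.tar.gz")
    with tarfile.open(archive_path, "w:gz") as tar:
        # Store files under "<name>/..." rather than the absolute path.
        tar.add(src_dir, arcname=name)
    return archive_path
```

Something like `snapshot_logs("/var/log/aam", "/tmp/ha-snapshots")` would then keep a copy before the files roll; the destination here is only a placeholder.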

Also, regarding the gap in the logs: the vpxa log shows no gap, so the server was responding. It's only the hostd and messages logs that stopped logging.
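Comparing gaps across several logs by eye is tedious, so a small script can help. The Python sketch below reports any jump between consecutive timestamps that exceeds a threshold. The timestamp pattern (`YYYY-MM-DD HH:MM:SS` somewhere in each line) is an assumption and would need adjusting to the actual hostd, messages and vpxa formats; `find_gaps` is a hypothetical helper name.

```python
import re
from datetime import datetime

# Assumed timestamp shape; adjust to match the real log format.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")


def find_gaps(lines, threshold_seconds=60):
    """Return (previous, current, seconds) for each jump above the threshold."""
    gaps = []
    previous = None
    for line in lines:
        match = TIMESTAMP.search(line)
        if not match:
            continue  # skip lines without a recognisable timestamp
        text = match.group(0).replace("T", " ")
        current = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
        if previous is not None:
            delta = (current - previous).total_seconds()
            if delta > threshold_seconds:
                gaps.append((previous, current, delta))
        previous = current
    return gaps
```

Running it over each log file in turn shows which logs actually went quiet and for how long, which makes it easier to tell a hung service from a hung host.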

You mentioned a 3 second gap where this may have caused an issue. Do you know the timing of HA's expected response and the actions it is supposed to perform on VMs, or maybe where I can find this info?

Thanks for your help.

depping · ‎01-27-2011

It's all mentioned in my book but I will give it away for free ;-). (Check out my HA deepdive session at VMworld EMEA if you have access to it.) If all default values are kept, the following is the timing:

0 -> Host isolated

13 -> Isolated host pings isolation address

14 -> Isolation response triggered

15 -> remaining non-isolated hosts try to ping isolated host

16 -> if no response try restarting workload

So depending on how quickly things happen, the gap is roughly 3 seconds but probably slightly less.
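The arithmetic behind "roughly 3 seconds but probably slightly less" can be checked with a short Python sketch. The timeline values are the defaults listed above; the helper `second_of` and the choice of which two events bound the window are illustrative assumptions.

```python
# Default timeline from the post: (seconds since isolation, event).
TIMELINE = [
    (0, "Host isolated"),
    (13, "Isolated host pings isolation address"),
    (14, "Isolation response triggered"),
    (15, "Remaining non-isolated hosts try to ping isolated host"),
    (16, "If no response, try restarting workload"),
]


def second_of(fragment):
    """Return the second at which the event containing `fragment` occurs."""
    for second, event in TIMELINE:
        if fragment in event:
            return second
    raise KeyError(fragment)


# Window measured from the isolation-address ping to the restart attempt.
window_from_ping = second_of("restarting") - second_of("isolation address")
# Window measured from the isolation response itself.
window_from_response = second_of("restarting") - second_of("Isolation response")

print(window_from_ping)      # 3
print(window_from_response)  # 2
```

Measured from the ping the window is 3 seconds, and measured from the isolation response it is 2, which fits the "slightly less" remark.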

tmancini · ‎01-27-2011

Thanks for the info, Duncan. Much appreciated.

The SR has been passed to another engineer who has a good suggestion on how to monitor this issue.

I am going to add another isolation address and increase the timeout to help alleviate the issue should it recur.
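For reference, that change would roughly correspond to the cluster's HA advanced options sketched below in Python. The option names `das.isolationaddressN` and `das.failuredetectiontime` (in milliseconds) are as I understand them for vSphere 4.x and should be verified against the documentation; the helper `build_ha_options`, the addresses, and the 30 second value are placeholders, not settings taken from this environment.

```python
def build_ha_options(isolation_addresses, failure_detection_seconds):
    """Build a dict of HA advanced options (names assumed, see note above)."""
    if not isolation_addresses:
        raise ValueError("at least one isolation address is required")
    options = {
        # The failure detection time is expressed in milliseconds.
        "das.failuredetectiontime": str(failure_detection_seconds * 1000),
    }
    # Additional isolation addresses are numbered starting at 1.
    for index, address in enumerate(isolation_addresses, start=1):
        options[f"das.isolationaddress{index}"] = address
    return options


# Placeholder addresses from the documentation range, not real gateways.
options = build_ha_options(["192.0.2.1", "192.0.2.254"], 30)
for key in sorted(options):
    print(key, "=", options[key])
```

The values would still have to be entered in the cluster's HA advanced options; the dict is only a way to write the intended settings down in one place.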

depping · ‎01-28-2011

The chances of issues like this occurring will indeed decrease when you increase the time-out. I see many customers still using 30 seconds as a standard.
