We have 2 ESx 3.5 update 3 clusters in our environment. The clusters have HA and DRS. The isolation response is set to leave the VMs powered on. During a network outage, on one of the hosts, all the VMs had powered off. None of the VMs on other hosts powered off. The VMs had come back up on the other hosts in the cluster once the network was back up and VC was reachable.
Previous to the network maintenance, due to some issue, this particular host had disconnected from VC earlier. We could not connect to the server through VI Client and had restarted the hostd service. Before the network maintenance, we had identified that this host alone was not reporting any performance data to the VC server. Also a few tasks initiated on the host would go to 100% but never show completed.
My query is could the above issue have caused the VMs to reboot despite isolation response setting. Before the maintenance, I could find this host recieving heart beats from VC server and the other hosts in the cluster and VC was not showing any error related to HA on the cluster or the particular host.
We rebooted the host after the network maintenance and reconfigured HA on the cluster. Since then it has been working fine. We had another network maintenance and we had no issues with VMs rebooting.
Looks like your ha agent may have died along with your connection from the host to VC. There are several logs related to HA, under /var/log/vmware/aam. You can check those to see if they supply any further insight as to why HA acted differently on one host vs the others.
-KjB
VMware vExpert
Don't forget to leave points for helpful/correct posts.
Can someone clarify me please.
Are both ESX pointing to only one physical switch?
Out of 3 hosts in a cluster, the host which had the issue and another one are on same switch.
What kind off network outage was it?
So if the switch with all hosts connected on it was down, HA could not work because all network was lost.
Or have I missunterstodd you?
Apologies for delay in reply and not being clear in the first instance.
During the network maintenance, the switches were upgraded. There was a link outage twice of about 15-20 minutes during which the hosts were not able to connect to each other. Also this cluster is in our DR datacenter while our VC is located in the PROD datacenter. The link between those two were also updated and hence VC was not available during the maintencance.
My query is why the VMs only on the particular host I had mentioned in the OP rebooted while the ones on others did not. During the first outage of 20 min, the VMs on the particular host had powered off and were back up only after 2 hours on the other 2 hosts after VC was available.
Looks like your ha agent may have died along with your connection from the host to VC. There are several logs related to HA, under /var/log/vmware/aam. You can check those to see if they supply any further insight as to why HA acted differently on one host vs the others.
-KjB
VMware vExpert
Don't forget to leave points for helpful/correct posts.
Yes, we too suspect the same. Unfortunately, the logs previous to the date of network maintenance are not available. If HA had been lost during the first disconnection, the atleast VC must have shown a red triangle on the host indicating an error with HA but we did not find any. This why we are not able to find an exact root cause.
Did you not say in your post that the host remained disconnected until a reboot or maintenance was done on the server itself?
-KjB
VMware vExpert
Don't forget to leave points for helpful/correct posts.