VMware Cloud Community
arsudarsan
Contributor

VMs powered off, overriding the isolation response

We have two ESX 3.5 Update 3 clusters in our environment, both with HA and DRS enabled. The isolation response is set to leave the VMs powered on. During a network outage, all the VMs on one of the hosts powered off. None of the VMs on the other hosts powered off. The affected VMs came back up on the other hosts in the cluster once the network was restored and VC was reachable.

Before the network maintenance, this particular host had disconnected from VC because of an unidentified issue. We could not connect to the server through the VI Client and restarted the hostd service. We had also noticed, before the maintenance, that this host alone was not reporting any performance data to the VC server, and a few tasks initiated on the host would reach 100% but never show as completed.

My question is: could the above issue have caused the VMs to reboot despite the isolation response setting? Before the maintenance, I could see this host receiving heartbeats from the VC server and the other hosts in the cluster, and VC was not showing any HA-related error on the cluster or on this particular host.

We rebooted the host after the network maintenance and reconfigured HA on the cluster. Since then it has been working fine. We have had another network maintenance since, with no VMs rebooting.


Accepted Solutions
kjb007
Immortal

It looks like your HA agent may have died along with the host's connection to VC. There are several HA-related logs under /var/log/vmware/aam. You can check those to see if they offer any further insight into why HA acted differently on one host than on the others.
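
If it helps, here is a rough sketch of one way to sift through those logs. It is only an illustration: the keyword list is a guess rather than anything taken from VMware documentation, the function name scan_logs is made up for this example, and no particular file names inside the directory are assumed. It is meant to be run on a workstation against a copy of the log directory, since the script needs Python 3.

```python
import os
import re

# Assumed keywords of interest; these are guesses, not documented HA log strings.
KEYWORDS = re.compile(r"isolat|power(ed|ing)?\s*off|shutdown|error|fail", re.IGNORECASE)


def scan_logs(log_dir, pattern=KEYWORDS):
    """Return (path, line_number, text) for each line under log_dir that matches pattern."""
    hits = []
    for root, _dirs, files in os.walk(log_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            try:
                # errors="replace" keeps the scan going if a log contains odd bytes.
                with open(path, "r", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if pattern.search(line):
                            hits.append((path, number, line.rstrip("\n")))
            except OSError:
                # Skip files that cannot be opened (permissions, sockets, etc.).
                continue
    return hits


if __name__ == "__main__":
    # Point this at wherever the copy of /var/log/vmware/aam was saved.
    for path, number, text in scan_logs("/var/log/vmware/aam"):
        print(f"{path}:{number}: {text}")
```

Comparing the matches from the affected host with those from a healthy host over the same time window should make any difference in HA behaviour easier to spot.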

-KjB

VMware vExpert

Don't forget to leave points for helpful/correct posts.

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB

8 Replies
arsudarsan
Contributor

Can someone please clarify this for me?

krowczynski
Virtuoso

Are both ESX hosts connected to only one physical switch?

MCP, VCP3 , VCP4
arsudarsan
Contributor

Of the 3 hosts in the cluster, the host that had the issue and one other are on the same switch.

krowczynski
Virtuoso

What kind of network outage was it?

If the switch that all the hosts are connected to was down, HA could not work because all network connectivity was lost.

Or have I misunderstood you?

arsudarsan
Contributor

Apologies for the delay in replying and for not being clear in the first instance.

During the network maintenance, the switches were upgraded. The link went down twice, for about 15-20 minutes, during which the hosts were not able to connect to each other. Also, this cluster is in our DR datacenter while our VC is located in the PROD datacenter. The link between the two was also updated, so VC was not available during the maintenance.

My question is why only the VMs on the particular host I mentioned in the OP rebooted, while the ones on the other hosts did not. During the first outage of 20 minutes, the VMs on that host powered off, and they came back up on the other 2 hosts only after 2 hours, once VC was available.

kjb007
Immortal

It looks like your HA agent may have died along with the host's connection to VC. There are several HA-related logs under /var/log/vmware/aam. You can check those to see if they offer any further insight into why HA acted differently on one host than on the others.

-KjB

arsudarsan
Contributor

Yes, we suspect the same. Unfortunately, the logs from before the date of the network maintenance are not available. If HA had been lost during the first disconnection, VC should at least have shown a red triangle on the host indicating an HA error, but we did not find any. This is why we are not able to find an exact root cause.

kjb007
Immortal

Did you not say in your post that the host remained disconnected until a reboot or maintenance was done on the server itself?

-KjB
