Current HA policy failing on Service Console Network

In our environment we have six ESX hosts in one cluster, connected to a pair of physical switches. Each vSwitch has an uplink to both physical switches, so every network, including management / service console, is redundant. In the last four weeks we have had an issue where the UPS feeding one of our switches has tripped. This causes our entire cluster to become unavailable, and while some VMs remain online, others do not. This looks to be attributable to the HA heartbeat interval and the default policy of shutting down VMs when the host connection is lost, since the host hits the timeout interval while the switch reboots.

We have redundant service console connections on each server: vmnic0 goes to switch 1 and vmnic1 goes to switch 2. I currently have the service console/VMotion vSwitch configured in active/active mode with both NICs. I have been reviewing the HA documents, and they talk about using active/passive with a rolling failover policy. So my question is: if I have both NICs in active/active going to two different switches, is that possibly why I am losing service console/management access when one of the switches reboots? I know for a fact that if I pull one of the cables from the SAN or network vSwitches, the VMs automatically fail over to the other NICs in the vSwitch team and service continues uninterrupted. However, this has not proven to be the case with the service console connections, so I am confused as to why the second active connection is not keeping access up when one of the switches fails.
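
To make my timeout theory concrete, here is a rough Python sketch of the behavior I would expect versus what I seem to be seeing. It is only a simplified model: the function name and the numbers are made up for illustration and are not taken from the HA documentation or from my actual settings.

```python
def host_declares_isolation(outage_s: float, timeout_s: float,
                            surviving_uplink_ok: bool) -> bool:
    """Return True if the host would consider itself isolated (simplified model)."""
    if surviving_uplink_ok:
        # Heartbeats keep flowing over the other NIC, so the timeout is never reached.
        return False
    # No heartbeat path at all: isolation is declared once the outage
    # lasts at least as long as the timeout.
    return outage_s >= timeout_s


# Hypothetical numbers, for illustration only.
SWITCH_REBOOT_S = 120.0
TIMEOUT_S = 15.0

# What I expected: vmnic1 on switch 2 carries the heartbeats.
print(host_declares_isolation(SWITCH_REBOOT_S, TIMEOUT_S, surviving_uplink_ok=True))

# What appears to happen: no heartbeats get through while switch 1 reboots.
print(host_declares_isolation(SWITCH_REBOOT_S, TIMEOUT_S, surviving_uplink_ok=False))
```

In this model the outage only matters if the second uplink is not actually passing service console traffic, which is exactly the part I cannot explain.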

Any thoughts or ideas on this would be greatly appreciated.



I'm still hoping someone can weigh in on this. I am looking for a definitive answer as to whether the service console NICs must be configured in an active/passive state for proper failover, or whether that is simply a best practice.

Secondly, how are others configuring their environments to avoid these types of outages? Are you increasing the heartbeat interval? I am still confused as to why I lost management access once switch 1 rebooted, even though the second service console NIC was connected to switch 2 and was active.
