In our environment, it appears that network changes or host misconfiguration are more likely to trigger an HA event than an actual host failure.
The things I've done to reduce these events:
Always having two pNICs on the Service Console vSwitch.
Each NIC goes to a separate physical switch.
Spanning Tree Protocol (STP): disable STP on physical network interfaces connected to the ESX Server host. For Cisco-based networks, enable PortFast mode for access interfaces or PortFast trunk mode for trunk interfaces (saves about 30 seconds during initialization of the physical switch port).
EtherChannel negotiation, such as PAgP or LACP: must be disabled because it is not supported.
Trunking negotiation: disable it as well (saves about four seconds).
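For anyone wanting to see the switch side, the port settings above might look roughly like this on a Cisco IOS switch. This is only a sketch: the interface names are placeholders, and exact command availability varies by platform and IOS version, so check it against your own switches before using it.

```
! Access port facing an ESX uplink (interface name is a placeholder)
interface GigabitEthernet0/1
 switchport mode access
 ! Do not send trunk negotiation (DTP) frames
 switchport nonegotiate
 ! Skip the STP listening/learning delay when the link comes up
 spanning-tree portfast
 ! No channel-group statement, so no PAgP/LACP negotiation on this port
!
! Trunk port facing an ESX uplink that carries several VLANs
interface GigabitEthernet0/2
 ! Needed only on platforms that support both ISL and 802.1Q; omit elsewhere
 switchport trunk encapsulation dot1q
 switchport mode trunk
 switchport nonegotiate
 spanning-tree portfast trunk
```

The idea is that the port type is set statically, so the port starts forwarding as soon as the link comes up instead of waiting on negotiation.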
What other things could I do?
You could increase the timeout: das.failuredetectiontime (milliseconds). In most environments 15 seconds is a bit too eager, I think.
Also, have a second service console port on a different network segment.
I've considered that. So, with a second Service Console, a host isn't considered isolated unless both IPs are inaccessible? Is that correct?
Agreed, 15 seconds is a relative hair trigger, and I remembered that I had already set this to 60 seconds.
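Since the option is given in milliseconds, a 60-second timeout would be entered in the cluster's HA advanced options as something like the following. This is a sketch of the name/value pair only; where exactly it is entered depends on the VirtualCenter version.

```
das.failuredetectiontime = 60000
```

That is 60 x 1000 ms, compared with the 15000 ms (15 seconds) discussed above.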