Dear community
The environment:
- ESXi 4.1 U1 HA-DRS Cluster (3 Hosts)
- vCenter is running in a VM within this cluster
- iSCSI LUNs as datastores
- Redundant vSwitches: e.g. vmnic0 -> SwitchA, vmnic1 -> SwitchB
- Switches: Cisco 2960G, no channeling/trunking possible
Now, SwitchA went down due to a power failure. Given that vSwitch0 consists of two vmnics (vmnic0, vmnic1) in an active/active configuration, I would expect ESXi to transparently remove the failed link and continue using only vmnic1, resulting in almost zero packet loss.
Reality looked different. Due to the switch failure, the log showed:
"Lost uplink redundancy on virtual Switch "vSwitch0". Physical NIC vmnic0 is down. Affected portgroups: ..."
Shortly after I got these messages:
"Node esxi1 has stopped receiving heartbeats from Primary node esxi2 1/9. Declaring node as unresponsive."
"user esxi1 VMware HA Agent Isolated, Notifying VPXA"
Due to the isolation, all VMs were shut down according to the HA configuration, which is expected.
So the failover did not work as expected and all three hosts were isolated. Because all three are set up the same way, the behaviour was the same on all hosts.
Management Network "Failover and Load Balancing" Parameters:
Load Balancing: Port ID
Network Failure Detection: Link status only
Notify Switches: Yes
Failback: Yes
Active Adapters: vmnic0,vmnic1
Standby Adapters: None
Unused Adapters: None
Do you have any idea what could be wrong?
Thx
How are the physical ports configured? If Spanning Tree is enabled and the ports are not set to "spanning-tree portfast", it can take up to ~45 seconds for the link to come up on the other switch/port. Depending on the HA isolation response settings (default: 15 seconds), this could cause HA to trigger.
Once you take a look at the physical port configuration, also make sure the ports are set to "switchport mode access" to allow multiple MAC addresses to register on this port.
If you are working with VLANs, you may use "spanning-tree portfast trunk" and "switchport mode trunk".
André
Hi
Thank you for your hints. A switchport config looks like this:
interface GigabitEthernet0/6
description description esxi1 vmnic0
switchport trunk native vlan 99
switchport trunk allowed vlan 1,100,172
switchport mode trunk
So there are actually VLANs configured (VLAN 1 as management).
Regarding spanning tree, the current port status is
Interface           Role Sts Cost      Prio.Nbr Type
------------------- ---- --- --------- -------- --------------------------------
Gi0/6               Desg FWD 4         128.6    P2p
on both ports (on both switches). Do I have to change it to portfast?
Kind regards
There are a couple of switch port settings that you should look at. "spanning-tree portfast trunk" is definitely one of them; otherwise you need to configure a longer failure detection time (see http://kb.vmware.com/kb/1006421).
For a sample configuration, see http://kb.vmware.com/kb/1004074
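As a rough sketch only (not copied from the KB articles), the trunk port posted above with PortFast added might look something like this. The interface number and VLAN list are taken from the earlier post, the description line is left out, and the "!" lines are just annotations; the only new setting is the last line.

```
interface GigabitEthernet0/6
 ! existing trunk settings, as posted above
 switchport trunk native vlan 99
 switchport trunk allowed vlan 1,100,172
 switchport mode trunk
 ! new: skip the listening/learning delay on this host-facing trunk port
 spanning-tree portfast trunk
```

The same change would presumably be needed on the vmnic1 port on the other switch and on the ports of the other two hosts. "show spanning-tree interface GigabitEthernet0/6 portfast" can be used to check whether it took effect.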
André
Yes, you will need to set it to portfast or portfast trunk to prevent things like this from happening.
Duncan