VMware Cloud Community
joepe
Contributor

Host isolation although uplink is redundant

Dear community

The environment:

- ESXi 4.1 U1 HA-DRS Cluster (3 Hosts)

- vCenter is running in a VM within this cluster

- iSCSI LUNs as datastores

- Redundant vSwitches: e.g. vmnic0 -> SwitchA, vmnic1 -> SwitchB

- Switches: Cisco 2960G, no channeling/trunking possible

Now, SwitchA went down due to a power failure. Given that vSwitch0 has two uplinks (vmnic0, vmnic1) in an active/active configuration, I would expect ESXi to transparently drop the failed link and continue on vmnic1 alone, with almost no packet loss.

Reality looked different. Due to the switch failure, the log showed:

"Lost uplink redundancy on virtual Switch "vSwitch0". Physical NIC vmnic0 is down. Affected portgroups: ..."

Shortly after I got these messages:

"Node esxi1 has stopped receiving heartbeats from Primary node esxi2 1/9. Declaring node as unresponsive."

"user esxi1 VMware HA Agent Isolated, Notifying VPXA"

Due to the isolation, all VMs were shut down according to the HA configuration, which is expected.

So, the failover did not work as expected and all three hosts were isolated. Because all three are set up the same way, the behaviour was the same on all hosts.

Management Network "Failover and Load Balancing" Parameters:

Load Balancing:          Port ID

Network Failure Detection: Link status only

Notify Switches:          Yes

Failback:                     Yes

Active Adapters:          vmnic0,vmnic1

Standby Adapters:         None

Unused Adapters:          None

Do you have any idea what could be wrong?


Thx

a_p_
Leadership

How are the physical ports configured? If Spanning Tree is enabled (and the ports are not set to spanning-tree portfast), it can take up to ~45 seconds for the link to come up on the other switch/port. Depending on the HA isolation response settings (default failure detection time: 15 seconds), this could cause HA to trigger.

When you look at the physical port configuration, also make sure the ports are set to "switchport mode access" to allow multiple MAC addresses to register on this port.

If you are working with VLANs, you may use "spanning-tree portfast trunk" and "switchport mode trunk".

André

joepe
Contributor

Hi

Thank you for your hints. A switchport config looks like this:

interface GigabitEthernet0/6
description description esxi1 vmnic0
switchport trunk native vlan 99
switchport trunk allowed vlan 1,100,172
switchport mode trunk

So there are actually VLANs configured (VLAN 1 as management).

Regarding spanning tree, the current port status is

Interface           Role Sts Cost      Prio.Nbr Type
------------------- ---- --- --------- -------- --------------------------------
Gi0/6               Desg FWD 4         128.6    P2p

on both ports (on both switches). Do I have to change it to portfast?

Kind regards

a_p_
Leadership

There are a couple of switch port settings that you should look at. "spanning-tree portfast trunk" is definitely one of them; otherwise you need to configure a longer failure detection time (see http://kb.vmware.com/kb/1006421).

For a sample configuration, see http://kb.vmware.com/kb/1004074
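
As a rough sketch of what that change could look like on the port posted above: the interface config would gain one line. This assumes the port connects only to the ESXi host and never to another switch, since portfast skips the listening and learning states. The lines starting with "!" are annotations, not part of the posted config.

```
interface GigabitEthernet0/6
description description esxi1 vmnic0
switchport trunk native vlan 99
switchport trunk allowed vlan 1,100,172
switchport mode trunk
! Assumption: host-facing port only. Lets the trunk port go straight to
! forwarding instead of waiting through listening/learning.
spanning-tree portfast trunk
```

The same line would be needed on the matching port of the second switch, and on every other port that carries an ESXi uplink. Whether it took effect can be checked with "show spanning-tree interface GigabitEthernet0/6 portfast".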

André

depping
Leadership

Yes, you will need to set it to portfast or portfast trunk to prevent things like this from happening.

Duncan
