VMware Cloud Community
KKrtz
Contributor

vMotion Multi-NIC setup / network outage

Hello everyone,

For a few months I have had a strange and serious problem with vMotion in a multi-NIC setup. The issue does not happen in a single-NIC configuration. Unfortunately, VMware support did not find anything, so maybe somebody here can help me identify the source of the problem.

My environment:

Two HP ProLiant DL380 G7 servers running ESXi 5.1

Both HP servers have an NC360T dual-port NIC installed

vMotion setup:

Separate vMotion VLAN

Both vmknics are in the same subnet and VLAN

Both vmknics are connected to one vSwitch

The vSwitch has two pNICs in active/active state

The vmknics have different failover orders on the NICs
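
The failover layout above can be sketched as a small consistency check. This is only an illustration, assuming the usual multi-NIC vMotion pattern in which each vMotion port group has exactly one active uplink, the other uplink on standby, and the order mirrored between the two port groups. The port group and uplink names (`vMotion-01`, `vmnic0`, and so on) are hypothetical and not taken from this environment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortGroupTeaming:
    """Failover order of one VMkernel port group (all names are hypothetical)."""
    name: str
    active: tuple
    standby: tuple


def check_multi_nic_vmotion(port_groups, uplinks):
    """Return a list of problems; an empty list means the layout is mirrored."""
    problems = []
    for pg in port_groups:
        # Assumption: each vMotion port group pins itself to one active uplink.
        if len(pg.active) != 1:
            problems.append(f"{pg.name}: expected exactly one active uplink")
        # Every uplink of the vSwitch should appear as active or standby.
        if set(pg.active) | set(pg.standby) != set(uplinks):
            problems.append(f"{pg.name}: does not use every vSwitch uplink")
    # No two port groups should share the same active uplink.
    active_uplinks = [pg.active[0] for pg in port_groups if len(pg.active) == 1]
    if len(set(active_uplinks)) != len(active_uplinks):
        problems.append("several port groups share the same active uplink")
    return problems


if __name__ == "__main__":
    layout = [
        PortGroupTeaming("vMotion-01", ("vmnic0",), ("vmnic1",)),
        PortGroupTeaming("vMotion-02", ("vmnic1",), ("vmnic0",)),
    ]
    print(check_multi_nic_vmotion(layout, ["vmnic0", "vmnic1"]))
```

With a mirrored layout like the one in the usage example, the function reports no problems; if both port groups listed the same uplink as active, it would flag that.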

Regarding the vMotion pNICs:

The first NIC is onboard; the second NIC is on the NC360T

Both NICs are connected to one Cisco Catalyst 2960 switch

Now my problem description:

I select around 15 VMs to migrate from one host to the other. Everything goes fine for maybe a few minutes, until my whole backbone goes "crazy" and certain VMs, routers, switches, etc. are unreachable for around 10 seconds. At the same time I can see on the switch where the vMotion pNICs are connected that the links from the target host go down and come back up a few seconds later (like a restart of the adapters)!

I don't think it is related to a hardware defect, because it happens regardless of whether I migrate from Host1 or from Host2.

Does anybody have an idea where my problem is, or how to identify the source?

Thanks for any comments.

27 Replies
rickardnobel
Champion

KKrtz wrote:

Does anybody have an idea where my problem is, or how to identify the source?

It could be that your physical switches cannot handle the load, which causes these disconnections. With multi-NIC vMotion the amount of traffic sent between the hosts can be very high.

What kind of physical switch (or switches) do you have connected to the hosts?

My VMware blog: www.rickardnobel.se
KKrtz
Contributor

Cisco Catalyst 2960-S ... Backplane should be able to handle the traffic!

rickardnobel
Champion

When you say that you lose connections to routers and other network devices, is that only from inside the ESXi hosts, or do other network-attached devices lose their connections as well?

michaelstump
Enthusiast

I wonder if using the HP Customized ESXi ISO would help? Maybe this is a driver issue that only appears under heavy load?

Data Center Virtualization with VMware - theeagerzero.blogspot.com
KKrtz
Contributor

@rickardnobel: my network monitoring sends me several alarms, and my client PC also cannot reach some VMs in the LAN or WAN.

@michaelstump: I am already using the HP image.

At the moment I am following some hints regarding unicast flooding and ARP timeouts affecting the spanning tree infrastructure... Does anybody have experience with this, or any hints?

rickardnobel
Champion

KKrtz wrote:

At the moment I am following some hints regarding unicast flooding and ARP timeouts affecting the spanning tree infrastructure... Does anybody have experience with this, or any hints?

vMotion and ARP should not impact Spanning Tree in any way. However, if some error is causing your network card to be disconnected, then Spanning Tree might make the situation worse, not least if you are running the original, now obsolete STP.

KKrtz
Contributor

As far as I understand, the spanning tree can get flooded if the switch loses the destination MAC from its table after a timeout... this can fill up the uplinks to the core...

a_p_
Leadership

A few thoughts:

  • please provide the physical switch port configurations (i.e. show run int giX/X)
  • is your vMotion VLAN a routed or non-routed subnet?
  • can you confirm that the vSwitches as well as the port groups use default settings, except for the active/standby configuration?
  • can you confirm that none of the other VMkernel ports have "vMotion" accidentally enabled?

André

rickardnobel
Champion

KKrtz wrote:

As far as I understand, the spanning tree can get flooded if the switch loses the destination MAC from its table after a timeout... this can fill up the uplinks to the core...

Spanning Tree does not really care about MAC forwarding; its only responsibility is to make sure we have a loop-free layer 2 topology.

The ordinary switch forwarding engine controls the MAC-to-port mappings and might flood frames if the tables get full. In normal cases this should last only a fraction of a second, since once a reply comes in the switch re-learns the MAC-to-port mapping.
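
To illustrate the difference, here is a deliberately simplified model of a learning switch. It is only a sketch: it shows flooding for an unknown or aged-out destination MAC and the re-learning once a reply arrives, and it ignores table size limits, VLANs and Spanning Tree entirely. The class name, port numbers and MAC labels are made up for the example, and the 300-second aging time is an assumed default.

```python
class LearningSwitch:
    """Toy model of MAC learning, aging and unknown-unicast flooding."""

    def __init__(self, ports, aging_seconds=300):
        self.ports = list(ports)
        self.aging_seconds = aging_seconds
        self.table = {}  # MAC address -> (port, time last seen as a source)

    def receive(self, src_mac, dst_mac, in_port, now):
        """Handle one frame and return the list of ports it is sent out of."""
        # Learn (or refresh) the source MAC on the ingress port.
        self.table[src_mac] = (in_port, now)

        # Drop the destination entry if it has aged out.
        entry = self.table.get(dst_mac)
        if entry is not None and now - entry[1] > self.aging_seconds:
            del self.table[dst_mac]
            entry = None

        if entry is None:
            # Unknown destination: flood out of every port except the ingress.
            return [p for p in self.ports if p != in_port]
        if entry[0] == in_port:
            # Destination lives on the ingress port: nothing to forward.
            return []
        return [entry[0]]


if __name__ == "__main__":
    sw = LearningSwitch(ports=[1, 2, 3])
    print(sw.receive("A", "B", in_port=1, now=0))    # B unknown: flooded
    print(sw.receive("B", "A", in_port=2, now=1))    # reply teaches the switch B
    print(sw.receive("A", "B", in_port=1, now=2))    # now forwarded to one port
    print(sw.receive("A", "B", in_port=1, now=400))  # B aged out: flooded again
```

In this model flooding stops as soon as the destination sends any frame, which is why it is normally short-lived; it only persists if the destination stays silent for longer than the aging time.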

KKrtz
Contributor

You are right ... and I will check whether there are any debug options! ... I still don't understand why the vMotion pNIC links are going down.

KKrtz
Contributor

@a_p_:

  • switchport config for all vMotion/ESXi ports:
    switchport mode trunk
    switchport nonegotiate
    spanning-tree portfast trunk
  • vMotion is a non-routed VLAN but available in the whole switch infrastructure
  • all vSwitches have default settings except failover order
  • three vmknics are configured:
    1x management / replication - vswitch1
    2x only vmotion - vswitch0

a_p_
Leadership

I don't see anything wrong with the settings.

Although it is very unlikely to be the issue, to rule it out it might be worth trying to configure the physical ports as access ports and remove the VLAN tag from the port groups.

André

rickardnobel
Champion

Could you also double-check that you have "Notify Switches" enabled on the vMotion VMkernel port group?

KKrtz
Contributor

The setting "notify switches" is activated on all vswitches in my environment!

rickardnobel
Champion

KKrtz wrote:

my network monitoring sends me several alarms, and my client PC also cannot reach some VMs in the LAN or WAN.

Are any of those alarms about physical devices that could not reach other physical devices?

KKrtz
Contributor

Physical-to-physical connections have the same problems as the VMs...

rickardnobel
Champion

KKrtz wrote:

Physical-to-physical connections have the same problems as the VMs...

That is interesting of course, since it means this is not only a logical problem inside the vSphere hosts, but something that really happens on your physical network.

Do you have access to the 2960 via CLI / Telnet / SSH? Could you check the logs to see whether something obvious happens when you do the vMotion? Look for any Spanning Tree events. These should not happen, but if the host NIC is overloaded and disconnected for some reason, that might trigger an STP recalculation.
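
If the switch log is exported as text, counting link-down events per interface can be automated. The following is a rough sketch, assuming the log contains the standard IOS `%LINK-3-UPDOWN` messages; the sample lines and interface names are invented for illustration and are not output from the switch in this thread.

```python
import re
from collections import Counter

# Physical link state changes as logged by Cisco IOS.
LINK_EVENT = re.compile(
    r"%LINK-3-UPDOWN: Interface (?P<interface>[^\s,]+), "
    r"changed state to (?P<state>up|down)"
)

# Hypothetical sample lines, made up for this example.
SAMPLE_LOG = [
    "%LINK-3-UPDOWN: Interface GigabitEthernet1/0/5, changed state to down",
    "%LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet1/0/5, changed state to down",
    "%LINK-3-UPDOWN: Interface GigabitEthernet1/0/5, changed state to up",
    "%LINK-3-UPDOWN: Interface GigabitEthernet1/0/6, changed state to down",
]


def count_link_flaps(log_lines):
    """Count physical link-down events per interface."""
    flaps = Counter()
    for line in log_lines:
        match = LINK_EVENT.search(line)
        if match and match.group("state") == "down":
            flaps[match.group("interface")] += 1
    return flaps


if __name__ == "__main__":
    for interface, count in count_link_flaps(SAMPLE_LOG).items():
        print(f"{interface}: {count} link-down event(s)")
```

Comparing the timestamps of such events with the start of the vMotion would show whether the link drops come before or after any Spanning Tree activity.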

KKrtz
Contributor

Yes it is... I have to enable debugging for spanning tree on the involved switches and run a test. Unfortunately this is not something I can do quickly during working hours 😉

rickardnobel
Champion

KKrtz wrote:

Yes it is... I have to enable debugging for spanning tree on the involved switches and run a test. Unfortunately this is not something I can do quickly during working hours 😉

You might not have to enable debug mode. Could you just run this command and paste the results?

show spanning-tree
