wkucardinal
Contributor

Random virtual machines lose network after vMotion or Storage vMotion

Hi all -

I have a couple of tickets open with VMware and our SAN vendor, EqualLogic, on this issue.  Since configuring our production and DMZ clusters we have been noticing that virtual machines will sometimes drop network connectivity after a successful vMotion or Storage vMotion.  Occasionally, though far less frequently, virtual machines will also spontaneously lose network overnight.  This has only happened a few times.  The strange thing is that other guests on the VM host are fine - they do not lose network at all.  In fact, I can fail over 3 virtual machines from one host to another, and 2 of the 3 may fail over correctly, while one will lose network.  The workaround?  Simply "disconnect" the virtual NIC and "reconnect" it, and the VM will start returning packets.  I can also fail the troubled VM back over to the prior host and it will regain network.  I can reboot it and it will regain network.  I can re-install the virtual adapter completely, and it will regain network.

VMware saw a bunch of SAN errors in our log files, so we updated our SAN firmware to the latest version.  That seems to have cleared those errors, but we still have the issue.  Here are some of the specs - all environments are virtually identical except for memory:

PowerEdge R810's

Broadcom 5709 NICs

EqualLogic SAN running 5.0.5 F/W

We are using jumbo frames.  ESXi is fully patched.  I have not seen a pattern suggesting that only certain guest OSes lose network, but we are primarily a Windows environment.
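
For what it's worth, the jumbo frame path can be spot-checked end-to-end from the ESXi console; here is a minimal sketch, where the target address is just a placeholder for one of our SAN group IPs:

vmkping -d -s 8972 10.10.10.10    # 8972 bytes of payload + 28 bytes of headers = 9000; -d sets "don't fragment"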

When a virtual machine loses network, we cannot:

  • ping to it
  • ping from it
  • ping from it to virtual machines on the same host or vSwitch
  • ping outside our network
  • resolve DNS, etc.

I have followed certain VMware KBs to no success, including:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=100383...
http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&externalId=1002811 (Port Security is not enabled)

-VMware Tools has been updated to the latest correct version on all VMs and matches the ESXi host
-Logged onto the ESXi service console, I cannot ping the troubled VM by host name or by IP address, but I can ping OTHER virtual machines not experiencing the issue.  I can also ping external addresses from the service console.
-Logged into the troubled VM itself, I cannot ping other VMs, I cannot resolve host names, and I cannot ping by IP.  The VM CAN ping itself by IP but not by hostname.  I cannot ping other VMs on the same virtual switch or network by either IP or host name, and I cannot ping the management network vSwitch.  (See the uplink check sketched after this list.)
-All vSwitches are configured identically and named the same.
-Notify switches is set to yes
-There are plenty of available virtual ports
-We have tried both E1000 and VMXNET virtual adapters with no difference.
-All adapters are configured to negotiate, but we have tried forcing particular speeds as well with no difference
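
One more data point worth collecting is which physical uplink the affected VM's port is pinned to at the moment it breaks.  A rough sketch of how to see that from the ESXi shell (nothing below is specific to our setup):

esxtop              # press 'n' for the network view; the TEAM-PNIC column shows which vmnic each VM port is currently using
esxcfg-vswitch -l   # lists the vSwitches, port groups and their uplinks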

I do appreciate your help.  I am having trouble getting anywhere on this issue with the vendors.


Accepted Solutions
rickardnobel
Champion

wkucardinal wrote:

Still having the issue.

Sometimes it can be useful to really verify that all VMNIC uplinks work for all VLANs.  One way to try this is to create a new portgroup on the vSwitch used by your VMs on the first host, put one test VM on this portgroup, then go into the NIC teaming policy of the new portgroup and select "Override switch failover order".

Then move down all VMNICs except one to Unused, so only one VMNIC is active.  Then set the portgroup VLAN settings to one of the production VLANs and see if you can ping some expected addresses.  If this works, move the working VMNIC down to Unused and move another up to Active.  Try again, and do this for all VMNICs.  If everything works, then you have verified that the VLAN configuration and other settings are correct on the physical switch ports facing this host.

If you have multiple VLANs, repeat the process for all the other production VLANs.  Then repeat the process on the other hosts.

While this might take a while, it will verify whether everything is correctly configured on the physical switches.  When a vMotion takes place the VM gets a new "Port ID" and is assigned a new outgoing VMNIC.  If there is a configuration error on one or several physical switch ports, this could seem random, but perhaps it always happens on VLAN x on VMNIC y.  Since the Port ID policy you are using will in effect randomly spread the VMs over the VMNICs, these problems can be hard to diagnose.  (Doing a disconnect of the virtual machine vNIC gives the VM a new Port ID, which moves it to a new outgoing VMNIC and might seem to solve the problem.)
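
If the hosts are on ESXi 5.x, the same per-uplink test can also be driven from the shell.  A rough sketch, where the portgroup name, vSwitch, VLAN ID and vmnic are placeholders:

esxcli network vswitch standard portgroup add -p UplinkTest --vswitch-name vSwitch1
esxcli network vswitch standard portgroup set -p UplinkTest --vlan-id 100
esxcli network vswitch standard portgroup policy failover set -p UplinkTest --active-uplinks vmnic0

Repeat the last command with vmnic1, vmnic2 and so on, pinging from the test VM after each change.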

My VMware blog: www.rickardnobel.se


35 Replies
nathanw
Enthusiast

I have had this issue previously and I am trying to recall exactly how we fixed it!

Are you using vSwitches or Distributed switches?

How are your uplinks configured? Please advise your switch policy settings.

I seem to recall that it was a mix of VM and physical network configuration issues: when a VM migrated, the physical switches still sent its traffic to the old port where the MAC was last registered.

How many hosts are in the cluster? Are all vSwitch/Distributed Switch uplinks connected, and are all the VLANs correctly applied? They presumably are, given that a disconnect and reconnect resolves the issue; the reconnect also forces a broadcast of the VM's new location.

I would look more closely at your physical switch configuration; check the encapsulation used for trunking and how notifications are handled.
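
On the Cisco side, something like the following will show the trunk encapsulation, the allowed VLANs and the per-port configuration (the interface name is only a placeholder for one of the ESX uplink ports):

show interfaces trunk
show running-config interface GigabitEthernet1/0/1
show spanning-tree interface GigabitEthernet1/0/1 portfast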

Good luck

Nathan VCP
ats0401
Enthusiast

What NIC teaming policy are you using? How are the ports set on the physical switch?

I had a similar problem where I had set up all my vSwitches to use Route based on IP hash.

The LAN admin had incorrectly configured the port-channel groups in EtherChannel, so there was a lot of host flapping and all kinds of errors.

What kind of switch are you connected to?

I would get into the switch and monitor the console/log files in real time during a vMotion, both when it works and when it stops working.
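
A rough sketch of how to watch this live from the switch; the MAC address below is a placeholder for the affected VM's adapter (VMware MACs start with 00:50:56):

! echo syslog messages to this session while the vMotion runs
terminal monitor
! run before and after the vMotion to see which port has learned the VM's MAC
! (older IOS uses "show mac-address-table address" instead)
show mac address-table address 0050.56xx.xxxx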

wkucardinal
Contributor

We are using both vSwitches and Distributed Switches.  We have not set any special switch policy settings.

Our vSwitch controls the management network and our iSCSI network.  The distributed switch covers our "business" network and our FT/vMotion network.

To answer the question about the number of hosts in each cluster, this is where things get interesting.

We have:

DMZ Cluster

  • 2 hosts
  • 2 separate DMZ switches
  • connected back to the EqualLogic SAN through one of three Catalyst 3750E switches

Production Cluster

  • 3 hosts
  • 2 core switches
  • connected back to the same EqualLogic SAN through the same three Catalyst 3750E switches

We are seeing identical behavior in both environments, so it would seem that if this were a networking issue it is probably somewhere in that SAN connection, as that is the main common factor.  This might also explain why we sometimes lose connectivity when we only migrate storage.  Are there any gotchas we should look for here?

The way the vSwitch is configured for iSCSI is the following (a way to confirm these values from the shell is sketched after the list):

Network failover detection: Link status only

Notify switches? Yes

Failback: Yes

Promiscuous Mode - Reject

MAC Changes - Accept

Forged Transmits - Accept
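
(If the hosts are on ESXi 5.x, these values can also be pulled straight from the shell; vSwitch0 below is just a placeholder for our iSCSI vSwitch.)

esxcli network vswitch standard policy failover get -v vSwitch0
esxcli network vswitch standard policy security get -v vSwitch0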

wkucardinal
Contributor

ATS -

As far as I know we are not using EtherChannel, and we have things set up as Route based on the originating virtual port ID.  For failover detection we have it set to link status only, notify switches yes, and failback yes.

As for teaming, all 4 iSCSI uplinks are Active at the vSwitch level, and then on each iSCSI port we have one adapter listed as Active with the other 3 set to Unused.

Can a loss of network connectivity be tied to the SAN switch when the VM maintains a good connection with the SAN?  The VM itself never stops running.

Mouhamad
Expert

Hello,

The issue you're facing is not on the VMware side. I believe you have Cisco switches. What's happening is that when you vMotion a VM from one host to another, it stops using the old host's NICs and starts using the new host's NICs. The problem is that the ARP tables on your switches are not getting cleared immediately.

This issue occurs when you have an old Cisco IOS; update the IOS and you should be fine.
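
If you want to confirm this before touching the IOS, you can look at (and clear) the learned entries by hand right after a vMotion.  A sketch, with the MAC and interface as placeholders:

show mac address-table aging-time
! where is the VM's MAC currently learned? (older IOS: "show mac-address-table address")
show mac address-table address 0050.56xx.xxxx
! if the theory is right, clearing the stale dynamic entries should restore traffic
clear mac address-table dynamic interface GigabitEthernet1/0/1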

Good luck..

Regards,

VCP-DCV, VCP-DT, VCAP-DCD, VSP, VTSP
ats0401
Enthusiast

Can you post a port config from the Cisco switch for an iSCSI-configured port and a network port?

Also did you check the switch logs?

I think you need to rule out the physical switches first

Perhaps console into the switch, start a vMotion of a few VMs, and see if any errors pop up on the switch console.

wkucardinal
Contributor

Mouhamad -

What is the earliest IOS version you would recommend on the Cisco switches?

We have:

Catalyst 3750E's for our iSCSI Connections

Catalyst 6500 for our core networking switches

Our DMZ switches:

Catalyst 2950's

Would your explanation also explain why this issue occurs when doing either vMotion or Storage vMotion?

wkucardinal
Contributor

ATS -

We are going to test the ARP flush procedure first.  I cannot post a port config, but if you have specific questions I would be glad to tell you what we have set up.  We are also going to watch the logs while we vMotion to see if they show anything.  I think you guys are on to something with the ARP cache.  This would also explain why we sometimes have VMs lose network overnight - we are using DRS, so they probably have migrated.  I can't believe I did not think of that before!

ats0401
Enthusiast

I agree with Mouhamad; it is most likely the physical switches that are the culprit. VMware and EqualLogic are usually too stable to have problems like this.

Unless it's a really old IOS though, I think it would most likely be a configuration problem within the physical switches.

Are you doing the basic stuff (spanning-tree portfast enabled, MTU 9000 and flow control receive on the iSCSI ports, etc.)?
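
On a 3750E that usually looks roughly like the sketch below; the interface, VLAN and description are placeholders, and the global jumbo MTU only takes effect after a reload:

system mtu jumbo 9000
!
interface GigabitEthernet1/0/10
 description ESXi iSCSI uplink
 switchport mode access
 switchport access vlan 100
 flowcontrol receive desired
 spanning-tree portfast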

No port-channel settings on anything, correct?

wkucardinal
Contributor

EqualLogic says not to use STP; we could use RSTP, but we do not have that on.  Our IOS is about 5 revisions behind (September 2009).  Our MTU is 9000 end-to-end.  Unicast storm control is on.

We do have some port channel settings on our production switches, but they are not on the DMZ switches, so I don't think that's the problem.

ats0401
Enthusiast

So have you disabled spanning tree on the switches? You usually have to be extremely careful when disabling STP on a switch; it is generally not a good idea.

spanning-tree portfast tells the port to skip the STP listening and learning states and go straight to forwarding.

So verify whether STP is indeed disabled; if it is not, I recommend adding spanning-tree portfast (or portfast trunk) to each ESX-facing port.

Are ANY of the physical NICs from the ESX servers connected to a switch port that is part of a port-channel group? If so, this will cause exactly the type of problems you are describing. Make sure none of the ports for the ESX servers have channel-group mode on or any other port-channel setting.
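
A quick way to check both points on the switch (the interface below is a placeholder for an ESX-facing port):

show etherchannel summary
show spanning-tree interface GigabitEthernet1/0/10 portfast
show running-config interface GigabitEthernet1/0/10 | include channel-group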

wkucardinal
Contributor

We have disabled spanning tree only on the switches that pass our iSCSI traffic (and only iSCSI traffic), and re-enabling it is something I will send to our network guys to consider.

So, let's simplify this a little bit, because we could be talking about the network switches carrying our business network in the DMZ or production cluster, or the network switches that only handle iSCSI traffic to our SAN.  Would there be a reason that a misconfiguration on the physical iSCSI switch would cause a specific VM to lose network connectivity after a vMotion or Storage vMotion?  I am trying to make sense of this in my head, but it seems like there would be no connection.  The only thing I can figure is that when you vMotion a VM, the iSCSI switch could somehow think the VM is residing on the wrong host, but why would this break network connectivity?  Wouldn't it crash the VM itself?

I keep wanting to single out the iSCSI switch because it is the common denominator.  It is also running a very old IOS (12.2(35)SE5, from 2007).

ats0401
Enthusiast

I think it has nothing to do with the iSCSI switches at all. The VM would blue screen if it lost storage connectivity.

I think it's 100% to do with the production switches (the 6500s, if I understand the setup correctly).

My suggestion is to get your LAN administrator to sit with you and monitor the switch console and error logs in real time, then throw one of the hosts into maintenance mode (that should trigger a bunch of vMotions).

If you are using VLANs and trunking, the port config should look something like this

(this is the VMware and Cisco recommendation):

interface GigabitEthernet1/1
 description VMware ESX server 1 NIC 1
 switchport trunk encapsulation dot1q
 switchport trunk allowed vlan 100,200,300
 switchport mode trunk
 switchport nonegotiate
 spanning-tree portfast trunk

Did you verify there is no port channel applied to any of the ESX-facing Cisco ports?

wkucardinal
Contributor

There are no port channels for any of the ports. I have scheduled some time with one of our network guys tomorrow to see what we can find. Thanks for your help trying to sort this out. I’ll post back results tomorrow.

Mouhamad
Expert

Sounds like a good plan; best of luck.

VCP-DCV, VCP-DT, VCAP-DCD, VSP, VTSP
wkucardinal
Contributor

Continuing to troubleshoot but here is an update:

As mentioned, we have two different sets of networking switches in play here, a DMZ set and the production internal network switches (2 different VMware clusters).  Both environments are showing similar behavior, but we also have a VMware DR environment that is NOT showing the problem with network connectivity.  Also, the machines in that environment vMotion much faster (losing 1 packet).

We are not using trunking on any of our switches.  The only difference between our production internal switches and our DR switches is that the production switches have additional settings for QoS.  However, the DMZ switches do NOT have the QoS settings, and they show the same problem behavior as the production environment.

wkucardinal
Contributor

Do you guys have a recommended port configuration for the ports in our SAN switch connected up to the ESXi hosts?

nathanw
Enthusiast

PortFast: you must have PortFast enabled on all of your ESX uplinks; if it is not enabled, that is likely where your issue lies.

Nathan VCP
afertmann
Contributor

I agree.  We had the same issue, and PortFast was the culprit.  Enable PortFast on all VM port group uplinks.
