35 Replies Latest reply on Nov 18, 2011 1:45 AM by rickardnobel

    Random virtual machines lose network after vMotion or Storage vMotion

    wkucardinal Novice

      Hi all -

       

      I have a couple of tickets open with VMware and our SAN vendor, EqualLogic, on this issue.  Since configuring our production and DMZ clusters we have been noticing that virtual machines will sometimes drop network connectivity after a successful vMotion or Storage vMotion.  Occasionally, though far less frequently, virtual machines will also spontaneously lose network overnight.  This has only happened a few times.  The strange thing is that other guests on the VM host are fine - they do not lose network at all.  In fact, I can fail over 3 virtual machines from one host to another, and 2 of the 3 may fail over correctly, while one will lose network.  The workaround?  Simply "disconnect" the virtual NIC and "reconnect" it, and the VM will start returning packets.  I can also fail the troubled VM back over to the prior host and it will regain network.  I can reboot it and it will regain network.  I can re-install the virtual adapter completely, and it will regain network.

       

      VMware saw a bunch of SAN errors in our log files so we updated our SAN firmware to the latest version.  That seems to have fixed that but we still have the issue.  Here are some of the specs - all environments are virtually identical except for memory:

       

      PowerEdge R810's

      Broadcom 5709 NICs

      EqualLogic SAN running 5.0.5 F/W

       

      We are using jumbo frames.  ESXi is fully-patched.  I have not seen a pattern regarding whether or not it is only certain guest OS that lose network but we are primarily a Windows environment.

       

      When a virtual machine loses network, we cannot:

       

      • ping to it
      • ping from it
      • ping from it to virtual machines on the same host or vSwitch
      • ping outside our network
      • resolve DNS, etc.

       

      I have followed certain VMware KBs to no success, including:

       

      http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1003839
      http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&externalId=1002811 (Port Security is not enabled)

       

      -All VMware tools have been updated to the latest correct version and match the ESXi host
      -Logged onto the ESXi service console, I cannot ping the troubled VM by host name or by IP address, but I can ping OTHER virtual machines not experiencing the issue.  I can also ping external addresses from the service console.
      -Logged into the troubled VM itself, I cannot ping other VMs, I cannot resolve host names, I cannot ping by IP.  The VM CAN ping itself by IP but not by hostname.  I cannot ping other VMs on the same virtual switch or network by either IP or host name.  I cannot ping the management network vSwitch.
      -All vSwitches are configured identically and named the same.
      -Notify switches is set to yes
      -There are plenty of available virtual ports
      -We have tried both E1000 and VMXNET virtual adapters with no difference.
      -All adapters are configured to negotiate, but we have tried forcing particular speeds as well with no difference

       

       

       

      I do appreciate your help.  I am having trouble getting anywhere on this issue with the vendors.

        • 1. Re: Random virtual machines lose network after vMotion or Storage vMotion
          nathanw Enthusiast

          I have had this issue previously and I am trying to recall exactly how we fixed it!

           

          Are you using vSwitches or Distributed switches?

          How are your uplinks configured? Please  advise your switch policy settings.

           

          I seem to recall that it was a mix of VM and physical network configuration issues, where when a VM migrated the network switches still had the traffic going to the old port where the MAC was last registered??
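To illustrate the stale-MAC idea described above, here is a toy model of a switch's MAC address table. It is only a sketch: the `LearningSwitch` class, the port names and the MAC address are made up for the example, and a real switch also ages entries out and floods unknown destinations.

```python
class LearningSwitch:
    """Toy model of a layer-2 switch's MAC address table."""

    def __init__(self):
        self.mac_table = {}  # MAC address -> port where it was last seen

    def receive(self, src_mac, in_port):
        # A switch learns (or re-learns) a MAC from the source of any frame.
        self.mac_table[src_mac] = in_port

    def egress_port(self, dst_mac):
        # Known destination: forward out of the learned port.
        # Unknown destination: None here (a real switch would flood).
        return self.mac_table.get(dst_mac)


VM_MAC = "00:50:56:aa:bb:cc"  # hypothetical VM MAC address
switch = LearningSwitch()

switch.receive(VM_MAC, in_port="Gi1/1")  # VM is running on host A
before = switch.egress_port(VM_MAC)

# The VM moves to host B (behind Gi1/2). Until a frame with the VM's MAC
# arrives on the new port, the table still points at host A.
stale = switch.egress_port(VM_MAC)

# Any frame sourced from the VM's MAC on the new port corrects the entry,
# for example the notification sent when "Notify switches" is set to Yes.
switch.receive(VM_MAC, in_port="Gi1/2")
after = switch.egress_port(VM_MAC)

print(before, stale, after)
```

If the table behaves like this, disconnecting and reconnecting the virtual NIC would help simply because it makes the VM send traffic from its new location.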

           

          How many hosts are in the cluster? Are all vSwitch/distributed switch uplinks connected and are all the VLANs correctly applied? They obviously should be, given that a disconnect and reconnect resolves the issue; this also forces a broadcast of the new location.

           

          I would be looking closer at your switch configuration, check the encapsulation for trunking and how notifications are handled.

           

          Good luck

          • 2. Re: Random virtual machines lose network after vMotion or Storage vMotion
            ats0401 Enthusiast

            What NIC teaming policy are you using? How are the ports set on the physical switch?

            I had a similar problem where I had set up all my vSwitches to use route based on IP hash.

            The LAN admin had incorrectly configured the port channel groups in etherchannel so there was a lot of host flapping and all kinds of errors.

            What kind of switch are you connected to?

            I would get into the switch and monitor the console/log files in real time during the vmotion of when it works and when it stops working.

            • 3. Re: Random virtual machines lose network after vMotion or Storage vMotion
              wkucardinal Novice

              We are using both vSwitches and Distributed switches.  We do not have switch policy settings.

               

              Our vSwitch controls the management network and our iSCSI network.  The distributed switch covers our "business" network and our FT/vmotion network.

               

              To answer about the number of machines in the cluster, this is where things get interesting.

               

              We have:

               

              DMZ Cluster

              • 2 Hosts
              • 2 separate dmz switches
              • connected back to EqualLogic SAN through one of 3- 3750E catalyst switches

               

              Production Cluster

              • 3 Hosts
              • 2 core switches
              • connected back to same EQL SAN through the same 3 - 3750E catalyst switches

               

              We are seeing identical behavior in both environments, so it would seem that if it were a networking issue it probably is somewhere in that SAN connection as that is the main common factor.  This also might explain why when we just migrate storage, we sometimes lose connectivity.  Are there any gotchas we should look for here?

               

              The way the vSwitch is configured for iSCSI is the following:

               

              Network failure: Link status only

              Notify switches? Yes

              Failback: Yes

              Promiscuous Mode - Reject

              MAC Changes - Accept

              Forged Transmits - Accept

              • 4. Re: Random virtual machines lose network after vMotion or Storage vMotion
                wkucardinal Novice

                ATS -

                 

                 

                As far as I know we are not using EtherChannel and we have things set up to route based on the originating virtual port ID.  For failover detection we have it set to link status only, notify switches yes, and failback yes.
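For anyone comparing the two teaming policies mentioned in this thread, here is a rough sketch of why they place different demands on the physical switch. The functions and the modulo/XOR arithmetic are a simplified model chosen for illustration, not VMware's actual implementation, and the uplink names and IP addresses are made up.

```python
import ipaddress


def uplink_by_port_id(virtual_port_id, uplinks):
    # Simplified model: each virtual port is pinned to one uplink, so a
    # VM's MAC address only ever shows up on a single physical switch port.
    return uplinks[virtual_port_id % len(uplinks)]


def uplink_by_ip_hash(src_ip, dst_ip, uplinks):
    # Simplified model: the uplink depends on source AND destination IP,
    # so the same VM can send from different uplinks per destination.
    src = int(ipaddress.ip_address(src_ip))
    dst = int(ipaddress.ip_address(dst_ip))
    return uplinks[(src ^ dst) % len(uplinks)]


uplinks = ["vmnic0", "vmnic1"]

# Port ID policy: one VM, one uplink, regardless of who it talks to.
print(uplink_by_port_id(7, uplinks))

# IP hash policy: the same VM (10.0.0.5) lands on different uplinks.
print(uplink_by_ip_hash("10.0.0.5", "10.0.0.20", uplinks))
print(uplink_by_ip_hash("10.0.0.5", "10.0.0.21", uplinks))
```

Because IP hash lets one MAC appear on several uplinks, the switch ports need to be bundled in a matching port channel. With the port ID policy the opposite holds, and a port channel on those ports is a misconfiguration.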

                 

                As for teaming, we have all 4 iSCSI connections as Active and then on each port we have each listed as active with the other 3 being unused.

                 

                Can a loss of network connectivity be tied to the SAN switch when the VM maintains a good connection with the SAN?  The VM itself never stops running.

                • 5. Re: Random virtual machines lose network after vMotion or Storage vMotion
                  Mouhamad Expert

                  Hello,

                   

                  The issue you're facing is not on the VMware side. I believe you have Cisco switches. What's happening is that when you are vMotioning from one host to another, the VMs will lose access to one NIC and try to access the new host's NICs. The problem is that the ARP tables on your switches are not getting cleared immediately.

                   

                  This issue occurs when you have an old Cisco IOS; update the IOS and you will be fine.

                   

                  Good luck..

                   

                  Regards,

                  • 6. Re: Random virtual machines lose network after vMotion or Storage vMotion
                    ats0401 Enthusiast

                    Can you post a port config from the Cisco switch for an iSCSI-configured port and a network port?

                    Also did you check the switch logs?

                    I think you need to rule out the physical switches first

                    perhaps console into the switch and start a vmotion of a few VM's and see if any errors pop up on the switch console

                    • 7. Re: Random virtual machines lose network after vMotion or Storage vMotion
                      wkucardinal Novice

                      Mouhamad -

                       

                      What is the earliest IOS you would recommend on the Cisco switches?

                       

                      We have:

                       

                      Catalyst 3750E's for our iSCSI Connections

                      Catalyst 6500 for our core networking switches

                       

                      Our DMZ switches:

                       

                      Catalyst 2950's

                       

                       

                       

                      Would your explanation also explain why this issue occurs when doing either vMotion or Storage vMotion?

                      • 8. Re: Random virtual machines lose network after vMotion or Storage vMotion
                        wkucardinal Novice

                        ATS -

                         

                        We are going to test the ARP flush procedure first.  I cannot post a port config but if you have specific questions I would be glad to tell you what we have set up.  We are going to watch the logs while we vMotion as well to see if it shows anything.  I think you guys are on to something with the ARP cache. This would also explain why we sometimes have VMs lose network overnight - we are using DRS so they probably have migrated.  I can't believe I did not think of that before!
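One way to test the DRS theory is to line up migration times against the times a VM was seen to lose network. The sketch below assumes you have already collected both as timestamps by whatever means you have available. The function name, the five-minute window and all sample data are made up for illustration.

```python
from datetime import datetime, timedelta


def outages_after_migration(migrations, outages, window=timedelta(minutes=5)):
    """Return (vm, moved_at, lost_at) for each outage that began within
    `window` after a migration of the same VM."""
    matched = []
    for vm, lost_at in outages:
        for moved_vm, moved_at in migrations:
            delay = lost_at - moved_at
            if moved_vm == vm and timedelta(0) <= delay <= window:
                matched.append((vm, moved_at, lost_at))
                break
    return matched


# Made-up sample data, for illustration only.
migrations = [
    ("vm-app01", datetime(2011, 1, 1, 2, 14)),
    ("vm-db02", datetime(2011, 1, 1, 3, 40)),
]
outages = [
    ("vm-app01", datetime(2011, 1, 1, 2, 15)),  # one minute after its migration
    ("vm-web03", datetime(2011, 1, 1, 4, 0)),   # no migration recorded
]

hits = outages_after_migration(migrations, outages)
for vm, moved_at, lost_at in hits:
    print(vm, "migrated", moved_at, "lost network", lost_at)
```

If most overnight outages match a migration, that points at the migration path rather than at something spontaneous in the guest.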

                        • 9. Re: Random virtual machines lose network after vMotion or Storage vMotion
                          ats0401 Enthusiast

                          I agree with Mouhamad, it is most likely the physical switches that are the culprit. VMware and EqualLogic are usually too stable to have problems like this.

                          Unless it's a really old IOS though, I think it would most likely be a configuration problem within the physical switches.

                          Are you doing the basic stuff? (enable spanning tree portfast, MTU 9000\flow control receive on iSCSI ports, etc)

                          no port channel settings on anything, correct?

                          • 10. Re: Random virtual machines lose network after vMotion or Storage vMotion
                            wkucardinal Novice

                            EqualLogic says not to use STP, we could use RSTP but we do not have that on.  Our IOS is about 5 revisions behind (September 2009).  Our MTU is at 9000 end-to-end.  Unicast storm control is on.
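As a side note on verifying "MTU 9000 end-to-end": a common check is a ping with the don't-fragment bit set and the largest payload that fits in one frame. The sketch below only does the arithmetic and builds the command line. The target address is a placeholder from the documentation range, and the helper names are made up.

```python
IP_HEADER_BYTES = 20   # IPv4 header without options
ICMP_HEADER_BYTES = 8  # ICMP echo header


def max_icmp_payload(mtu):
    # Largest ICMP echo payload that fits in one unfragmented IPv4 packet.
    return mtu - IP_HEADER_BYTES - ICMP_HEADER_BYTES


def vmkping_command(target_ip, mtu=9000):
    # -d sets the don't-fragment bit, -s sets the payload size in bytes.
    return ["vmkping", "-d", "-s", str(max_icmp_payload(mtu)), target_ip]


# 192.0.2.10 is a placeholder address, not one from this environment.
print(" ".join(vmkping_command("192.0.2.10")))
# prints: vmkping -d -s 8972 192.0.2.10
```

If the 8972-byte ping fails while a 1472-byte one works, some hop in the path is not passing jumbo frames.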

                             

                            We do have some port channel settings on our production switches, but they are not on the DMZ switches, so I don't think that's the problem.

                            • 11. Re: Random virtual machines lose network after vMotion or Storage vMotion
                              ats0401 Enthusiast

                              So have you disabled spanning tree on the switches? Usually you have to be extremely careful when disabling STP on the switch, not a good idea.

                              spanning-tree portfast tells the port to bypass all the STP modes and go straight to forwarding.

                              so verify if STP is indeed disabled; if not I recommend adding spanning-tree portfast or portfast trunk to each ESX port.

                               

                               

                              Are ANY of the physical nics from the ESX servers connected to a port on the switch that is part of a port channel group? If so this will cause exactly the type of problems you are describing. Make sure all ports for ESX servers have no port-channel mode on or any type of port channel setting.

                              • 12. Re: Random virtual machines lose network after vMotion or Storage vMotion
                                wkucardinal Novice

                                We have disabled spanning tree only on the switches that pass our iSCSI traffic (and only iSCSI traffic) and this is something I will send to our network guys to consider enabling.

                                 

                                So, let's simplify this a little bit, because we could be talking about network switches controlling our business network in our dmz or production cluster, or network switches that only handle iSCSI traffic to our SAN.  Would there be a reason that misconfiguration on the iSCSI physical switch would cause a specific VM to lose network connectivity after a vMotion or Storage vMotion?  I am trying to make sense of this in my head but it seems like there would be no connection.  The only thing I can figure is that when you vmotion a host for some reason the iSCSI switch could be thinking the VM is residing on the wrong host, but why would this break network connectivity?  Wouldn't it crash the VM itself?

                                 

                                I keep wanting to single out the iSCSI switch because it is the common denominator.  It is also running on very old IOS (12.2.35SE5 from 2007).

                                • 13. Re: Random virtual machines lose network after vMotion or Storage vMotion
                                  ats0401 Enthusiast

                                  I think it has nothing to do with the iSCSI switches at all. The VM would blue screen if it lost storage connectivity.

                                  I think it's 100% to do with the production switch (6500 if I understand the setup correctly).

                                  My suggestion is to get your LAN administrator to sit with you and monitor the switch console and error logs in real time and throw one of the hosts into maintenance mode (that should trigger a bunch of vMotions).

                                   

                                  If you are using VLANS and trunking the port config should look something like this

                                  (this is vmware and cisco recommendation)

                                  • interface GigabitEthernet1/1
                                  • description VMware ESX server 1 NIC 1
                                  • switchport trunk encapsulation dot1q
                                  • switchport trunk allowed vlan 100,200,300
                                  • switchport mode trunk
                                  • switchport nonegotiate
                                  • spanning-tree portfast trunk
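If there are many ESX-facing ports to review, the points raised in this thread can be checked mechanically against pasted interface stanzas. This is a minimal sketch, not an official tool: the function name and the three rules are assumptions drawn from the advice in this thread, and the "bad" stanza is invented for the example.

```python
def check_esx_port(config_text):
    """Return a list of problems found in one Cisco interface stanza."""
    lines = [line.strip() for line in config_text.strip().splitlines()]
    problems = []
    if not any(l.startswith("spanning-tree portfast") for l in lines):
        problems.append("spanning-tree portfast is not set")
    if any(l.startswith("channel-group") for l in lines):
        problems.append("port is a member of a port channel")
    if "switchport mode trunk" in lines and "switchport nonegotiate" not in lines:
        problems.append("trunk port is missing 'switchport nonegotiate'")
    return problems


# The stanza recommended above.
good = """
interface GigabitEthernet1/1
 description VMware ESX server 1 NIC 1
 switchport trunk encapsulation dot1q
 switchport trunk allowed vlan 100,200,300
 switchport mode trunk
 switchport nonegotiate
 spanning-tree portfast trunk
"""

# An invented stanza with the mistakes discussed in this thread.
bad = """
interface GigabitEthernet1/2
 switchport mode trunk
 channel-group 1 mode on
"""

print(check_esx_port(good))
for problem in check_esx_port(bad):
    print(problem)
```

The recommended stanza yields an empty list, while the invented one is flagged for all three rules.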

                                   

                                  Did you verify there is no port channel applied to any ESX cisco ports?

                                  • 14. Re: Random virtual machines lose network after vMotion or Storage vMotion
                                    wkucardinal Novice

                                    There are no port channels for any of the ports.  I have scheduled some time with one of our network guys tomorrow to see what we can find.  Thanks for your help trying to sort this out.  I’ll post back results tomorrow.
