11 Replies Latest reply on Jun 14, 2012 1:14 AM by kopper27

    vMotion causing Unicast Flooding

    rayvd Enthusiast

      I am troubleshooting an environment generating unicast flooding during vMotion'ing.  The environment isn't exactly best practice (vMotion IP, Management IP and Virtual Machine Networks are all on the same logical subnet).  This will be corrected, but I'm trying to understand how and why the unicast flooding is occurring.

       

      We have Dell blades (M1000e) in a Dell chassis with multiple blade center switches.  Each of these switches uplinks to Cisco gear.  When we do a vMotion between a host in the Blade Center and an external host, things work OK for a few minutes, but then unicast flooding begins.  I can send out an ARP request for the vMotion IP (target IP) from a non-involved host -- this seems to add the corresponding MAC address back to the Cisco's CAM table at which point the unicast flooding stops.

       

      Based on my observations, the MAC address is present in the Dynamic Address List on the Dell switch (equivalent of CAM table).  So why is the Cisco expiring it (presumably after 600 seconds have passed)?

       

      I encountered this post which asserts:

       

      Make sure you have all your vkernel ports on separate subnets e.g.  separate vmotion/management/iscsi. Failure to do this can cause lots of flooding during vmotion as the  physical switch does not learn the MAC address for the vmotion port  correctly. And continuously broadcasts to find it.

       

      Is this true?  Known bug?  Is there any documentation describing this?  My theory is that the ESXi server acting as the vMotion source still has the destination IP in its ARP table when the Cisco prunes the MAC address from its CAM table, so it keeps transmitting without sending an ARP request.  The Cisco no longer knows which port to send the frames to, so it unicast floods.
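To make the theory above concrete, here is a minimal sketch of a learning switch with an aging CAM table. It is a toy model, not Cisco's implementation: the `Switch` class, the port numbers, and the timestamps are all hypothetical, and the 600-second aging time is simply the value presumed in the post.

```python
class Switch:
    """Toy L2 switch: learns source MACs, ages them out, floods unknown destinations."""

    def __init__(self, aging_seconds):
        self.aging_seconds = aging_seconds
        self.cam = {}  # MAC -> (port, time the MAC was last seen as a SOURCE)

    def forward(self, now, src_mac, dst_mac, in_port):
        # Learning only ever happens on the source MAC of a received frame.
        self.cam[src_mac] = (in_port, now)
        entry = self.cam.get(dst_mac)
        if entry is None or now - entry[1] > self.aging_seconds:
            self.cam.pop(dst_mac, None)  # entry is missing or has aged out
            return "flood"
        return f"port {entry[0]}"


SENDER = "a4:ba:db:2d:3e:9a"    # vMotion source host (MAC from the thread)
VMOTION = "00:50:56:74:30:28"   # vMotion MAC on the target host

sw = Switch(aging_seconds=600)  # assumed aging time
# The target's vMotion MAC is seen as a source exactly once (e.g. an ARP reply).
sw.forward(0, VMOTION, SENDER, in_port=2)
# The sender then keeps transmitting from its cached ARP entry without re-ARPing.
results = {t: sw.forward(t, SENDER, VMOTION, in_port=1) for t in (10, 300, 700)}
print(results)
```

In this model the frames at 10 and 300 seconds are switched to a single port, while the frame at 700 seconds is flooded, because nothing has refreshed the destination's CAM entry in the meantime.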

       

      Am I way off in the weeds here?

       

      Thanks!

        • 1. Re: vMotion causing Unicast Flooding
          rayvd Enthusiast

          I have noticed an interesting phenomenon after doing some packet captures that likely explains what's going on.

           

          We have two ports defined on the "target" ESXi 4.1 server (the server where the VM is being vMotioned to):

           

          • vMotion: 10.49.2.49 (00:50:56:74:30:28) [vMotion Enabled]
          • Management Network: 10.49.2.33 (00:22:19:94:88:f8) [Management Traffic Enabled]

           

          Both of these "ports" are on the same vmnic on the same subnet.

           

          My packet capture indicates the following vMotion packets from the source:


          20:07:33.351805 a4:ba:db:2d:3e:9a > 00:50:56:74:30:28, ethertype IPv4 (0x0800), length 1514: IP 10.49.5.155.59601 > 10.49.2.49.8000: . 45982380:45983828(1448) ack 1 win 4163 <nop,nop,timestamp 416800893 3266226>
          20:07:33.351809 a4:ba:db:2d:3e:9a > 00:50:56:74:30:28, ethertype IPv4 (0x0800), length 1514: IP 10.49.5.155.59601 > 10.49.2.49.8000: . 45983828:45985276(1448) ack 1 win 4163 <nop,nop,timestamp 416800893 3266226>
          

           

          This is the destination MAC I would expect to see (00:50:56:74:30:28).  However, when observing the TCP ACK's:

           

          20:13:33.202985 00:22:19:94:88:f8 > a4:ba:db:2d:3e:9a, ethertype IPv4 (0x0800), length 66: 10.49.2.49.irdmi > 10.49.5.155.62878: . ack 503189697 win 34390 <nop,nop,timestamp 3302215 416836876>
          20:13:33.202986 00:22:19:94:88:f8 > a4:ba:db:2d:3e:9a, ethertype IPv4 (0x0800), length 66: 10.49.2.49.irdmi > 10.49.5.155.62878: . ack 503192593 win 34209 <nop,nop,timestamp 3302215 416836876>
          

           

          You can clearly see that the source MAC is now 00:22:19:94:88:f8 -- the MAC address bound to the Management Network (10.49.2.33).

           

          So why in the world don't the ACKs have the correct MAC address?  Since the switch never sees a frame sourced from the vMotion MAC, this is very likely why the CAM table ages out the correct MAC, and the true source of our unicast flooding...
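The asymmetry in the captures above can also be spotted mechanically. The following is a rough sketch, assuming `tcpdump -e` style lines like the ones quoted in this post (trimmed here for length); the regular expression and the helper names `mac_usage` and `asymmetric` are my own and not part of any tool.

```python
import re
from collections import defaultdict

# Matches "<time> <src mac> > <dst mac>, ...: [IP ]<src ip>.<port> > <dst ip>.<port>:"
LINE = re.compile(
    r"^\S+ (?P<smac>[0-9a-f:]{17}) > (?P<dmac>[0-9a-f:]{17}), .*?: "
    r"(?:IP )?(?P<sip>\d+\.\d+\.\d+\.\d+)\.\S+ > (?P<dip>\d+\.\d+\.\d+\.\d+)\.\S+:"
)


def mac_usage(lines):
    """Collect, per IP, the MACs it is addressed with (rx) and the MACs it sends from (tx)."""
    rx, tx = defaultdict(set), defaultdict(set)
    for line in lines:
        m = LINE.match(line)
        if m is None:
            continue  # skip anything that does not look like a capture line
        tx[m["sip"]].add(m["smac"])
        rx[m["dip"]].add(m["dmac"])
    return rx, tx


def asymmetric(rx, tx):
    """Return the IPs whose receiving MACs differ from their sending MACs."""
    return {ip: (rx[ip], tx[ip]) for ip in rx.keys() & tx.keys() if rx[ip] != tx[ip]}


capture = [
    "20:07:33.351805 a4:ba:db:2d:3e:9a > 00:50:56:74:30:28, ethertype IPv4 (0x0800), "
    "length 1514: IP 10.49.5.155.59601 > 10.49.2.49.8000: . 45982380:45983828(1448) ack 1 win 4163",
    "20:13:33.202985 00:22:19:94:88:f8 > a4:ba:db:2d:3e:9a, ethertype IPv4 (0x0800), "
    "length 66: 10.49.2.49.irdmi > 10.49.5.155.62878: . ack 503189697 win 34390",
]

result = asymmetric(*mac_usage(capture))
for ip, (received_on, sent_from) in result.items():
    print(ip, "is addressed at", received_on, "but replies from", sent_from)
```

Only 10.49.2.49 is reported: it receives on the vMotion MAC but answers from the management MAC, so the switch has no traffic from which to relearn the vMotion MAC.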

          • 2. Re: vMotion causing Unicast Flooding
            mastrboy Novice

            We are currently experiencing the same problem with vMotion and ESXi 4.1 U1. This problem did not exist on ESX 4.0 Update 2. (We are currently migrating from ESX to ESXi.)

             

            I was kind of shocked to find that the vmotion port was the cause of the unicast flooding we experienced.

             

            Did you find a "solution" to this, other than separating management and vMotion onto dedicated NICs?

             

            It seems more people are experiencing this with ESXi 4.1: http://serverfault.com/questions/197918/clearing-arp-cache-on-esxi-4-1

            • 3. Re: vMotion causing Unicast Flooding
              mastrboy Novice

              A workaround that I found is setting "switchport block unicast" on our Cisco switches, but I can't consider this a good solution.

              • 4. Re: vMotion causing Unicast Flooding
                fletch00 Hot Shot

                We also are experiencing severe network disruptions during vMotions involving ESXi 4.1 U1 hosts.

                Is there a documented BUG for this for ESXi 4.1 U1?

                 

                thanks,

                http://vmadmin.info

                • 6. Re: vMotion causing Unicast Flooding
                  Rob-SSE Expert

                  fletch00,

                   

                  Please let us know the results of the case.  I have our migration from ESX 3.5 to ESXi 4.1 U1 coming up soon and did not plan to use a different subnet for vMotion.

                   

                  Thanks,

                   

                  Robert

                  • 7. Re: vMotion causing Unicast Flooding
                    fletch00 Hot Shot

                    VMware support helped resolve the issue with zero downtime, and our config was brought in line with best practice recommendations.

                     

                    What I thought would be a large network topology change involving downtime turned out to be a live virtual networking config change:

                     

                    http://www.vmadmin.info/2011/04/vmotion-unicast-flood-esxi.html

                     

                    I recommended they create a KB for the solution as well

                    • 8. Re: vMotion causing Unicast Flooding
                      Lurker

                      Hi,

                       

                      As someone mentioned before, to solve this problem and avoid similar difficulties in the future you have to redesign your IP space: the vMotion VMkernel IP must be in a different subnet from the Management IP range.

                       

                       

                      After that you can check connectivity between the vMotion VMkernel ports with the command

                       

                       

                          # vmkping x.x.x.x   - where x.x.x.x is the VMkernel IP address of the other ESX host

                       

                       

                      To clarify: this is not a problem specific to ESX 4.0 or 4.1. The issue can occur after you migrate from ESX Classic to ESXi. In ESX Classic the management interface is owned by the COS kernel, while the other vmknic ports (for example vMotion and iSCSI) are owned by the VMkernel. In that case management traffic (vswif on ESX Classic) can be in the same IP subnet as any of the VMkernel NICs, since they belong to different kernels.

                       

                       

                      Once migrated to ESXi, the management network becomes just another vmknic and would therefore collide with an existing vmknic's IP subnet.
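One way to picture the collision is as an egress-interface lookup. The sketch below is a deliberately simplified model, assuming the host sends from the first vmknic whose subnet contains the destination; that rule, the `vmk0`/`vmk1` names, the /16 prefix length, and the 10.50.x.x addresses are all assumptions for illustration, since the thread gives neither the netmask nor the actual VMkernel routing logic. The MAC and IP addresses of the shared-subnet case are the ones from the packet capture earlier in the thread.

```python
import ipaddress
from typing import NamedTuple, Optional


class Vmknic(NamedTuple):
    name: str
    mac: str
    iface: ipaddress.IPv4Interface


def egress(vmknics, dst: str) -> Optional[Vmknic]:
    """Pick the first vmknic whose subnet contains dst (assumed selection rule)."""
    dst_ip = ipaddress.ip_address(dst)
    for nic in vmknics:
        if dst_ip in nic.iface.network:
            return nic
    return None


# Management and vMotion on the same subnet (prefix length is an assumption).
shared_subnet = [
    Vmknic("vmk0", "00:22:19:94:88:f8", ipaddress.ip_interface("10.49.2.33/16")),  # management
    Vmknic("vmk1", "00:50:56:74:30:28", ipaddress.ip_interface("10.49.2.49/16")),  # vMotion
]
shared = egress(shared_subnet, "10.49.5.155")

# Same host with vMotion moved to its own (hypothetical) subnet.
separate_subnets = [
    Vmknic("vmk0", "00:22:19:94:88:f8", ipaddress.ip_interface("10.49.2.33/16")),
    Vmknic("vmk1", "00:50:56:74:30:28", ipaddress.ip_interface("10.50.2.49/16")),
]
separated = egress(separate_subnets, "10.50.5.155")

print("shared subnet   ->", shared.name, shared.mac)
print("separate subnet ->", separated.name, separated.mac)
```

Under this model the replies in the shared-subnet case leave with the management MAC, which matches the ACKs seen in the capture, while a dedicated vMotion subnet leaves only one candidate interface.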

                      • 9. Re: vMotion causing Unicast Flooding
                        kopper27 Expert

                        Does anyone know if this affects ESXi 4.1 Update 2?

                         

                        http://www.vmadmin.info/2011/04/vmotion-unicast-flood-esxi.html

                         

                        According to this

                         

                        Right now my Management and vMotion are like this (2 hosts)

                         

                        Host 1

                        Management - 192.168.23.240

                        vMotion - 192.168.23.241

                        Gateway 192.168.23.1

                         

                        Host 2

                        Management - 192.168.23.242

                        vMotion - 192.168.23.243

                        Gateway 192.168.23.1

                         

                         

                        So should I create a vMotion network on 10.10.10.x, for instance?

                         

                        vMotion host 1 : 10.10.10.5

                        vMotion host 2 : 10.10.10.6

                         

                        and same Gateway 192.168.23.1

                         

                        or would something like this be enough?

                         

                        vMotion host 1 : 192.168.30.5

                        vMotion host 2 : 192.168.30.6

                         

                        Let me know guys

                        thanks a lot

                        • 10. Re: vMotion causing Unicast Flooding
                          Lurker

                          Hi,

                           

                          If you are using a 24-bit subnet mask (255.255.255.0), you do not need to move to a different IP class; simply use 192.168.22.241.
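A quick way to check this advice is Python's standard `ipaddress` module. The sketch below tests the addresses proposed earlier in the thread against the management network, assuming the /24 mask mentioned above; the /22 comparison at the end is a hypothetical wider mask, included only to show why the mask matters.

```python
import ipaddress

# Management interface of host 1, with the assumed 255.255.255.0 mask.
mgmt = ipaddress.ip_interface("192.168.23.240/24")

candidates = ["192.168.23.241", "192.168.22.241", "192.168.30.5", "10.10.10.5"]
same_subnet = {ip: ipaddress.ip_address(ip) in mgmt.network for ip in candidates}

for ip, shared in same_subnet.items():
    verdict = "same subnet as management" if shared else "separate subnet"
    print(f"{ip:15} {verdict}")

# With a hypothetical wider mask (/22) the neighbouring range is no longer separate.
wide = ipaddress.ip_interface("192.168.23.240/22")
collides_with_wide_mask = ipaddress.ip_address("192.168.22.241") in wide.network
print("192.168.22.241 inside", wide.network, "->", collides_with_wide_mask)
```

Only 192.168.23.241, the current vMotion address, shares the management subnet; any of the other three candidates would separate the traffic as long as the mask really is /24.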

                           

                          Cheers,

                          • 11. Re: vMotion causing Unicast Flooding
                            kopper27 Expert

                            got it

                             

                            so 192.168.22.x is enough.... I will try that and let you know

                             

                            What I see in the blog post is that the author uses 10.10.10.x, while the original network was 17.x.x.x.

                             

                            So all that's needed is a different subnet; it can be in the same class (A, B, or C).