rayvd
Enthusiast

vMotion causing Unicast Flooding

I am troubleshooting an environment that generates unicast flooding during vMotion.  The environment isn't exactly best practice (the vMotion IP, Management IP and Virtual Machine Networks are all on the same logical subnet).  This will be corrected, but I'm trying to understand how and why the unicast flooding occurs.

We have Dell blades (M1000e) in a Dell chassis with multiple blade center switches.  Each of these switches uplinks to Cisco gear.  When we do a vMotion between a host in the Blade Center and an external host, things work OK for a few minutes, but then unicast flooding begins.  I can send out an ARP request for the vMotion IP (target IP) from a non-involved host -- this seems to add the corresponding MAC address back to the Cisco's CAM table at which point the unicast flooding stops.

Based on my observations, the MAC address is present in the Dynamic Address List on the Dell switch (equivalent of CAM table).  So why is the Cisco expiring it (presumably after 600 seconds have passed)?

I encountered this post which asserts:

Make sure you have all your vkernel ports on separate subnets e.g.  separate vmotion/management/iscsi. Failure to do this can cause lots of flooding during vmotion as the  physical switch does not learn the MAC address for the vmotion port  correctly. And continuously broadcasts to find it.

Is this true?  Is it a known bug?  Is there any documentation describing it?  My theory is that the ESXi server acting as the vMotion source still has the destination IP in its ARP table.  That entry hasn't expired when the Cisco prunes the MAC address from its CAM table, so the host keeps transmitting without sending an ARP request.  The Cisco no longer knows where to send the frames, so it unicast floods them.
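The theory above can be sketched as a toy simulation. This is only a rough model, not how any particular switch is implemented: the Switch class, the simulate function, the port numbers and the one-frame-per-second traffic pattern are all made up for illustration, and the 600-second aging value is the one presumed earlier in this post. The MAC addresses are the ones from the packet captures later in the thread.

```python
from dataclasses import dataclass, field

VMOTION_MAC = "00:50:56:74:30:28"  # vMotion vmknic on the target host
MGMT_MAC = "00:22:19:94:88:f8"     # management vmknic on the target host
SOURCE_MAC = "a4:ba:db:2d:3e:9a"   # vMotion source host


@dataclass
class Switch:
    """Toy learning switch: remembers the port a source MAC was last seen on."""
    aging_seconds: int
    cam: dict = field(default_factory=dict)  # MAC -> (port, last_seen)

    def forward(self, now, src_mac, dst_mac, in_port):
        # Learn (or refresh) the source MAC on the ingress port.
        self.cam[src_mac] = (in_port, now)
        entry = self.cam.get(dst_mac)
        # An entry only counts if it was refreshed within the aging window.
        if entry is not None and now - entry[1] <= self.aging_seconds:
            return f"port {entry[0]}"
        return "flood"


def simulate(reply_mac, duration=900, aging=600):
    """Return the first second at which data frames are flooded, or None."""
    sw = Switch(aging_seconds=aging)
    # t=0: the target answers an ARP request, so its vMotion MAC is learned.
    sw.forward(0, VMOTION_MAC, SOURCE_MAC, in_port=2)
    for t in range(1, duration + 1):
        # The source keeps sending to the vMotion MAC from its ARP cache.
        if sw.forward(t, SOURCE_MAC, VMOTION_MAC, in_port=1) == "flood":
            return t
        # The target replies; only reply_mac gets refreshed in the CAM table.
        sw.forward(t, reply_mac, SOURCE_MAC, in_port=2)
    return None


print("replies from vMotion MAC   :", simulate(VMOTION_MAC))
print("replies from management MAC:", simulate(MGMT_MAC))
```

In this model the flooding starts only when nothing ever refreshes the destination MAC as a source address, which would match the "works OK for a few minutes" behaviour.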

Am I way off in the weeds here?

Thanks!

11 Replies
rayvd
Enthusiast

I have noticed an interesting phenomenon after doing some packet captures that likely explains what's going on.

We have two ports defined on the "target" ESXi 4.1 server (the server where the VM is being vMotioned to):

  • vMotion: 10.49.2.49 (00:50:56:74:30:28) [vMotion Enabled]
  • Management Network: 10.49.2.33 (00:22:19:94:88:f8) [Management Traffic Enabled]

Both of these "ports" are on the same vmnic on the same subnet.

My packet capture indicates the following vMotion packets from the source:


20:07:33.351805 a4:ba:db:2d:3e:9a > 00:50:56:74:30:28, ethertype IPv4 (0x0800), length 1514: IP 10.49.5.155.59601 > 10.49.2.49.8000: . 45982380:45983828(1448) ack 1 win 4163 <nop,nop,timestamp 416800893 3266226>
20:07:33.351809 a4:ba:db:2d:3e:9a > 00:50:56:74:30:28, ethertype IPv4 (0x0800), length 1514: IP 10.49.5.155.59601 > 10.49.2.49.8000: . 45983828:45985276(1448) ack 1 win 4163 <nop,nop,timestamp 416800893 3266226>

This is the destination MAC I would expect to see (00:50:56:74:30:28).  However, when observing the TCP ACKs:

20:13:33.202985 00:22:19:94:88:f8 > a4:ba:db:2d:3e:9a, ethertype IPv4 (0x0800), length 66: 10.49.2.49.irdmi > 10.49.5.155.62878: . ack 503189697 win 34390 <nop,nop,timestamp 3302215 416836876>
20:13:33.202986 00:22:19:94:88:f8 > a4:ba:db:2d:3e:9a, ethertype IPv4 (0x0800), length 66: 10.49.2.49.irdmi > 10.49.5.155.62878: . ack 503192593 win 34209 <nop,nop,timestamp 3302215 416836876>

You can clearly see that the source MAC is now 00:22:19:94:88:f8 -- the MAC address bound to the Management Network (10.49.2.33).

So why in the world don't the ACKs have the correct MAC address?  This is undoubtedly why the CAM table drops the correct MAC, and the true source of our unicast flooding...
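For anyone who wants to check their own capture for the same symptom, here is a rough sketch in Python. It assumes tcpdump -e style output like the lines above; the regular expression and the function names (mac_usage, asymmetric_ips) are my own and only handle this exact line shape. It reports any IP whose frames are addressed to one MAC but sent from another. This is only meaningful when both hosts sit on the same layer 2 segment, since routed traffic would show the gateway's MAC instead.

```python
import re
from collections import defaultdict

# Two frames copied from the captures above, one per direction.
CAPTURE = [
    "20:07:33.351805 a4:ba:db:2d:3e:9a > 00:50:56:74:30:28, ethertype IPv4 (0x0800), "
    "length 1514: IP 10.49.5.155.59601 > 10.49.2.49.8000: . 45982380:45983828(1448) "
    "ack 1 win 4163 <nop,nop,timestamp 416800893 3266226>",
    "20:13:33.202985 00:22:19:94:88:f8 > a4:ba:db:2d:3e:9a, ethertype IPv4 (0x0800), "
    "length 66: 10.49.2.49.irdmi > 10.49.5.155.62878: . ack 503189697 win 34390 "
    "<nop,nop,timestamp 3302215 416836876>",
]

# Matches the Ethernet and IPv4 endpoints; the "IP " token is optional because
# the two captures above differ in that respect. Ports may be numbers or names.
LINE_RE = re.compile(
    r"^\S+ (?P<src_mac>[0-9a-f:]{17}) > (?P<dst_mac>[0-9a-f:]{17}), "
    r"ethertype IPv4 \(0x0800\), length \d+: (?:IP )?"
    r"(?P<src_ip>\d+\.\d+\.\d+\.\d+)\.\S+ > (?P<dst_ip>\d+\.\d+\.\d+\.\d+)\.\S+:"
)


def mac_usage(lines):
    """Collect, per IP, the MACs it sends from and the MACs it is addressed by."""
    sent_from = defaultdict(set)
    addressed_to = defaultdict(set)
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip lines that are not in the expected format
        sent_from[m["src_ip"]].add(m["src_mac"])
        addressed_to[m["dst_ip"]].add(m["dst_mac"])
    return sent_from, addressed_to


def asymmetric_ips(lines):
    """Return {ip: (macs_addressed_to, macs_sent_from)} where the two differ."""
    sent_from, addressed_to = mac_usage(lines)
    return {
        ip: (sorted(addressed_to[ip]), sorted(sent_from[ip]))
        for ip in sent_from.keys() & addressed_to.keys()
        if sent_from[ip] != addressed_to[ip]
    }


for ip, (rx, tx) in asymmetric_ips(CAPTURE).items():
    print(f"{ip}: addressed to {rx}, but sends from {tx}")
```

Run against the two lines above, only the vMotion IP is reported; the source host uses the same MAC in both directions.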

mastrboy
Contributor

We are currently experiencing the same problem with vMotion on ESXi 4.1 U1; this problem did not exist on ESX 4.0 Update 2. (We are currently migrating from ESX to ESXi.)

I was kind of shocked to find that the vmotion port was the cause of the unicast flooding we experienced.

Did you find a "solution" to this, other than separating management and vMotion onto dedicated NICs?

It seems more people are experiencing this with ESXi 4.1: http://serverfault.com/questions/197918/clearing-arp-cache-on-esxi-4-1

mastrboy
Contributor

A workaround that I found is setting "switchport block unicast" on our Cisco switches, but I can't consider this a good solution.

fletch00
Enthusiast

We also are experiencing severe network disruptions during vMotions involving ESXi 4.1 U1 hosts.

Is there a documented bug for this in ESXi 4.1 U1?

thanks,

http://vmadmin.info

VCP5 VSP5 VTSP5 vExpert http://vmadmin.info
fletch00
Enthusiast

Just opened a case after reading this thread:

Wrong MAC address used for replies during vMotion

http://communities.vmware.com/thread/307130

Walfordr
Expert

fletch00,

Please let us know the results of the case.  I have our migration from ESX 3.5 to ESXi 4.1 U1 coming up soon and did not plan to use a different subnet for vMotion.

Thanks,

Robert

Robert -- BSIT, VCP3/VCP4, A+, MCP (Wow I haven't updated my profile since 4.1 days) -- Please consider awarding points for "helpful" and/or "correct" answers.
fletch00
Enthusiast

VMware support helped resolve the issue with zero downtime, and our config was brought in line with best practice recommendations.

What I thought would be a large network topology change involving downtime turned out to be a live virtual networking config change:

http://www.vmadmin.info/2011/04/vmotion-unicast-flood-esxi.html

I recommended that they create a KB article for the solution as well.

admin
Immortal

Hi,

As someone mentioned before, to solve this problem and avoid similar difficulties in future you have to re-design your IP space: you have to separate the VMkernel IP range from the Management IP range.

After that you can check connectivity between the vMotion VMkernel ports with the command

    # vmkping x.x.x.x   - where x.x.x.x is the VMkernel IP address of the other ESX host

To clarify: this is not a problem with ESX 4.0 or 4.1 as such. The issue can occur after you migrate from ESX Classic to ESXi. In ESX Classic the management interface is owned by the COS kernel, while the other vmknic ports (for example vMotion and iSCSI) are owned by the VMkernel. In that case management traffic (vswif on ESX Classic) can be in the same IP subnet as any of the VMkernel NICs, since they belong to different kernels.

Once migrated to ESXi, the management network becomes just another vmknic, and therefore collides with an existing vmknic's IP subnet.
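To illustrate the collision, here is a deliberately simplified Python model. It is a sketch under assumptions, not the actual VMkernel code: it assumes a single routing table in which the first vmknic configured for a subnet becomes the egress interface for that subnet. The names vmk0/vmk1, the /21 prefix (chosen only so that the addresses posted earlier fall in one subnet) and the 10.50.0.x addresses are hypothetical. The MAC and IP addresses of the two ports are the ones from the earlier packet captures.

```python
import ipaddress


def build_routes(vmknics):
    """One connected route per subnet; the first vmknic claiming it wins."""
    routes = {}
    for name, cidr, mac in vmknics:
        network = ipaddress.ip_interface(cidr).network
        routes.setdefault(network, (name, mac))  # later vmknics do not override
    return routes


def egress(routes, dst_ip):
    """Return (vmknic, source MAC) used to reach dst_ip, or None if no route."""
    dst = ipaddress.ip_address(dst_ip)
    for network, chosen in routes.items():
        if dst in network:
            return chosen
    return None


# Same subnet: management is configured first, so it owns the route.
shared = build_routes([
    ("vmk0", "10.49.2.33/21", "00:22:19:94:88:f8"),  # Management Network
    ("vmk1", "10.49.2.49/21", "00:50:56:74:30:28"),  # vMotion
])
print(egress(shared, "10.49.5.155"))  # replies leave with the management MAC

# Separate subnets (hypothetical vMotion range): each vmknic has its own route.
separate = build_routes([
    ("vmk0", "10.49.2.33/21", "00:22:19:94:88:f8"),
    ("vmk1", "10.50.0.49/24", "00:50:56:74:30:28"),
])
print(egress(separate, "10.50.0.155"))  # replies leave with the vMotion MAC
```

In this model the replies to vMotion traffic leave through the management vmknic whenever both share a subnet, which is consistent with the ACKs seen in the capture earlier in the thread.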

kopper27
Hot Shot

Does anyone know whether this affects ESXi 4.1 Update 2?

http://www.vmadmin.info/2011/04/vmotion-unicast-flood-esxi.html

According to this

Right now my Management and vMotion are like this (2 hosts)

Host 1

Management - 192.168.23.240

vMotion - 192.168.23.241

Gateway 192.168.23.1

Host 2

Management - 192.168.23.242

vMotion - 192.168.23.243

Gateway 192.168.23.1

So should I create a vMotion network on 10.10.10.x, for instance?

vMotion host 1 : 10.10.10.5

vMotion host 2 : 10.10.10.6

and same Gateway 192.168.23.1

or

something like this might be enough?

vMotion host 1 : 192.168.30.5

vMotion host 2 : 192.168.30.6

Let me know guys

thanks a lot

admin
Immortal

Hi,

If you are using a 24-bit subnet mask (255.255.255.0), you do not need to change the whole address range; simply use 192.168.22.241.
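A quick way to double-check this is Python's standard ipaddress module. The helper name same_subnet is made up for this sketch; the addresses are the ones posted above. Note that the answer depends entirely on the mask: with a wider mask the same two addresses would still share a subnet.

```python
import ipaddress


def same_subnet(ip_a, ip_b, prefix=24):
    """True if both addresses fall in the same network for the given prefix."""
    net_a = ipaddress.ip_interface(f"{ip_a}/{prefix}").network
    net_b = ipaddress.ip_interface(f"{ip_b}/{prefix}").network
    return net_a == net_b


management = "192.168.23.240"

# Current layout: vMotion shares the management subnet.
print(same_subnet(management, "192.168.23.241"))

# Suggested layout: different third octet, so a different /24.
print(same_subnet(management, "192.168.22.241"))

# The same pair under a 16-bit mask would still collide.
print(same_subnet(management, "192.168.22.241", prefix=16))
```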

Cheers,

kopper27
Hot Shot

Got it.

So 192.168.22.x is enough. I will try that and let you know.

What I see in the blog link is that the author is using 10.10.10.x, and the initial address was 17.x.x.x.

So all that's needed is another subnet; it can be in the same class (A, B or C).
