I am troubleshooting an environment that generates unicast flooding during vMotion operations. The environment isn't exactly best practice (the vMotion IP, Management IP, and virtual machine networks all share the same logical subnet). This will be corrected, but I'm trying to understand how and why the unicast flooding occurs.
We have Dell blades (M1000e) in a Dell chassis with multiple blade center switches. Each of these switches uplinks to Cisco gear. When we vMotion between a host in the blade center and an external host, things work OK for a few minutes, but then unicast flooding begins. If I send an ARP request for the vMotion IP (the target IP) from an uninvolved host, the corresponding MAC address seems to be added back to the Cisco's CAM table, at which point the unicast flooding stops.
Based on my observations, the MAC address is present in the Dynamic Address List on the Dell switch (its equivalent of the CAM table). So why is the Cisco expiring it (presumably after 600 seconds have passed)?
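For reference, the aging timer and the specific CAM entry can be checked on the Cisco side (IOS syntax; older releases use `show mac-address-table` with a hyphen, and the MAC below is the vMotion vmknic's address from my captures, rewritten in Cisco's dotted format):

```
! Show the configured MAC aging time (the IOS default is 300 seconds)
show mac address-table aging-time

! Check whether the vMotion vmknic's MAC is currently in the CAM table
show mac address-table address 0050.5674.3028
```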
I encountered this post which asserts:
Make sure you have all your vkernel ports on separate subnets e.g. separate vmotion/management/iscsi. Failure to do this can cause lots of flooding during vmotion as the physical switch does not learn the MAC address for the vmotion port correctly. And continuously broadcasts to find it.
Is this true? Is it a known bug? Is there any documentation describing this? My theory was that the ESXi server acting as the vMotion source still has the destination IP in its ARP table. That entry hasn't expired when the Cisco ages the MAC address out of its CAM table, so the source keeps transmitting without sending a new ARP request. The Cisco no longer knows where to send the frames, so it unicast floods.
Am I way off in the weeds here?
I have noticed an interesting phenomenon after doing some packet captures that likely explains what's going on.
We have two ports defined on the "target" ESXi 4.1 server (the server the VM is being vMotioned to): a Management Network port and a vMotion port.
Both of these "ports" are on the same vmnic on the same subnet.
My packet capture indicates the following vMotion packets from the source:
20:07:33.351805 a4:ba:db:2d:3e:9a > 00:50:56:74:30:28, ethertype IPv4 (0x0800), length 1514: IP 10.49.5.155.59601 > 10.49.2.49.8000: . 45982380:45983828(1448) ack 1 win 4163 <nop,nop,timestamp 416800893 3266226>
20:07:33.351809 a4:ba:db:2d:3e:9a > 00:50:56:74:30:28, ethertype IPv4 (0x0800), length 1514: IP 10.49.5.155.59601 > 10.49.2.49.8000: . 45983828:45985276(1448) ack 1 win 4163 <nop,nop,timestamp 416800893 3266226>
These frames show the destination MAC I would expect to see (00:50:56:74:30:28). However, when observing the TCP ACKs:
20:13:33.202985 00:22:19:94:88:f8 > a4:ba:db:2d:3e:9a, ethertype IPv4 (0x0800), length 66: 10.49.2.49.irdmi > 10.49.5.155.62878: . ack 503189697 win 34390 <nop,nop,timestamp 3302215 416836876>
20:13:33.202986 00:22:19:94:88:f8 > a4:ba:db:2d:3e:9a, ethertype IPv4 (0x0800), length 66: 10.49.2.49.irdmi > 10.49.5.155.62878: . ack 503192593 win 34209 <nop,nop,timestamp 3302215 416836876>
You can clearly see that the source MAC is now 00:22:19:94:88:f8 -- the MAC address bound to the Management Network interface (10.49.2.33).
So why in the world don't the ACKs carry the vMotion vmknic's MAC address? This is undoubtedly why the correct MAC ages out of the CAM table, and the true source of our unicast flooding...
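For anyone who wants to check for the same condition on their own hosts, the vmkernel NIC and routing configuration can be listed from the ESXi 4.x console. With management and vMotion on the same subnet, both vmk interfaces show up with the same connected route, which is the ambiguity behind the mismatched source MAC:

```
# List all vmkernel NICs with their port group, IP, netmask, and MAC address
esxcfg-vmknic -l

# List the vmkernel routing table -- look for two vmk interfaces
# sharing one connected subnet
esxcfg-route -l
```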
We are currently experiencing the same problem with vMotion on ESXi 4.1 U1; this problem did not exist on ESX 4.0 Update 2. (We are currently migrating from ESX to ESXi.)
I was kind of shocked to find that the vMotion port was the cause of the unicast flooding we experienced.
Did you find a "solution" to this, other than having separate dedicated NICs for management and for vMotion?
It seems more people are experiencing this with ESXi 4.1: http://serverfault.com/questions/197918/clearing-arp-cache-on-esxi-4-1
We also are experiencing severe network disruptions during vMotions involving ESXi 4.1 U1 hosts.
Is there a documented bug for this in ESXi 4.1 U1?
Please let us know the results of the case. Our migration from ESX 3.5 to ESXi 4.1 U1 is coming up soon, and I had not planned to use a separate subnet for vMotion.
VMware support helped resolve the issue with zero downtime, and our config was brought in line with best-practice recommendations.
What I thought would be a large network topology change involving downtime turned out to be a live virtual-networking config change.
I recommended they create a KB article for the solution as well.
As someone mentioned before, to solve this problem and avoid this type of difficulty in the future, you have to re-design your IP space: separate the VMkernel IP range from the Management IP range.
After that you can check connectivity between vMotion VMkernel ports with the command
# vmkping x.x.x.x (where x.x.x.x is the VMkernel IP address of another ESXi host)
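As a sketch of the re-IP itself, the vMotion vmknic can be deleted and re-created live from the ESXi 4.x console. The port group name "VMotion" and the 10.10.10.0/24 addresses below are examples only; substitute your own names and subnet:

```
# Remove the old vMotion vmknic from the shared subnet
esxcfg-vmknic -d "VMotion"

# Re-create it on the dedicated vMotion subnet
esxcfg-vmknic -a -i 10.10.10.5 -n 255.255.255.0 "VMotion"

# Verify reachability of the other host's vMotion vmknic
vmkping 10.10.10.6
```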
To clarify: this is not a problem with ESX 4.0 or 4.1 as such. The issue can occur after you migrate from ESX Classic to ESXi. In ESX Classic, the management interface is owned by the COS kernel, while the other vmknic ports (for example vMotion and iSCSI) are owned by the VMkernel. In that case, management traffic (vswif on ESX Classic) can be in the same IP subnet as any of the vmkernel NICs, since they belong to different kernels.
Once migrated to ESXi, the management network becomes another vmknic, and therefore collides with an existing vmknic's IP subnet.
Does anyone know if this affects ESXi 4.1 Update 2?
According to this
Right now my Management and vMotion are like this (2 hosts)
Management - 192.168.23.240
vMotion - 192.168.23.241
Management - 192.168.23.242
vMotion - 192.168.23.243
So I should create the vMotion ports on 10.10.10.x, for instance?
vMotion host 1 : 10.10.10.5
vMotion host 2 : 10.10.10.6
and keep the same gateway, 192.168.23.1?
Or would something like this be enough?
vMotion host 1 : 192.168.30.5
vMotion host 2 : 192.168.30.6
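Since the only real requirement is that the management and vMotion vmknics not share a subnet, a quick sanity check (assuming /24 masks, which matches the addressing above) is just comparing the network portions of the addresses:

```shell
#!/bin/sh
# Compare the /24 network portion of two IPv4 addresses.
# ${1%.*} strips the last octet, leaving the first three.
same_subnet24() {
    if [ "${1%.*}" = "${2%.*}" ]; then echo same; else echo different; fi
}

same_subnet24 192.168.23.240 192.168.23.241   # prints "same" (the colliding layout)
same_subnet24 192.168.23.242 192.168.30.6    # prints "different" (relocated vMotion)
```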
Let me know, guys.
Thanks a lot!
So 192.168.22.x is enough... I will try that and let you know.
What I see in the blog link is that the guy is using 10.10.10.x, and the initial subnet was 17.x.x.x.
All that's needed is another subnet; any class (A, B, or C) will do.