VMware Cloud Community
Sevich
Contributor

vMotion momentary packet loss with Nexus 7000 switches

I know there are countless posts on this subject but I haven't come across the smoking gun for my issue.

We are running ESXi 4.1, and our hosts are cabled directly to a Nexus 7000 switch with 1Gb copper blades. Since we moved to the Nexus 7000, we have been seeing extended ping loss during vMotion. On our old Cisco switches the migration was effectively immediate, with only one ping lost. Now the guest is unreachable for anywhere from 5 to 20 seconds during vMotion. The VM does come back after that, but a 5-20 second outage is a problem, especially since we used to lose only one ping.

Our network engineer watched the switch while I did a vMotion. As soon as the VM stopped responding to pings, he checked the switch, and as far as the switch was concerned the VM's MAC was still on the port for the old host. After the 5-20 second outage, the MAC moved to the port for the new host and everything was fine. We opened a case with Cisco and did some packet captures. We are not seeing a gratuitous ARP request sent to the switch, which Cisco seems to expect. Other posts on the community suggest that because this is purely layer 2, ARP does not come into play. I'm not a network engineer, so I'm not completely familiar with how the ARP and CAM tables work, but I wondered whether VMware has to send anything to the switches at the time of vMotion to tell them the IP/MAC has moved, or how that is handled.
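
For anyone else unfamiliar with the CAM table side of this, here is a minimal sketch of how a layer 2 switch learns where a MAC lives. It is a toy model only, not NX-OS behavior: the class name, port names, and MAC address are made up. The assumption it illustrates is that a switch relearns a MAC from the source address of frames it receives, so traffic keeps going to the old port until some frame sourced from the VM's MAC (for example, a notification frame sent by the new host) arrives on the new port.

```python
class LearningSwitch:
    """Toy model of a layer 2 switch MAC (CAM) table. Hypothetical, not NX-OS."""

    def __init__(self):
        self.mac_table = {}  # MAC address -> port name

    def receive(self, src_mac, in_port):
        # The switch learns (or relearns) a MAC from the source address
        # of any frame that arrives on a port.
        self.mac_table[src_mac] = in_port

    def forward_port(self, dst_mac):
        # Known MAC: the learned port. Unknown MAC: None (a real switch floods).
        return self.mac_table.get(dst_mac)


VM_MAC = "00:50:56:aa:bb:cc"  # made-up address for illustration

switch = LearningSwitch()

# The VM sends traffic while running on the old host (port Eth1/1).
switch.receive(VM_MAC, "Eth1/1")
before = switch.forward_port(VM_MAC)

# vMotion completes, but no frame sourced from the VM's MAC has reached
# the switch on the new port yet, so the table entry is stale.
during = switch.forward_port(VM_MAC)

# A frame sourced from the VM's MAC arrives on the new host's port (Eth1/2),
# and the switch moves the entry.
switch.receive(VM_MAC, "Eth1/2")
after = switch.forward_port(VM_MAC)

print(before, during, after)
```

In this model the outage lasts exactly as long as it takes for the first frame with the VM's source MAC to be seen on the new port, which matches what we observed on the switch.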

I wondered if anyone out there with our type of config is having the same issue. My network engineers and I suspect that NX-OS might still be a little buggy, and we wouldn't be surprised if the problem is on the Cisco side. Also, if anyone has some tests I could try, that would be appreciated.
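
One way to put a number on the outage when comparing tests is to log each ping with a timestamp and compute the longest gap. The sketch below assumes the ping results have already been collected as (timestamp, replied) pairs. The function name and sample data are hypothetical, and it does not send any pings itself.

```python
def longest_outage(samples):
    """Return the longest outage in seconds.

    samples: list of (timestamp_seconds, replied) tuples in time order.
    An outage is measured from the first missed probe to the next
    successful one. An outage still in progress at the end is ignored.
    """
    longest = 0.0
    outage_start = None
    for ts, replied in samples:
        if not replied and outage_start is None:
            outage_start = ts  # first missed probe of a new outage
        elif replied and outage_start is not None:
            longest = max(longest, ts - outage_start)
            outage_start = None
    return longest


# Made-up sample: one probe per second, replies lost at t=1 and t=2.
samples = [(0.0, True), (1.0, False), (2.0, False), (3.0, True), (4.0, True)]
print(longest_outage(samples))
```

Running this against captures taken before and after a change (switch code, port settings, and so on) gives a consistent figure to compare instead of eyeballing the ping window.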

Thanks in advance.

3 Replies
Sevich
Contributor

In case anyone else comes across this issue: we recently upgraded NX-OS to the latest code and all is fine now, so it must have been something on the Cisco side.

wkucardinal
Contributor

How old was your code before you upgraded?

Sevich
Contributor

I can't remember the exact revisions, but we usually stay on top of current releases, so I would say the code we were running was only a few revs behind.
