We've got 4 ESX 3.0 boxes in two different datacenters. We can vmotion a VM from either of our two ESX boxes in our primary datacenter to either ESX box in the standby, and we only lose a packet or two. However, when we vmotion the machine back, we lose connectivity for 5 minutes (almost exactly 5 minutes each time). Obviously, we span our network between the two datacenters.
The network engineer can see the MAC address move from interface to interface between the two buildings in the same manner regardless of the direction the VM is moving. According to him, all the switch/router interfaces are updated and the "network" knows that the MAC address has moved properly. Yet, for some reason, the VM cannot send or receive data outside its own VLAN for 5 minutes after moving back to the primary datacenter.
We ran tcpdump inside the VM and can see that it is still receiving IP and ARP broadcast traffic (that would be local VLAN traffic), but established TCP and ICMP traffic to hosts outside the VLAN is gone for 5 minutes.
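To pin down the "almost exactly 5 minutes" figure, the outage can be timed from timestamped reachability probes. The following is only a rough sketch: the function name `outage_windows` and the sample format (a timestamp in seconds plus a success flag per probe) are assumptions made for illustration, and how the probes are collected (for example, pinging the VM from a host outside its VLAN) is left out.

```python
from typing import Iterable, List, Tuple


def outage_windows(samples: Iterable[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Return (start, end) pairs, one per run of failed probes.

    `samples` is a time-ordered sequence of (timestamp_seconds, probe_succeeded).
    An outage starts at the first failed probe and ends at the next
    successful probe. An outage still open when the samples run out is
    closed at the last timestamp seen.
    """
    windows: List[Tuple[float, float]] = []
    start = None      # timestamp of the first failed probe in the current outage
    last_ts = None    # most recent timestamp seen
    for ts, ok in samples:
        if not ok and start is None:
            start = ts                      # outage begins
        elif ok and start is not None:
            windows.append((start, ts))     # outage ends
            start = None
        last_ts = ts
    if start is not None:
        windows.append((start, last_ts))    # outage never recovered in the data
    return windows


if __name__ == "__main__":
    # Hypothetical probe log: reachable, then unreachable until t=301.
    probes = [(0.0, True), (1.0, False), (2.0, False), (301.0, True), (302.0, True)]
    for begin, end in outage_windows(probes):
        print(f"outage from t={begin} to t={end} ({end - begin:.0f} s)")
```

Comparing the measured window against the timers configured on the switches and routers (ARP timeout, MAC aging, and so on) may help narrow down which table is holding the stale entry.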
Here's another REALLY ODD thing: when the network engineer puts a sniffer on the ESX port with port mirroring, the vmotion works perfectly, with no outage. If the port mirror is off (no sniffer) when the VM is moving, we lose connectivity. If we turn the mirror/sniffer on during the connectivity loss, all connectivity is immediately restored.
The network is Cisco; we're doing 802.1q VLAN tagging and 802.3ad teaming, with ESX configured to use IP-hash.
At this point I'm fairly certain this is a network issue, but it still doesn't make sense to us. Anyone out there seen anything remotely like this or have advice?
Update: we physically unplugged one of the teamed ports on each node in the two datacenters to try to rule out the adapter teaming IP-hash/MAC-hash issues. That didn't help. Traffic was OK prior to vmotion and died for 5 minutes after.
It would appear that this is a problem in the Cisco switch/routers. Anyone out there seen this with Cisco gear?
Message was edited by:
mcallistera