Migrating a VM from ESX 2.5.4 to ESX 3.0.1 using VMotion and DMotion causes some, but not all, other servers (physical or virtual) on the same subnet to lose the ability to ping the migrated VM.
What we notice:
tcpdump on the server sending the ping shows the ARP request for the migrated VM going out, but a reply never comes back.
tcpdump on the migrated VM shows no ARP request arriving from the server sending the ping.
A ping -b of the subnet's broadcast address followed by arp -a lists only 10-20 other servers on the subnet; normally that sequence lists every other server on the subnet.
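For reference, the observations above can be reproduced with commands along these lines. This is only a sketch: the interface name and addresses are placeholders for your environment, and the function echoes the commands rather than running them, since they need root and a live network.

```shell
# Dry-run sketch of the diagnostic sequence described above.
# 'eth0', the broadcast address and the VM IP are placeholders.
diagnose() {
  iface="$1"; bcast="$2"; vm_ip="$3"
  # Watch ARP traffic on the wire to/from the migrated VM.
  echo tcpdump -n -e -i "$iface" arp and host "$vm_ip"
  # Broadcast ping, then dump the resulting ARP cache.
  echo ping -b -c 3 "$bcast"
  echo arp -a
}
diagnose eth0 192.168.210.255 192.168.210.13
```

Running tcpdump on both ends at once (the pinging server and the migrated VM) is what shows the request leaving one host and never reaching the other.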
If we ping from the migrated VM back to the server attempting the ping, the problem resolves; however, it is not practical for us to determine which IP addresses cannot reach a migrated VM. The problem also resolves when the migrated VM is migrated again to another 3.0.1 host.
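Since pinging each affected server individually is impractical, one option is to have the migrated VM announce itself to the whole subnet with a gratuitous ARP, so every neighbour (and the switch) relearns its MAC at once. A sketch using iputils arping, again as a dry run that echoes the command; 'eth0' and the IP are placeholders:

```shell
# Sketch: broadcast a gratuitous ARP from inside the migrated VM so
# all neighbours refresh their ARP caches in one step, rather than
# pinging each affected server. Dry run: the command is echoed.
announce() {
  iface="$1"; ip="$2"
  # -U = unsolicited (gratuitous) ARP, -c 3 = three announcements,
  # -I = interface to send on (iputils arping)
  echo arping -U -c 3 -I "$iface" "$ip"
}
announce eth0 192.168.210.13
```

This would be run on the VM itself after migration, which is roughly what a second migration achieves as a side effect.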
We are using HP ProCurve 4104GL switches, with VLAN tagging and NIC teaming.
We had exactly this issue, but when VMotioning a VM on ESX 2.5.x.
Does vmkping work ok?
Yes, vmkping works, although there is a pause before it responds:
vmkping -D
PING 192.168.210.13 (192.168.210.13): 56 data bytes
64 bytes from 192.168.210.13: icmp_seq=0 ttl=64 time=0.136 ms
64 bytes from 192.168.210.13: icmp_seq=1 ttl=64 time=0.068 ms
64 bytes from 192.168.210.13: icmp_seq=2 ttl=64 time=0.066 ms
--- 192.168.210.13 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.066/0.090/0.136 ms
PING 192.168.210.13 (192.168.210.13): 56 data bytes
64 bytes from 192.168.210.13: icmp_seq=0 ttl=64 time=0.066 ms
64 bytes from 192.168.210.13: icmp_seq=1 ttl=64 time=0.067 ms
64 bytes from 192.168.210.13: icmp_seq=2 ttl=64 time=0.066 ms
--- 192.168.210.13 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.066/0.066/0.067 ms
VMotion does not appear to be the root of this issue, since a cold migration produces the same problem.
This turned out to be a switch problem. We rebooted the switch and the problem went away.