VMware Cloud Community
virtualg_uk
Leadership

ARP works, ping does not between 2 VMs on different hosts

At seemingly random times, random VMs stop being able to communicate with each other.

Example

VM1 on host 1

VM2 on host 2

VM1 cannot ping VM2

VM1 sees an ARP entry in its ARP cache for VM2

VM2 does NOT see an ARP entry for VM1 in its cache

VM1 is able to ping other VMs on the same network

VM2 is able to ping other VMs on the same network

VM1 is able to ping VM2 if they are on the same host

Cleared the ARP cache on VM1 and the entry was removed successfully. Pinged VM2 again and got no response; however, the ARP entry was back in the cache as expected

To confirm, it is not just ping: a range of ports that should be working is affected as well.

No firewalls in the mix either

Also worth noting that this happens when the VMs are on different blades within the same chassis or on different chassis
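
The ARP-cache checks described above can be scripted so that both sides are compared the same way each time. This is only a minimal sketch: it assumes Linux guests where `ip neigh show` is available (the guest OS is not stated in this thread), and the sample output and addresses below are made up for illustration.

```python
def parse_ip_neigh(text):
    """Map IP -> (mac or None, state) from `ip neigh show` output."""
    entries = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        ip = fields[0]
        mac = None
        if "lladdr" in fields:
            idx = fields.index("lladdr")
            if idx + 1 < len(fields):
                mac = fields[idx + 1]
        # The neighbour state (REACHABLE, STALE, FAILED, ...) is the last field.
        entries[ip] = (mac, fields[-1])
    return entries


def has_resolved_entry(entries, ip):
    """True if the cache holds a MAC for `ip` in a usable state."""
    mac, state = entries.get(ip, (None, "MISSING"))
    return mac is not None and state not in ("FAILED", "INCOMPLETE")


# Hypothetical output, shaped like the symptoms in this post:
# VM1 (10.0.0.11) has resolved VM2 (10.0.0.12), but VM2 has not resolved VM1.
VM1_OUTPUT = """10.0.0.12 dev eth0 lladdr 00:50:56:aa:bb:02 STALE
10.0.0.1 dev eth0 lladdr 00:50:56:aa:bb:fe REACHABLE
"""
VM2_OUTPUT = """10.0.0.1 dev eth0 lladdr 00:50:56:aa:bb:fe REACHABLE
10.0.0.11 dev eth0  INCOMPLETE
"""

vm1_sees_vm2 = has_resolved_entry(parse_ip_neigh(VM1_OUTPUT), "10.0.0.12")
vm2_sees_vm1 = has_resolved_entry(parse_ip_neigh(VM2_OUTPUT), "10.0.0.11")
print("VM1 has ARP entry for VM2:", vm1_sees_vm2)
print("VM2 has ARP entry for VM1:", vm2_sees_vm1)
```

On Windows guests the equivalent information comes from `arp -a`, which has a different layout and would need its own parser.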

Versions

ESXi and vCenter are both 6.5 Update 1

Compute hardware is Dell FX2 Chassis with FC430 blades and dual FN2210 IO Modules

This happened after migrating VMs from older 5.5 Update 3 ESXi rack mount servers

VM hardware versions and VMware Tools have not yet been upgraded (not tested)

Any advice is appreciated


Graham | User Moderator | https://virtualg.uk
14 Replies
iopsGent
Enthusiast

Any IP duplication?

virtualg_uk
Leadership

None :(


Graham | User Moderator | https://virtualg.uk
admin
Immortal

I had the same problem on an HP DL360 G9. After I upgraded the firmware and drivers, the problem was resolved. I would recommend updating the firmware and VMware Tools first.

If you still have the same problem, please let me know.

virtualg_uk
Leadership

The blades and chassis are running the latest version. I have not checked the uplink switches but since this is happening within the chassis too I would not expect that they are part of the problem?


Graham | User Moderator | https://virtualg.uk
a_p_
Leadership

I've seen similar issues where port security was enabled on the physical switch ports, which limited the number of allowed MAC addresses per port.

André

SureshKumarMuth
Commander

In that case the VM should not be able to ping any of the other VMs. Here the issue is only between two VMs, while other VMs are pingable. Correct me if I am wrong.

Regards,
Suresh
https://vconnectit.wordpress.com/
virtualg_uk
Leadership

The problem is from one VM to another VM.

VM1 and VM2 can ping other VMs on the network

If I vMotion VM2 to the same host that VM1 is using then the problem goes away, but if I vMotion VM2 to the host it was on before, the problem comes back

It happens on some VMs randomly, even VMs on different networks so I cannot see anything in common


Graham | User Moderator | https://virtualg.uk
virtualg_uk
Leadership

I'll take a look at that one and get back to you. Cheers


Graham | User Moderator | https://virtualg.uk
virtualg_uk
Leadership

That feature is disabled. Any other ideas?!


Graham | User Moderator | https://virtualg.uk
SureshKumarMuth
Commander

Try a packet capture to see exactly where the traffic gets dropped. That may give some clue.
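
One way to organise such captures is to take them at each hop along the path and record whether the test packet shows up there; the first hop where it is missing brackets the drop. The sketch below assumes the captures were already taken by hand (for example with `pktcap-uw` on the hosts and a capture tool inside the guests). The capture-point names and the results are hypothetical, not data from this environment.

```python
def first_drop_point(observations):
    """Given ordered (capture_point, packet_seen) pairs along the path,
    return (last_point_seen, first_point_missing), or None if the packet
    was seen at every capture point."""
    previous = None
    for point, seen in observations:
        if not seen:
            return previous, point
        previous = point
    return None


# Hypothetical results for an ICMP echo request from VM1 to VM2.
SAMPLE_PATH = [
    ("VM1 vNIC", True),
    ("host 1 vSwitch port", True),
    ("host 1 uplink", True),
    ("host 2 uplink", False),
    ("host 2 vSwitch port", False),
    ("VM2 vNIC", False),
]

result = first_drop_point(SAMPLE_PATH)
if result is None:
    print("Packet seen at every capture point")
else:
    print("Last seen at %r, first missing at %r" % result)
```

Running the same check for the echo reply in the opposite direction shows whether the drop is one-way.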

Regards,
Suresh
https://vconnectit.wordpress.com/
admin
Immortal

Go into Edit Settings and look at the MAC addresses. My guess is that the MAC address is the same for all the VMs. You can shut down the machines and change them to static MAC addresses. You could also delete the NICs one at a time, then add new NICs. Remember to reconfigure the IP addresses on the box and delete the ghost NICs after doing that.
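
With more than a handful of VMs, comparing MAC addresses by eye is error-prone. A minimal sketch of the check, assuming the VM names and MACs have already been exported into a list by whatever means is convenient; the inventory below is made up for illustration.

```python
from collections import defaultdict


def find_duplicate_macs(inventory):
    """Return {mac: [vm names]} for every MAC used by more than one VM.
    MACs are normalised to lowercase with ':' separators before comparing."""
    by_mac = defaultdict(list)
    for vm_name, mac in inventory:
        by_mac[mac.lower().replace("-", ":")].append(vm_name)
    return {mac: names for mac, names in by_mac.items() if len(names) > 1}


# Hypothetical inventory: vm1 and vm2 share a MAC written in two styles.
SAMPLE_INVENTORY = [
    ("vm1", "00:50:56:AA:BB:01"),
    ("vm2", "00-50-56-aa-bb-01"),
    ("vm3", "00:50:56:aa:bb:03"),
]

duplicates = find_duplicate_macs(SAMPLE_INVENTORY)
print(duplicates or "No duplicate MACs found")
```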

virtualg_uk
Leadership

MACs are all different


Graham | User Moderator | https://virtualg.uk
bspagna89
Hot Shot

Hi,

I've actually had this exact same issue running Dell FX2 chassis/blades. My config was slightly different: I had FC830s and pass-through modules instead of the FN switches. Anyway, the issue was none of that but actually the NIC inside the blade. Can you confirm what NICs your blades are using? We were using Intel X710 NICs and went through months' worth of support calls and random issues... One of those issues happens to be what you're describing here. I spent several weeks troubleshooting this and it wasn't something I could easily reproduce.

Here's how I understood it: when you migrate VM1 to another host, the network is notified that it has moved (gARP, I believe?). Servers know that VM1 has moved, including the VMs on the source host that it came from. The issue is that when VM2 on the source host tries to communicate with the recently migrated VM1, the host NICs try to do some magic behind the scenes, and that's where the problem is. VMware support described the traffic as being black-holed because the source host still thinks VM1 resides there. This can be very confusing and takes a lot of effort to prove with packet captures etc. Let me know if you're having trouble following what I wrote here.
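
To make the black-hole scenario concrete, here is a toy model of it. It is only an illustration of the explanation above, not a description of how the X710 or ESXi actually tracks addresses: the `ToyHost` class, its two tables, and the `cleanup` flag are all invented for this sketch.

```python
class ToyHost:
    """Toy host whose NIC keeps its own idea of which MACs are local."""

    def __init__(self, name):
        self.name = name
        self.believed_local = set()  # MACs the NIC/host thinks are local
        self.really_local = set()    # MACs of VMs actually running here

    def power_on(self, mac):
        self.believed_local.add(mac)
        self.really_local.add(mac)

    def migrate_away(self, mac, cleanup):
        self.really_local.discard(mac)
        # cleanup=False models the suspected fault: a stale entry is left behind.
        if cleanup:
            self.believed_local.discard(mac)

    def forward(self, dst_mac):
        if dst_mac in self.believed_local:
            if dst_mac in self.really_local:
                return "delivered locally"
            return "black-holed"  # kept on the host, but nobody is there
        return "sent to uplink"


def simulate(cleanup):
    """VM1 and VM2 start on host1, VM1 migrates to host2, then VM2
    (still on host1) sends a frame to VM1. Returns what host1 does with it."""
    vm1, vm2 = "00:50:56:aa:bb:01", "00:50:56:aa:bb:02"
    host1, host2 = ToyHost("host1"), ToyHost("host2")
    host1.power_on(vm1)
    host1.power_on(vm2)
    host1.migrate_away(vm1, cleanup)
    host2.power_on(vm1)
    return host1.forward(vm1)


print("With cleanup:   ", simulate(cleanup=True))
print("Without cleanup:", simulate(cleanup=False))
```

In this model the fault stays invisible while both VMs share a host, which is consistent with the vMotion behaviour reported earlier in the thread.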

Unfortunately, the only viable solution for my client was to swap these out for different brand NICs.

Hope this helps.. Private message me if you need more info.

-Brian

New blog - https://virtualizeme.org/
TheHevy
Contributor
