One of my customers has a VxRail system with 8 nodes (6.5 U1 7388607), and one of those nodes has a strange issue where it cannot get a valid ARP entry for 2 hosts in its own subnet. It can for the other 18 or so hosts in that subnet. One of the two that fail is the default gateway, which caused the node to lose contact with vCenter (which lives in another VLAN).
vmk2 is the management interface. vmk0 is used for a VxRail internal thing so ignore that for this discussion.
To be perfectly clear: this node's ARP table has entries for every host in the same subnet except 2: the default GW's address and one particular VM. The other 7 nodes have no problems at all. The problem started a week ago, out of the blue. These 8 nodes have been running perfectly fine for more than a year (same versions the entire time).
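For anyone who wants to repeat this check across all 8 nodes, here is a rough sketch of how it could be automated. It assumes you have saved the text output of "esxcli network ip neighbor list" from a host, and that each data row starts with the neighbor IP followed by its MAC address. That column layout is my assumption, as are the function names and the sample addresses, so adjust the parsing to whatever your build actually prints.

```python
import ipaddress
import re

# Matches a colon-separated MAC address such as 00:50:56:aa:bb:cc.
MAC_RE = re.compile(r"^([0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}$")


def resolved_neighbors(neighbor_output):
    """Map neighbor IP -> MAC for every row that has a usable MAC.

    Assumes (hypothetically) that each data row begins with
    "<ip> <mac> ..."; header and separator rows are skipped because
    their first field is not a valid IP address.
    """
    resolved = {}
    for line in neighbor_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            ip = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # header, separator, or anything else that is not a row
        mac = fields[1].lower()
        if MAC_RE.match(mac) and mac != "00:00:00:00:00:00":
            resolved[str(ip)] = mac
    return resolved


def missing_neighbors(neighbor_output, expected_ips):
    """Return the expected IPs that have no resolved entry, in input order."""
    resolved = resolved_neighbors(neighbor_output)
    return [ip for ip in expected_ips
            if str(ipaddress.ip_address(ip)) not in resolved]


# Made-up sample output; the real table will differ.
SAMPLE = """\
Neighbor    Mac Address        Vmknic  Expire Seconds  State  Type
----------  -----------------  ------  --------------  -----  -------
10.0.0.11   00:50:56:aa:bb:01  vmk2    1100 sec               Unknown
10.0.0.12   00:50:56:aa:bb:02  vmk2    900 sec                Unknown
"""

if __name__ == "__main__":
    # 10.0.0.1 stands in for the default gateway here.
    print(missing_neighbors(SAMPLE, ["10.0.0.1", "10.0.0.11", "10.0.0.12"]))
```

Running the same expected-IP list against the output of a healthy node and the broken node makes the difference between them obvious at a glance.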
To make troubleshooting easier, I took vmnic1 down (the other interface in the pair using "Route based on originating virtual port"). So "esxcli network nic list" now shows that only vmnic0 is active (swapping the active interfaces does not help).
I've done traces with pktcap-uw. If I sniff at the vmk2 level, I see ARP broadcasts going out ("who has <ip of default gw>, tell <vmk2 ip>" etc.), but the sniffer on the network does not see those packets reaching the physical network at all.
So when I tell pktcap-uw to sniff at the uplink level, no such packets are going out. They get "lost" somewhere between vmk2 and vmnic0 (is the DvSwitch eating them?...)
In other words, those ARP request packets are not leaving the host at all. They leave vmk2 but have disappeared before hitting the wire.
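To make that comparison less of an eyeball exercise, here is a minimal sketch that lists which ARP request targets show up in the vmk2 capture but never in the uplink capture. It assumes both captures were saved to files in classic pcap format (not pcapng) with Ethernet framing; the file names in the usage comment and the function names are hypothetical. It only skips a single 802.1Q tag, which should be enough for frames captured on a trunked uplink.

```python
import socket
import struct


def arp_request_targets(pcap_bytes):
    """Return the set of target IPs of all ARP requests in a classic pcap."""
    magic = pcap_bytes[:4]
    if magic in (b"\xd4\xc3\xb2\xa1", b"\x4d\x3c\xb2\xa1"):
        endian = "<"  # little-endian file (microsecond or nanosecond stamps)
    elif magic in (b"\xa1\xb2\xc3\xd4", b"\xa1\xb2\x3c\x4d"):
        endian = ">"  # big-endian file
    else:
        raise ValueError("not a classic pcap file")

    targets = set()
    offset = 24  # the global header is 24 bytes
    while offset + 16 <= len(pcap_bytes):
        # Per-record header: ts_sec, ts_frac, captured length, original length.
        _sec, _frac, incl_len, _orig_len = struct.unpack_from(
            endian + "IIII", pcap_bytes, offset)
        offset += 16
        frame = pcap_bytes[offset:offset + incl_len]
        offset += incl_len

        if len(frame) < 14:
            continue
        ethertype = struct.unpack_from("!H", frame, 12)[0]
        payload = 14
        if ethertype == 0x8100:  # one 802.1Q VLAN tag: skip 4 bytes
            if len(frame) < 18:
                continue
            ethertype = struct.unpack_from("!H", frame, 16)[0]
            payload = 18
        # ARP is EtherType 0x0806; an IPv4-over-Ethernet ARP body is 28 bytes.
        if ethertype != 0x0806 or len(frame) < payload + 28:
            continue
        oper = struct.unpack_from("!H", frame, payload + 6)[0]
        if oper == 1:  # 1 = request ("who has")
            targets.add(socket.inet_ntoa(frame[payload + 24:payload + 28]))
    return targets


def lost_arp_targets(vmk_pcap, uplink_pcap):
    """Targets requested at the vmk level but never seen at the uplink."""
    return sorted(arp_request_targets(vmk_pcap) - arp_request_targets(uplink_pcap))


# Usage with hypothetical file names:
#   with open("vmk2.pcap", "rb") as a, open("vmnic0.pcap", "rb") as b:
#       print(lost_arp_targets(a.read(), b.read()))
```

If the result contains exactly the two problem addresses and nothing else, that supports the theory that the frames are dropped inside the host rather than on the physical network.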
The corresponding physical network ports on the switches are configured as trunk ports. All looks good from that side.
The problem is the same as "issue 1" in this blog post: ESXi 6.5 host cannot resolve specific MAC-address when on trunk port with Intel quadport X710 nic | ...
For the blogger, updating the NIC driver resolved it. He has a very different NIC, but it's the general idea that counts.
These VxRail nodes are currently using the IXGBEN driver v1.5.3 with firmware 0x800008d3, and updating the driver to 1.6.5 does not resolve the issue (I have not tried 1.7.x yet).
Again, all other ESXi hosts are fine, and this host started doing this a week ago. It came out of nowhere. Reboots and power-cycles have not helped. Only this one particular host, which is identical to its 7 fellow cluster members, cannot get a complete ARP entry for only 2 out of 20 or so hosts in the same subnet (same L2).
As the ARP packets for these 2 hosts do not even leave the host, I cannot blame the network team this time 😉