Ever since deploying Windows Server 2016 and 2019 VMs in our VMware ESXi 6.7/6.5 environment, we've been having random connectivity issues with a number of these VMs. It's very bizarre. Basically, services stop functioning (e.g. anything relying on SQL connectivity or file share access from another VM). In some cases the destination VM is a 2012 R2 VM, but it always seems to affect only these 2016/2019 source machines. Often these VMs are on the same subnet as each other; other times they cross subnets.
So far the symptoms are SQL timeouts (with references to semaphore timeouts) or inaccessible file shares. While this is occurring, I can ping back and forth between the systems, and telnet sessions to ports 1433 and 445 connect fine. However, attempts to access an actual file share lead to the typical old-school Windows error "network name is no longer available". For example:
The specified network name is no longer available.
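For anyone who wants to repeat the telnet-style check against ports 1433 and 445 without opening sessions by hand, here is a minimal sketch in Python. The function name `tcp_port_open` and the address in the usage comment are placeholders of mine, not anything from our environment. Like telnet, it only confirms that the TCP handshake completes; it does not exercise an actual SMB or SQL session, which is exactly the layer that fails for us.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        # create_connection performs the full TCP three-way handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example usage (hypothetical destination VM address):
#   for port in (1433, 445):
#       print(port, tcp_port_open("192.0.2.10", port))
```

A True result here while the share is still inaccessible would match what I see: the port answers, but the session on top of it does not survive.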
I've performed a number of troubleshooting steps over the past few months to isolate the problem, with little success. Each of the following failed to resolve anything. I've rebooted the VMs. I've migrated them across different ESXi hosts and/or vmnics. I've even migrated the VMs to the SAME ESXi host and even the same vmnic. Even in this last scenario, with the VMs on the same subnet, they STILL could not talk to each other. That alone, I would think, rules out most of our environment as a potential culprit.
I finally found something that did resolve the issue, albeit temporarily. At one point, grasping at straws, I suspected a potential MAC address problem. I added a new vNIC to the system (to ensure it got a new MAC from VMware), migrated the IP over to it, and deleted the old vNIC, and magically connectivity was IMMEDIATELY restored. Until it broke again later.
The next time it occurred, I found that if I simply changed the MAC within the advanced properties of the NIC within the OS, I could achieve the same success. I generated a random MAC address and applied it, and connectivity was again restored. I found that even if I immediately reverted that new MAC back to the original MAC assigned by VMware, the connectivity remained. So the issue is clearly not with the MAC itself, but with something that relies upon it. In fact, in one scenario, simply clearing the ARP table of the destination VM fixed the issue, but that has only worked one time. Most often I have to change the actual MAC, and it seems random as to which machine the change actually needs to be made on. Sometimes it's the source, sometimes it's the destination, and other times it requires BOTH to be changed.
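As a rough illustration of the "generate a random MAC" step, here is a small Python sketch. It is not the tool I actually used; the function name `random_local_mac` is made up for the example. It sets the locally administered bit and clears the multicast bit in the first octet, so the result is a unicast address that should not collide with any vendor-assigned range.

```python
import secrets


def random_local_mac(sep: str = "-") -> str:
    """Return a random locally administered, unicast MAC address."""
    octets = bytearray(secrets.token_bytes(6))
    # Set bit 1 (locally administered) and clear bit 0 (multicast)
    # of the first octet.
    octets[0] = (octets[0] | 0x02) & 0xFE
    return sep.join(f"{b:02X}" for b in octets)


# Example usage:
#   print(random_local_mac())        # dash-separated form
#   print(random_local_mac(sep=""))  # 12 hex digits with no separator
```

I am assuming the separator-free form is what the NIC's advanced property field expects; check what your driver accepts before pasting it in.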
Usually it's the same exact VMs that continue to be re-affected. But now the issue is spreading to other VMs that previously never exhibited the issue!
Obviously I can't keep changing MAC addresses. I need to find a real fix but I'm at a loss.
I've searched high and low and I cannot find any definitive answer or anyone who has seen this exact same problem. I've seen some mentions of issues relating to the VMware Tools version, but we're on the latest for our environment. I've seen some mentions of the VMXNET3 adapter type, which we use exclusively, though it has never been problematic on our 2008/2012 VMs. I could switch adapter types, but I believe I'd lose my 10GbE if I did. Even then, I wouldn't know if that actually fixed anything until weeks or months down the road. I've also seen mention of some gratuitous ARP issues where Cisco switches are involved. We do use Cisco. With that said, those articles describe a symptom of duplicate IPs being reported, and we never see that. Further, as mentioned, I believe I've removed our switches from the mix by having VMs on the same host and subnet.
I'm fairly certain we don't have duplicate MAC addresses at play here. I've queried the entire VM environment, and it shows that unique MACs are assigned from a VMware perspective. Also, these MACs never change: things work for a time and then stop, and after fixing, I can revert back to the original MAC and the issue remains resolved.
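For reference, the duplicate check itself is simple once the inventory is exported as VM name and MAC pairs. The sketch below is an assumption about how one might do it in Python, not the query I ran; the function names and the sample VM names are hypothetical. It normalizes each MAC (case and separators) before comparing, since exports do not always use one format.

```python
from collections import defaultdict


def normalize_mac(mac: str) -> str:
    """Lowercase a MAC and drop separators such as ':' or '-'."""
    return "".join(ch for ch in mac.lower() if ch in "0123456789abcdef")


def find_duplicate_macs(pairs):
    """Map each MAC seen on more than one VM to the list of those VMs.

    `pairs` is an iterable of (vm_name, mac) tuples.
    """
    owners = defaultdict(list)
    for vm_name, mac in pairs:
        owners[normalize_mac(mac)].append(vm_name)
    return {mac: vms for mac, vms in owners.items() if len(vms) > 1}


# Example usage with made-up inventory rows:
#   rows = [("sql01", "00:50:56:AA:BB:01"), ("app01", "00-50-56-aa-bb-01")]
#   print(find_duplicate_macs(rows))
```

An empty result only shows that the assigned MACs are unique in the inventory; it says nothing about what the guests or switches have cached.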
So I'm at a complete loss. Anyone have any ideas??? I found a Reddit post with a seemingly related issue where it was determined, as of a month ago, that VMware is aware of the problem and intends to fix it in a later patch for VMware 6.7, but no further information was provided for me to follow up on. I intend to call VMware now, but I'm sure I'll get stuck in level 1 support hell with someone who will simply try to repeat everything I've already done.
Edit: One thing that really confuses me is how ping and telnet work if ARP is supposedly busted. I would expect those tools to require properly working ARP to perform their functions. Am I wrong?