We have a production data center with 5 ESXi hosts running several Linux, Windows and Cisco VMs. For the past couple of months we have been facing a strange issue: several Linux VMs lose connectivity from outside until we start pinging from the console of those VMs. After that, the VMs are reachable from outside for a couple of hours, and then the same issue occurs again. We see this on only a few of the Linux VMs. As a workaround, we run a cron job on those VMs that pings their respective gateways every minute to keep the connectivity alive.
Please note that the Linux VMs run Ubuntu 16.04, Ubuntu 18.04 or CentOS 7. All of them have either open-vm-tools or the VMware-managed tools installed.
This problem occurs across all the ESXi hosts.
Does anyone have a permanent solution to this problem?
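For reference, the workaround could be sketched roughly as below. This is only an illustration, not our exact script: it assumes a Linux guest where `/proc/net/route` is readable and where the `ping` binary accepts `-c` (count) and `-W` (timeout in seconds), as the iputils version does. The function names are made up for this example.

```python
import socket
import struct
import subprocess
import sys


def default_gateway(route_table):
    """Return the default gateway IP from /proc/net/route contents, or None."""
    for line in route_table.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        destination, gateway, flags = fields[1], fields[2], int(fields[3], 16)
        # Destination 00000000 is the default route; flag 0x2 (RTF_GATEWAY)
        # means the route goes via a gateway.
        if destination == "00000000" and flags & 0x2:
            # On little-endian hosts such as x86 the address is printed as
            # little-endian hex, so unpack it accordingly.
            return socket.inet_ntoa(struct.pack("<L", int(gateway, 16)))
    return None


def ping_command(target, count=1, timeout_s=2):
    """Build the ping invocation (assumes iputils-style -c and -W flags)."""
    return ["ping", "-c", str(count), "-W", str(timeout_s), target]


def main():
    """Ping the default gateway once; return ping's exit code (1 if no gateway)."""
    with open("/proc/net/route") as f:
        gateway = default_gateway(f.read())
    if gateway is None:
        print("no default gateway found", file=sys.stderr)
        return 1
    return subprocess.call(ping_command(gateway), stdout=subprocess.DEVNULL)
```

To use something like this from cron, the file would need a final line such as `raise SystemExit(main())` and a crontab entry that runs it every minute. It only masks the symptom, which is why we are looking for a real fix.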
Thanks in advance for your help.
Have you tried switching out the virtual NIC in the VMs for one of a different type?
Thank you for your response.
Because of this issue, we first tried E1000 and E1000e. As this did not solve the problem, we switched to VMXNET with open-vm-tools or VMware Tools installed. Currently all VMs are running with VMXNET, but the issue still persists.
This looks like a MAC issue - something like your physical switch or router being unable to remember or propagate the MAC information of your VMs.
The next time one of your VMs loses connectivity, please check the following:
1. ARP table on the VM - do you see the MAC of the gateway? Do you see the MACs of the other VMs in the same VLAN subnet?
2. Can you ping this VM from another VM that is on the same ESXi host and in the same VLAN?
If the answer to both questions is negative, you need to go to your networking department - this is an L2 connectivity issue and your colleagues need to check the switches.
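If you want to script the first check, here is a minimal sketch. It assumes a Linux guest, where the kernel exposes the ARP cache in `/proc/net/arp`, and it treats an entry as resolved when the complete flag (0x2) is set and the MAC is not all zeros. The function names and the gateway address in the usage note are placeholders, not anything from your environment.

```python
def parse_arp_table(arp_text):
    """Parse /proc/net/arp contents into {ip: {"mac", "complete", "device"}}."""
    entries = {}
    for line in arp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6:
            continue
        ip, _hw_type, flags, mac, _mask, device = fields[:6]
        entries[ip] = {
            "mac": mac,
            # Flag 0x2 (ATF_COM) marks a completed, usable entry.
            "complete": bool(int(flags, 16) & 0x2),
            "device": device,
        }
    return entries


def is_resolved(entries, ip):
    """True if the ARP cache holds a completed entry with a real MAC for ip."""
    entry = entries.get(ip)
    return (
        entry is not None
        and entry["complete"]
        and entry["mac"] != "00:00:00:00:00:00"
    )
```

On an affected VM you could read the file with `open("/proc/net/arp").read()`, pass the result to `parse_arp_table`, and call `is_resolved(entries, "192.168.1.1")` with your real gateway address in place of the placeholder. A missing or incomplete gateway entry while the VM is unreachable would point toward the L2 side.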
You say that only Linux guests are affected.
If your Windows and Cisco VMs are running on the same hosts and use the same hardware NICs as the Linux VMs, I'd suspect an issue with your Linux settings and not an issue on the ESXi side. As already pointed out, there should be no hibernating allowed for the NICs inside the VM.
Do you have fixed IP addresses?
Maybe the gateway drops connections that are idle for some time, and the other VMs either have a better reconnection policy than the affected VMs or maintain a stable connection because of permanent traffic or ...?
Thank you for the reply!
I am actually seeing the same thing on the ESXi host itself. I am not able to get to it from Windows and Mac workstations after a certain amount of time (or possibly after the workstations go to sleep). I can see that the entry disappears from the ARP table on the workstation. I have to go to the ESXi console and ping the workstation for it to become visible again.
For now, I added a cron job as per the original poster. The cron job just pings my workstation every minute, which reestablishes the connection if it is ever lost.
It certainly seems like a routing issue - DHCP/DNS appear to be fine. I am trying to track it down, but what boggles my mind is why only the ESXi host is affected while all other devices (IoT, workstations, etc.) are not exhibiting this problem.