bsinha951994
Contributor

vSphere VM suddenly becomes unreachable from outside until we start pinging from the console

Hello,

We run a production data center with 5 ESXi hosts, hosting several Linux, Windows, and Cisco VMs. For the past couple of months we have been facing a strange issue: several Linux VMs lose internet connectivity from outside until we start pinging from the console of those VMs. After that, the VMs are reachable from outside for a couple of hours, and then the same issue occurs again. So far we see this on only a few of the Linux VMs. As a workaround, we run a cron job on those VMs that pings their respective gateways every minute to keep the connectivity alive.
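
A minimal sketch of the kind of keepalive described above, written in Python; the post does not show the actual cron job, so everything here is an assumption. It assumes a Linux guest with the iputils `ping` binary, and the names `default_gateway` and `ping_once` are hypothetical helpers, not part of any VMware tooling.

```python
import socket
import struct
import subprocess
from typing import Optional


def default_gateway(route_table: str) -> Optional[str]:
    """Return the IPv4 default gateway found in /proc/net/route content, or None."""
    for line in route_table.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 4:
            continue
        destination, gateway, flags = fields[1], fields[2], int(fields[3], 16)
        # Destination 00000000 is the default route; flag 0x2 means "via a gateway".
        if destination == "00000000" and flags & 0x2:
            # The kernel prints the address as little-endian hex.
            return socket.inet_ntoa(struct.pack("<L", int(gateway, 16)))
    return None


def ping_once(address: str, timeout_s: int = 2) -> bool:
    """Send one ICMP echo with the system ping; True if a reply came back."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

A script run from cron every minute could read `/proc/net/route`, pass its content to `default_gateway`, and call `ping_once` on the result. This only masks the symptom; it does not explain why the VM becomes unreachable.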

Please note that the Linux VMs run Ubuntu 16.04, Ubuntu 18.04, or CentOS 7. All of them have either open-vm-tools or VMware Tools installed.

This problem occurs across all the ESXi hosts.

Does anyone have a permanent solution to this problem?

Thanks in advance for your help.

10 Replies
scott28tt
VMware Employee

Have you tried switching out the virtual NIC in the VMs for one of a different type?



Although I am a VMware employee, I contribute to VMware Communities voluntarily (i.e. not in any official capacity).
bsinha951994
Contributor

Thank you for your response.

At first we used E1000 or E1000e adapters. As this did not solve the problem, we switched to VMXNET with open-vm-tools or VMware Tools installed. Currently all VMs are running with VMXNET, but the issue still persists.

ZibiM
Enthusiast

This looks like a MAC address issue, for example your physical switch or router being unable to remember or propagate the MAC addresses of your VMs.

The next time one of your VMs loses connectivity, please check the following:

1. The ARP table on the VM: do you see the MAC of the gateway? Do you see the MACs of the other VMs in the same VLAN/subnet?

2. Can you ping this VM from another VM that is on the same ESXi host and in the same VLAN?

If the answer to both questions is no, you need to go to your networking department: this is an L2 connectivity issue and your colleagues need to check the switches.
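
Check 1 can be scripted so it is easy to capture at the moment a VM drops off. The sketch below is only an illustration: it assumes the guest has iproute2 and that the text passed in comes from `ip neigh show`; `parse_neighbours` and `gateway_resolved` are hypothetical helper names.

```python
from typing import Dict, Optional, Tuple


def parse_neighbours(output: str) -> Dict[str, Tuple[Optional[str], str]]:
    """Map each neighbour IP to (MAC or None, state) from `ip neigh show` output."""
    table: Dict[str, Tuple[Optional[str], str]] = {}
    for line in output.splitlines():
        tokens = line.split()
        if len(tokens) < 2:
            continue
        mac = None
        if "lladdr" in tokens:
            idx = tokens.index("lladdr")
            if idx + 1 < len(tokens):
                mac = tokens[idx + 1]
        # The neighbour state (REACHABLE, STALE, FAILED, ...) is the last token.
        table[tokens[0]] = (mac, tokens[-1])
    return table


def gateway_resolved(output: str, gateway_ip: str) -> bool:
    """True if the gateway has a MAC and is not in a FAILED/INCOMPLETE state."""
    mac, state = parse_neighbours(output).get(gateway_ip, (None, "MISSING"))
    return mac is not None and state not in ("FAILED", "INCOMPLETE")
```

The input text could be collected with `subprocess.check_output(["ip", "neigh", "show"], text=True)` and logged together with a timestamp, so the state of the ARP table during an outage can be handed to the network team.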

bsinha951994
Contributor

Thank you for your suggestion.

We will follow your instructions the next time we find one of the VMs unreachable.

elfox
Contributor

Hi, I am currently facing the same issue. Did you ever figure out the root cause/solution?

patilraksh42
Contributor

Hello,
Could you please check the network wake setting in the Linux OS? The interface may be suspended due to inactivity.

Thanks 

Rakesh

pwolf
Enthusiast

You say that only Linux guests are affected.

If your Windows and Cisco VMs run on the same hosts and use the same hardware NICs as the Linux VMs, I'd suspect an issue with your Linux settings rather than an issue on the ESXi side. As already pointed out, hibernation should not be allowed for the NICs inside the VM.

Do you have fixed IP addresses?

Maybe the gateway drops connections that have been idle for some time, and the other VMs either have a better reconnection policy than the affected VMs or maintain a stable connection because of permanent traffic, or ...?

elfox
Contributor

Thank you for the reply!

I am actually seeing the same thing on the ESXi host itself. I cannot reach it from Windows and Mac workstations after a certain amount of time (or possibly after the workstations go to sleep). I can see that the entry disappears from the ARP table on the workstation. I have to go to the ESXi console and ping the workstation for the host to become visible again.

For now, I added a cron job as per the original poster. It just pings my workstation every minute, which re-establishes the connection if it is ever lost.

pwolf
Enthusiast

That sounds like a switching or routing issue. If the servers and the workstations are on different subnets, I'd look for routing issues; otherwise it is probably a switching problem.

It is normal for the ARP table on the workstation to be cleared after a certain amount of inactivity, and that is no sign of a problem. But being able to reach a host only after you have received traffic from it is clearly an error situation. If the workstation knows the IP address of the host (that is, if DNS is working correctly), the workstation asks for the MAC address of this IP address, and either the host or a router on the way answers. So you should look at the ARP tables of the switches on the way to your host.

That said, I once had a similar problem with NPAR-enabled adapters of the Marvell FastLinQ 41000 Series on an HPE server. In the end I gave up on NPAR and disabled that feature in the hardware settings of the NIC. This also had the advantage that it is much easier to configure for cluster, vMotion, etc. than physical NPAR functions bound to a specific VM.
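
The first triage step above (same subnet or not) can be sketched with Python's standard `ipaddress` module. The function name `likely_layer` and the example addresses are hypothetical, and the prefix length has to be taken from your own network configuration.

```python
import ipaddress


def likely_layer(host: str, workstation: str, prefix_len: int) -> str:
    """Return 'switching' if both IPs share the host's subnet, else 'routing'."""
    host_net = ipaddress.ip_interface(f"{host}/{prefix_len}").network
    if ipaddress.ip_address(workstation) in host_net:
        # Same subnet: traffic never crosses a router, so suspect L2 (switches, ARP).
        return "switching"
    # Different subnets: a router is in the path, so check routing first.
    return "routing"
```

For example, `likely_layer("192.168.1.10", "192.168.1.50", 24)` returns `"switching"`, while a workstation at `192.168.2.50` with the same prefix length gives `"routing"`. This is only a hint about where to start looking, not a diagnosis.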
elfox
Contributor

It certainly seems like a routing issue; DHCP and DNS appear to be fine. I am trying to track it down, but what boggles my mind is why only the ESXi host is affected while all other devices (IoT, workstations, etc.) do not exhibit this problem.
