Re: Sudden network problems (packet replay?)

clk_admins · ‎02-14-2022

Hello everyone,

we are currently experiencing some very strange network problems. For about a week and a half, some of our employees have been complaining about lost Remote Desktop connections to their PCs. Because of the current situation, most of our employees work from home over an OpenVPN server that runs on one of our ESXi servers. After a bit of research (a lot of pings to Google's DNS from different machines in our network), we determined that the problem must come from our ESXi server, as we saw the same strange behavior on all VMs running on that server. Here is an excerpt of one of those pings:

[Di 1. Feb 15:04:24 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3596 ttl=117 time=4.50 ms
[Di 1. Feb 15:04:25 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3597 ttl=117 time=4.68 ms
[Di 1. Feb 15:04:26 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3598 ttl=117 time=4.81 ms
[Di 1. Feb 15:04:27 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3599 ttl=117 time=4.74 ms
[Di 1. Feb 15:04:28 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3600 ttl=117 time=4.56 ms
[Di 1. Feb 15:04:29 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3601 ttl=117 time=4.68 ms
[Di 1. Feb 15:04:30 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3602 ttl=117 time=4.54 ms
[Di 1. Feb 15:04:44 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3603 ttl=117 time=4.51 ms
[Di 1. Feb 15:04:44 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3604 ttl=117 time=4.53 ms
[Di 1. Feb 15:04:44 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3605 ttl=117 time=4.69 ms
[Di 1. Feb 15:04:44 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3606 ttl=117 time=4.51 ms
[Di 1. Feb 15:04:44 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3607 ttl=117 time=4.54 ms
[Di 1. Feb 15:04:44 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3608 ttl=117 time=4.52 ms
[Di 1. Feb 15:04:44 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3609 ttl=117 time=4.50 ms
[Di 1. Feb 15:04:44 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3610 ttl=117 time=4.59 ms
[Di 1. Feb 15:04:44 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3611 ttl=117 time=4.72 ms
[Di 1. Feb 15:04:44 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3612 ttl=117 time=4.53 ms
[Di 1. Feb 15:04:44 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3613 ttl=117 time=4.66 ms
[Di 1. Feb 15:04:44 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3614 ttl=117 time=4.55 ms
[Di 1. Feb 15:04:44 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3615 ttl=117 time=4.62 ms
[Di 1. Feb 15:04:44 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3616 ttl=117 time=4.71 ms
[Di 1. Feb 15:04:45 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3617 ttl=117 time=5.12 ms
[Di 1. Feb 15:04:46 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3618 ttl=117 time=4.50 ms
[Di 1. Feb 15:04:47 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3619 ttl=117 time=4.55 ms
[Di 1. Feb 15:04:48 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3620 ttl=117 time=4.65 ms

Normally we should get one reply per second, but the replies suddenly stop for a while (14 seconds in the excerpt above, between 15:04:30 and 15:04:44) and then arrive all at once. This would explain why our employees are experiencing connection loss on the VPN side.
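For anyone who wants to scan a longer ping log for the same pattern, here is a minimal Python sketch. It assumes the log lines look like the excerpt above (a bracketed timestamp containing HH:MM:SS, followed by the usual ping output). The function names and the 3-second threshold are my own choices for illustration, not part of any tool.

```python
import re

# Pulls HH:MM:SS out of the bracketed timestamp, plus the icmp_seq value.
# Only the time of day is parsed, so the locale-specific date part is ignored.
LINE_RE = re.compile(r"\[.*?(\d{2}):(\d{2}):(\d{2}).*?\]: .*icmp_seq=(\d+)")


def parse_ping_log(text):
    """Return a list of (seconds_since_midnight, icmp_seq) tuples."""
    samples = []
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m:
            h, mi, s, seq = map(int, m.groups())
            samples.append((h * 3600 + mi * 60 + s, seq))
    return samples


def find_stalls(samples, min_gap=3):
    """Find gaps of at least min_gap seconds between consecutive replies.

    Returns a list of (gap_seconds, first_seq_after_gap, burst_size), where
    burst_size counts the replies logged in the same second right after the gap.
    """
    stalls = []
    for i in range(1, len(samples)):
        # Modulo keeps the gap positive if the log crosses midnight.
        gap = (samples[i][0] - samples[i - 1][0]) % 86400
        if gap >= min_gap:
            t = samples[i][0]
            burst = 0
            j = i
            while j < len(samples) and samples[j][0] == t:
                burst += 1
                j += 1
            stalls.append((gap, samples[i][1], burst))
    return stalls


if __name__ == "__main__":
    demo = "\n".join([
        "[Di 1. Feb 15:04:29 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3601 ttl=117 time=4.68 ms",
        "[Di 1. Feb 15:04:30 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3602 ttl=117 time=4.54 ms",
        "[Di 1. Feb 15:04:44 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3603 ttl=117 time=4.51 ms",
        "[Di 1. Feb 15:04:44 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3604 ttl=117 time=4.53 ms",
        "[Di 1. Feb 15:04:45 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3605 ttl=117 time=4.69 ms",
    ])
    print(find_stalls(parse_ping_log(demo)))  # [(14, 3603, 2)]
```

A large gap followed by a burst of replies logged in the same second is the signature we see here; a plain gap without a burst would look more like ordinary packet loss.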

I then went on to check the logs/events on the ESXi host and found these messages:

Successfully restored access to volume 5776ab59-b2c08ac9-32d4-0cc47aabf506 (datastore2) following connectivity issues.	Monday, February 7, 2022, 18:46:55 +0100	Warning	None
Successfully restored access to volume 5776a676-6cd2e353-9860-0cc47aabf506 (datastore1) following connectivity issues.	Monday, February 7, 2022, 18:46:55 +0100	Warning	None
Lost access to volume 5776ab59-b2c08ac9-32d4-0cc47aabf506 (datastore2) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.	Monday, February 7, 2022, 18:46:41 +0100	Warning	None
Lost access to volume 5776a676-6cd2e353-9860-0cc47aabf506 (datastore1) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.	Monday, February 7, 2022, 18:46:41 +0100	Warning	None

They basically say that the connection to our datastores was lost and that the host is trying to reconnect. The timing of these messages and the length of the connection loss fit the timing of the network problems.
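To compare the length of each datastore outage with the ping gaps, the lost/restored events can be paired up. The following is only a rough Python sketch: it assumes the events have already been transcribed by hand into (ISO timestamp, datastore, kind) tuples, and the function name and the "lost"/"restored" labels are hypothetical, not an ESXi export format.

```python
from datetime import datetime


def outage_windows(events):
    """Pair each 'lost' event with the next 'restored' event per datastore.

    events: iterable of (iso_timestamp, datastore, kind), with kind being
    'lost' or 'restored' (labels chosen for this sketch).
    Returns a list of (datastore, lost_at, restored_at, seconds).
    """
    open_since = {}
    windows = []
    # Process in chronological order; the event view lists newest first.
    for ts, store, kind in sorted(events, key=lambda e: datetime.fromisoformat(e[0])):
        when = datetime.fromisoformat(ts)
        if kind == "lost":
            # Keep the earliest 'lost' if several arrive before a restore.
            open_since.setdefault(store, when)
        elif kind == "restored" and store in open_since:
            start = open_since.pop(store)
            windows.append((store, start, when, (when - start).total_seconds()))
    return windows


# The four events quoted above, transcribed by hand.
events = [
    ("2022-02-07T18:46:55+01:00", "datastore2", "restored"),
    ("2022-02-07T18:46:55+01:00", "datastore1", "restored"),
    ("2022-02-07T18:46:41+01:00", "datastore2", "lost"),
    ("2022-02-07T18:46:41+01:00", "datastore1", "lost"),
]

for store, start, end, seconds in outage_windows(events):
    print(f"{store}: lost {start:%H:%M:%S}, restored {end:%H:%M:%S}, {seconds:.0f} s")
```

For the events above this gives a 14-second window for both datastores, which is the same order of magnitude as the ping stalls.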

We suspected a hardware problem at first, so we switched to our backup ESXi server, to which we replicate nightly. With every VM started on the backup server, we experienced the same problem. So a hardware problem seems highly unlikely now, as both systems would have had to develop the same hardware fault all of a sudden.

Later on we moved the VPN VM (and only that VM) back to the old ESXi host, and the problems (at least for the OpenVPN VM) were gone. No further connection problems were reported.

Did any of you ever experience similar problems? Does anyone have an idea what is going on? I feel like I am going nuts trying to find a credible explanation and a solution... If I can provide any more information, just let me know 🙂

Thanks in advance and kind regards,

Jan

sramanuja · ‎02-14-2022

There is a known issue in vSphere 6.5 and 6.7 in which slow storage operations can cause ESXi hosts to go offline: https://kb.vmware.com/s/article/1003659

I would recommend following the steps in the KB to troubleshoot the issue.

This issue has been fixed in vSphere 7.

clk_admins · ‎02-14-2022

Thanks for the information. The KB article only references network storage, but we are using local storage connected via a RAID controller. I am unsure whether we can update to vSphere 7, as we have a vSphere 6 Essentials kit (with a valid subscription).

Thanks again for helping.

sramanuja · ‎02-15-2022

AFAIK it applies to any slow I/O operations, so it could affect local storage as well. In most cases it was tied to operations such as backups, which caused the slow I/O.

If you have a vSphere 6 license, I don't think you can upgrade to vSphere 7 unless a new license is purchased.

clk_admins · ‎02-16-2022

Once again, thanks for helping us. I have started some performance profiling and will check whether the issues were connected to spikes in I/O-heavy operations.
Yeah, that's what I would have guessed too. Will look into upgrading the license then (on the technical side the servers are compatible with vSphere 7 AFAIK).
