VMware Cloud Community
Th3Judg3
Contributor

High ping latency even when pinging localhost

Hello,

For the past couple of days, maybe more than a week, we have been seeing some strange ping latency.

At first we thought it had something to do with our network equipment, but after a few hours of investigation we came to the following conclusions:

1) Pinging VM to VM (both hosted on the same ESXi host) results in an average > 1.5ms, with spikes up to 50ms, over 100 pings at the default interval.

2) Created two VMs on a separate vSwitch without any dedicated NIC and pinged between them. The result was better but still not good enough - average > 0.5ms, with spikes up to 3ms and even 10ms.

3) Pinging the ESXi management interfaces from other devices on the same LANs showed good latency - average around 0.2ms, with spikes up to 1.7ms.

4) Pinging devices on the network from the ESXi console itself (over SSH) showed higher latency than expected - average > 0.6ms, with spikes up to 5ms.

5) The interesting part: pinging localhost from the ESXi console - average > 0.3ms, with spikes up to 2-3ms.
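For reference, the tests above were basically just runs of the following commands (the addresses below are placeholders, not our real ones):

# inside the guests, VM to VM on the same host (tests 1 and 2)
ping -c 100 10.0.0.12

# from another device on the LAN to the ESXi management interface (test 3)
ping -c 100 192.168.1.10

# from the ESXi shell to another device and to localhost (tests 4 and 5)
ping -c 100 192.168.1.1
ping -c 100 127.0.0.1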

We thought there might be contention or a bottleneck somewhere on the ESXi host but couldn't confirm this, at least not yet. The CPU usage is around 65-80%, with spikes up to 85%, in esxtop. Could this be the cause of our issue? Here is an output from esxtop:


PCPU USED(%):  59  61  52  63  74  59  68  74 AVG:  64

PCPU UTIL(%):  60  61  53  63  75  60  69  74 AVG:  64

      ID      GID NAME             NWLD   %USED    %RUN    %SYS   %WAIT %VMWAIT    %RDY   %IDLE  %OVRLP   %CSTP  %MLMTD  %SWPWT

       1        1 idle                8  275.55  546.00    0.01    0.00       -  224.18    0.00    7.03    0.00    0.00    0.00

       8        8 helper             86   97.66   99.76    0.00 8101.96       -   51.99    0.00    2.24    0.00    0.00    0.00

1786346  1786346 FreeBSD9_037       10   71.96   71.78    2.37  771.33    1.69   58.69  196.71    2.49   53.55    0.00    0.00

6218425  6218425 FreeBSD9_152        8   69.71   70.18    2.55  657.31    0.43   35.38   85.34    3.17    0.36    0.00    0.00

4825332  4825332 webhosting01.wh    12   41.80   39.86    3.44 1070.75    0.43   36.05  305.25    2.12    1.45    0.00    0.00

6363251  6363251 esxtop.36586035     1   20.00   19.33    0.00   75.09       -    0.11    0.00    0.04    0.00    0.00    0.00

5587218  5587218 CentOS5_148        10   17.43   15.48    1.89  907.99    1.08   32.95  333.24    0.62    0.00    0.00    0.00

1528430  1528430 FreeBSD9_116        8   17.00   16.97    0.81  707.03    0.13   39.60  134.28    0.95    0.19    0.00    0.00

4108400  4108400 FreeBSD9_140        8   13.54   13.67    0.38  725.60    4.52   22.63  146.88    0.57    4.08    0.00    0.00

1884461  1884461 FreeBSD9_134        8   12.79   12.49    0.67  738.98    0.18   13.53  165.37    0.53    0.00    0.00    0.00

6112231  6112231 FreeBSD9_143        7   12.24   11.99    0.96  647.13    0.00    7.75   75.78    0.79    0.00    0.00    0.00

4409984  4409984 Win7_128            8    9.06    9.20    0.04  742.67    0.02    3.87  176.57    0.25    0.00    0.00    0.00

6285951  6285951 Unattended_Depl     9    8.73    8.01    0.84  835.92    0.04    6.54  174.64    0.48    0.00    0.00    0.00

The helper process is using quite a lot of CPU, but I have no idea how to debug this process further or whether it could be the cause of this.
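The only way I've found to dig a bit deeper is to expand the group in esxtop's CPU view so its individual worlds show up (if someone knows a better approach, please correct me):

esxtop    # interactive CPU view
e         # expand/rollup a group into its individual worlds
8         # when prompted for the group id, enter the helper GID from the output above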

There is no bottleneck/contention on the network side.

Our setup is pretty simple:

One ESXi 5.1.0 build 1065491 host running on an old HP DL585 G2 with 8x Opteron 8218 CPUs and 64GB RAM. The host is connected to the rest of the infrastructure via two switches: one gigabit switch for production, reachable from outside (public IP addresses), and one 100Mbps switch for internal management using private IP addresses. One NIC is connected to each of these two switches and we're using a vDS. There are two management/VMkernel interfaces - one on the public network and one on the internal network. Customers' VMs are in the same LAN/network as the public management interface, no VLANs.
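For completeness, the VMkernel interfaces and uplinks can be double-checked from the ESXi shell with something like the following:

esxcli network ip interface list        # vmk interfaces and the switches they are attached to
esxcli network ip interface ipv4 get    # their IP configuration
esxcli network nic list                 # physical uplinks, link state and speed
esxcli network vswitch dvs vmware list  # the vDS and its uplink mapping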

For storage we are using a SAN - an EMC CLARiiON CX3-20 - connected to the ESXi server via 2x Brocade switches running at 4Gbps.

If someone has had similar issues, or if you have any idea what could cause such latencies, I would appreciate a little help :)

Regards,

Raul


6 Replies
jrmunday
Commander

Hi Raul,

There are two obvious things that stand out below:

  1. The DL585 G2, with AMD Opteron 82xx series processors, is only supported up to 4.1 U3.
  2. The %RDY time is excessively high. Given that your host is only averaging 64% CPU utilization, I would be inclined to say that these VMs are over-provisioned and struggling to be scheduled on the underlying physical cores.

Even though it's not supported, it may still work, and I would do the following:

- Upgrade hardware BIOS / firmware to the latest version

- Ensure that the BIOS is configured for maximum performance (i.e. no power-saving features)

- Patch ESXi to the current build, including the NIC driver VIBs
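If you patch from the ESXi shell rather than Update Manager, the rough sequence would be something like the following (the bundle name below is just a placeholder for whatever the latest 5.1 patch bundle is):

esxcli system maintenanceMode set --enable true
esxcli software vib update -d /vmfs/volumes/datastore1/ESXi510-2013xxxxx.zip
reboot
esxcli system maintenanceMode set --enable false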

Once this is done, I would test ping latency starting with a single VM (single core) and increase the CPU load (keeping an eye on %RDY time) to identify how this impacts latency.
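To keep an eye on %RDY during the test without sitting in the interactive view, you can let esxtop log it in batch mode, for example:

esxtop -b -d 5 -n 120 > /tmp/esxtop-capture.csv    # one sample every 5 seconds for 10 minutes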

I see most of your VMs are flavours of *NIX, but here is an interesting KB article related to Windows systems that display this behaviour:

VMware KB: Poor network performance or high network latency on Windows virtual machines

Cheers,

Jon

vExpert 2014 - 2022 | VCP6-DCV | http://www.jonmunday.net | @JonMunday77
Th3Judg3
Contributor

Hi Jon,

Thanks for your reply.

I will try what you proposed, but it will take some time since all those actions require downtime. Do you recommend an upgrade to vSphere 5.5, or only an update to the latest build of vSphere 5.1?

Regards,

Raul

jrmunday
Commander

Hi Raul,

I would upgrade to the latest 5.1 build in the first instance (this should be quick without too much hassle).

In parallel, I would verify if anyone else has 5.5 installed and working on this unsupported hardware before considering an upgrade.

Cheers,

Jon

vExpert 2014 - 2022 | VCP6-DCV | http://www.jonmunday.net | @JonMunday77
MKguy
Virtuoso

As mentioned by Jon before, your CPU %RDY and some %CSTP values are way too high. I suspect this might be one of the main causes, if not the main one. From what I can tell from your esxtop paste, you seem to be running (at least) around 24 vCPUs on an old server with 8 physical CPU cores. As a general rule of thumb, you should aim for %RDY below 5 per vCPU.
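To put rough numbers on it: the group-level %RDY in esxtop is summed over all the worlds in that group, so assuming FreeBSD9_037 is an 8-vCPU VM (the rest of its 10 worlds being ancillary), its 58.69 %RDY works out to roughly 7% per vCPU - already above that guideline, and that's before the 53.55 %CSTP it is also showing.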

If you can't update the hardware or throw in another server to better balance the load, then you should reduce the number of vCPUs on your VMs. Check what your VMs really need and adjust the number of vCPUs accordingly.

An update to at least a recent 5.1 release of course won't hurt, but I doubt it will get rid of the high CPU ready time values if you keep over-provisioning vCPUs on this host.

-- http://alpacapowered.wordpress.com
Th3Judg3
Contributor

Yep, it seems that our host is much too overcommitted. Although CPU usage is in an acceptable range, CPU %RDY is way too high and %CSTP is also high for some VMs. All that is left for us is to either replace the current host or add another one.

Thank you both for your help!

Regards,

Raul

ceciliok
Contributor

Hello Team,

I have a similar problem. My latency is not as high, as it ranges between 0.1-0.3ms, but I am experiencing spikes of up to 15ms.

I also thought that there might be a bottleneck somewhere on the ESXi host, but that's not the case. The CPU usage is around 7.2%.

(screenshots attached: ceciliok_0-1676449381916.png, ceciliok_1-1676449526133.png)

Kindly support 

Regards

Cecilio.K