VMware Cloud Community
mlombz
Contributor

ESXi 6.5 U2 CPU latency problems (Dell R720 + GRID K2)

Hi to all,

I'm opening this discussion because I'm facing a very annoying problem with ESXi 6.5 U2 (customized Dell image) on a Dell R720 host.

The host is configured with 2x E5-2690, 128 GB DDR3 ECC, 1x 1 TB, 3x 2 TB (RAID 0), 1x Nvidia GRID K2, plus 2x 1100 W redundant PSUs and the GPU kit required for Dell's GPU support.

On the BIOS / firmware side, every part of the machine is up to date (latest BIOS version 2.7.0 of August 2018), and the BIOS settings are all set to "High Performance", as indicated in the Dell R720 BIOS manual for high-performance hosts that run a Virtual Desktop Infrastructure with a GPU installed. As mentioned at the beginning, the OS image is ESXi 6.5 U2 (customized Dell image); the host is added to a vCenter Server 6.7 (Linux appliance) that runs on the same host (default VM with 2 vCPU and 10 GB RAM, sized for a small virtual environment).

Everything is "practically" working; there are no errors or evident configuration mistakes from my point of view. I was also able to create multiple virtual machines, add the vGPU profiles, install the Nvidia drivers and run 3D graphics applications such as Solidworks or Ansys. That part is not the problem, because the latency also rises without involving the vGPU (apparently it is not GPU dependent).

The problem this thread is about is something I'm unable to predict or reproduce. A couple of days ago everything was working fine, but at a certain moment (not connected to any event or system error) all the virtual machines on the host started to "stutter" in a more or less periodic way (every 8 to 10 seconds). This latency was also clearly visible in the VDI machines running the 3D applications, because they showed massive FPS drops and high CPU peaks at the same moment.

At first I suspected a problem with the I/O communication with the Nvidia GRID K2, but I found that if I stop all the VMs running on the host and leave only the vCenter (2 vCPU, 10 GB RAM) and the Domain Controller / DNS (Windows Server 2016 Datacenter, 2 vCPU, 2 GB RAM), the latency appears in the same way on all the virtual machines and also on the host.

I can clearly see the problem by monitoring the CPU latency of the host in the vCenter Monitor tab. Two days ago there was no latency at all (a straight line at 0, so no stuttering inside the VMs, no CPU peaks, no network delay, no FPS drops in the 3D applications); then at a certain time it started to oscillate noisily between 0.1% and 10%, with peaks that sometimes reach 60%!

I have tried rebooting the host, reviewing the BIOS configuration and updating all the firmware, all without success: the latency no longer settles at the flat zero line. Oddly, yesterday afternoon the host ran with zero latency for a couple of minutes and then started to behave as described above again; since then we have not got back to the working condition.

I have also looked at ESXTOP over SSH, and all the VMs show high values of %LAT_C and oscillating values of %VMWAIT. I have read a couple of threads about this kind of problem being caused by problematic storage, but I didn't find anything strange in the disk panel. If you need screenshots or reports to answer, I will upload them.
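
For anyone who wants to check these counters outside the interactive view, esxtop can also write its samples to a CSV file in batch mode (`esxtop -b`). The following is a minimal, unofficial sketch of scanning such a file for columns whose header contains a given counter name. The header text `% Latency C`, the helper name `max_per_column` and the inline sample data are assumptions for illustration only, so check the header of a real capture before relying on it.

```python
import csv
import io

def max_per_column(csv_text, counter="% Latency C"):
    """Return {column header: max value} for columns whose header contains `counter`."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    wanted = [i for i, name in enumerate(header) if counter in name]
    peaks = {header[i]: 0.0 for i in wanted}
    for row in reader:
        for i in wanted:
            try:
                value = float(row[i])
            except (ValueError, IndexError):
                continue  # skip blank or malformed cells
            peaks[header[i]] = max(peaks[header[i]], value)
    return peaks

# Invented two-VM capture with simplified headers; real header names may differ.
sample = (
    "Time,vcsa % Latency C,dc01 % Latency C,vcsa % Ready\n"
    "10:00:00,0.10,0.20,1.0\n"
    "10:00:02,58.00,9.50,2.0\n"
)

for name, peak in max_per_column(sample).items():
    print(f"{name}: peak {peak:.1f}%")
```

Passing a different `counter` string (for example the header used for the VM wait counter in a real capture) reuses the same scan for other columns.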

Thank you very much to everyone for the support and any help you can give me. Sorry for the long text; I hope I was able to describe the problem without making big mistakes.

Thanks and best regards to everyone.

Marco

2 Replies
mlombz
Contributor

Hi to all,

I want to add an update to this thread because we have found the origin of the problem. Today we took a deep look with ESXTOP and found that ESXi is not following the "Maximum Performance" profile set in the Power Management configuration panel. As I mentioned before, the BIOS was first set up with the embedded "Maximum Performance" profile and then with the current setting, "OS DBPM" (ESXi controls C-states and P-states).

With this last option, and with the "Maximum Performance" profile, ESXi puts the CPUs into the P0 state and does not use the C2 state (only C0 and C1). This gives a perfect working condition, with no CPU latency (as explained before) and no FPS drops in the VMs with the vGPU installed.

The big problem starts here: this only works while there is no "high" CPU activity. Once the CPUs start working a bit above 5% to 10%, everything works for a couple of minutes (P-state fixed at P0, the maximum available on the E5-2690, corresponding to 2900 MHz; no latency and no stuttering), but after those minutes the behaviour in the ESXTOP power panel differs from what I expect from the "Maximum Performance" profile. ESXi starts moving all the cores down through the available P-states until each core reaches the minimum-frequency P-state (P15 on the E5-2690, corresponding to 1200 MHz), then moves them back up to P0 (2900 MHz). This causes the massive FPS drops in the applications, the network latency, the CPU latency and so on.
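
To put a rough number on why these swings hurt, here is a back-of-the-envelope sketch using only the two frequencies quoted above (2900 MHz at P0, 1200 MHz at P15). It assumes that the run time of a purely CPU-bound task scales inversely with clock frequency, which ignores memory, I/O and GPU effects. The 60 FPS starting point and the function name `slowdown` are illustrative assumptions, not measurements from this host.

```python
def slowdown(f_high_mhz, f_low_mhz):
    """Factor by which a purely CPU-bound task slows down when the clock drops."""
    return f_high_mhz / f_low_mhz

P0_MHZ = 2900   # highest P-state frequency quoted in the post
P15_MHZ = 1200  # lowest P-state frequency quoted in the post

factor = slowdown(P0_MHZ, P15_MHZ)

# Assumed example: a CPU-bound frame that just fits a 60 FPS budget at P0.
frame_ms_p0 = 1000 / 60
frame_ms_p15 = frame_ms_p0 * factor

print(f"slowdown at P15: {factor:.2f}x")
print(f"frame time: {frame_ms_p0:.1f} ms -> {frame_ms_p15:.1f} ms "
      f"(about {1000 / frame_ms_p15:.0f} FPS)")
```

Under these assumptions a core that falls from P0 to P15 takes roughly 2.4 times as long for the same work, which is consistent with the visible FPS drops when the cores sweep through the P-states.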

What is preventing ESXi from holding the maximum-performance P0 state? Is this a case of the BIOS and the OS fighting over P-state control?

Thanks again to all for the help and the support.

Marco

mlombz
Contributor

Hi to all,

Does anyone have an idea about how to solve this problem? I have tried a couple of things in the last few days. First, I reset the BIOS to the default settings and again set Turbo Mode, C1 and C1E to Disabled to reduce the chance of CPU throttling. I set the iDRAC7 thermal profile to Maximum Performance and checked that no power cap is enabled. The power control setting in the BIOS is OS DBPM, so that ESXi 6.5 U2 can control the P-states of the CPUs.

What I now see in the ESXTOP power panel is that with "High Performance" set, everything works for about an hour (the CPUs stay in P0 at maximum frequency with no throttling, so no FPS drops or latency). After that, power management seems to stop working: the CPUs start changing P-states more or less randomly, regardless of the computational load; the FPS in the vGPU machines drops every few seconds; and the CPU latency rises and oscillates around an average of 60%.

I found that if I select "Balanced" in Power Management (through vCenter), everything stops acting crazy and behaves correctly: idle CPUs drop to the minimum frequency until the load on them increases. That profile is not suitable for us, though, because it introduces too many FPS drops and too much latency in the VDI virtual machines with vGPU. If I set Power Management back to "High Performance", or to a "Custom" profile with P-states disabled, everything starts to act "crazy" again, with a strange P-state selection pattern.

For example, for a couple of minutes one CPU (16 logical cores) is in the P0 state and the other CPU (16 logical cores) is in P15 (lowest frequency). Then, with no apparent logic, the CPU at P0 goes to P7 and stays stuck there for a couple of minutes, while the CPU at P15 goes to P1 and back to P15, "bouncing" between P-states, all regardless of the computational load.
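
The "bouncing" described above could be quantified by noting the P-state shown in the ESXTOP power panel at a fixed interval and counting how often it changes. This is only a minimal sketch under that assumption: the two sample lists are invented to mimic the pattern described (one package stuck after a single change, the other flipping between P15 and P1), and the function names and the threshold of 2 are arbitrary choices for illustration.

```python
def count_transitions(pstates):
    """Count how many times consecutive samples differ."""
    return sum(1 for prev, cur in zip(pstates, pstates[1:]) if prev != cur)

def is_bouncing(pstates, max_transitions=2):
    """Flag a series that changes state more than `max_transitions` times."""
    return count_transitions(pstates) > max_transitions

# Invented samples (P-state index per sampling interval), not measured data.
package0 = [0, 0, 0, 7, 7, 7, 7, 7]        # P0, then stuck at P7
package1 = [15, 1, 15, 1, 15, 15, 1, 15]   # flipping between P15 and P1

for name, series in (("package0", package0), ("package1", package1)):
    print(name, count_transitions(series), is_bouncing(series))
```

Comparing the transition count against the CPU load logged over the same interval would show whether the changes track the load or, as described here, happen independently of it.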

I'm sorry for the long text, but I have no idea how to solve the problem or how to explain this behaviour. I hope someone can help, because we really don't know how to proceed.

P.S. I have also tried adding the parameters intel_pstates=disable and intel_idle.max_cstate=0 to the kernel boot options to force power management off, but again it works only for a while.

Thanks again to all.

Marco
