6 Replies Latest reply on Apr 24, 2017 10:52 AM by jhboricua

    VM CPU Demand metric weirdness

    jhboricua Novice

      I'm experiencing an odd issue with two VMs registering high CPU demand versus low usage. In the interest of being thorough, I'll give a lot of detail about the host and VM configuration below so as to avoid generic responses.

       

      The vSphere cluster is composed of 10 UCS B200 M4 hosts, each with dual E5-2680 v3 CPUs. Each CPU has 12 cores @ 2.5 GHz, for a total of 24 pCPUs per host (48 with HT, as reported by the hypervisor). Each host also has 384 GB of RAM. Hosts are running ESXi 6 U1. This cluster is dedicated to SQL VMs; as such, memory and CPU in this cluster are not over-committed in any way. There are several SQL VMs with different vCPU/vMem setups, and we have plenty of spare capacity on the cluster.

       

      The two VMs in question are 8 vCPU / 64 GB memory VMs running Windows Server 2012 R2 Datacenter and SQL 2012 Enterprise. Each VM sits on its own host at the moment, meaning it is the ONLY workload on the host it lives on. We didn't change the sockets/cores settings in the VM configuration, so they use the default 8 sockets / 1 core setup, and we have verified the 8 vCPUs in the guest are in a single NUMA node. There are no reservations/limits set on the VMs for either CPU or memory. VMware Tools is running and current on both VMs. The VM hardware version is 11 (ESXi 6.0 and later).

       

      The issue we are experiencing is that even though CPU usage on these VMs averages 20-25 percent, CPU demand is pegged at 20 GHz (8 vCPUs × 2.5 GHz), so vRealize is alerting on it. No other VM in the cluster shows this behavior, including some with the same number of vCPUs configured and some with more. It's only these two VMs.
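      To make the numbers concrete, here is a minimal sketch of the arithmetic behind the alert (illustrative helper names only, not a vSphere API): demand is pegged at the VM's full allocation (vCPUs × core clock) while usage reflects actual consumption.

```python
# Hypothetical sketch of the demand-vs-usage gap described above.
# Function names are illustrative, not part of any vSphere API.

def full_allocation_mhz(vcpus: int, core_mhz: int) -> int:
    """Full CPU capacity allocated to the VM, in MHz."""
    return vcpus * core_mhz

def usage_mhz(allocation_mhz: int, usage_pct: float) -> float:
    """Convert an average usage percentage into MHz consumed."""
    return allocation_mhz * usage_pct / 100.0

alloc = full_allocation_mhz(8, 2500)  # 8 vCPUs x 2.5 GHz = 20000 MHz
used = usage_mhz(alloc, 25)           # ~25% average usage = 5000 MHz
print(alloc, used)                    # 20000 5000.0
```

So the demand counter is reporting roughly four times what the VM is actually consuming, which is exactly the gap vRealize flags.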

       

      The first thing that came to mind was that power management was not allowing the host to give the VMs all the CPU power they are requesting. However, I've verified this is not the case: everything is set to High Performance. And I've confirmed it further by doing the following:

      1. If I spin up a stresslinux VM on the host with a similar vCPU configuration to these VMs, I can bring all its vCPUs to full utilization.
      2. On the actual VMs having the issue, I can spin up two instances of CPUSTRESS from Sysinternals and bring the vCPUs to 100% utilization (don't tell the DBA about this).
      3. Heck, even if SQL is NOT running, CPU demand still won't go down, despite CPU usage being 1%.
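      For anyone wanting to reproduce the stress check in step 2 without Sysinternals, a minimal busy-loop sketch (one worker per vCPU) would look something like this; it is a stand-in, not what CPUSTRESS actually does internally:

```python
# Hypothetical stand-in for the CPUSTRESS verification: pin one busy-loop
# worker per vCPU and watch CPU usage reach ~100% while it runs.
import multiprocessing as mp
import time

def burn(seconds: float) -> None:
    """Spin in a tight loop for the given duration."""
    end = time.monotonic() + seconds
    while time.monotonic() < end:
        pass

def stress_all_vcpus(seconds: float) -> None:
    """Start one spinning worker per logical CPU and wait for them."""
    workers = [mp.Process(target=burn, args=(seconds,))
               for _ in range(mp.cpu_count())]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

# stress_all_vcpus(10.0)  # uncomment to load every vCPU for 10 seconds
```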

       

      So I don't understand why the CPU demand counter is pegged at the full 20 GHz allocated to the VM when CPU usage is clearly nowhere near that AND there is nothing preventing the host from giving the VM all the resources it demands. I've gone over every setting I can think of and I'm not finding anything different between these VMs and other similarly configured ones NOT having the issue. Again, the VMs having this issue sit on separate hosts and are the only workload on them, meaning only 8 of the 24 pCPUs on each host are in use.

       

      Other things I've tried: moving the VMs to other hosts, and rebooting the VMs.

       

      Any insights would be appreciated. At this point I'm simply out of ideas.

        • 1. Re: VM CPU Demand metric weirdness
          vcallaway Enthusiast
          vExpert

          I was going to suggest looking at your CPU Ready values, but if they're the only VMs on each host then I guess we can strike that off the list.

           

          Shot in the dark here, but have you checked the power settings inside the actual OS? I don't think it would make a difference, but I've seen odd behavior before with 'balanced' vs. 'high performance' within a guest OS.

          • 2. Re: VM CPU Demand metric weirdness
            jhboricua Novice

            vcallaway, I forgot to mention that. I did look at the power settings inside the guest OS. It was set to the default of Balanced, just like the other VMs on the cluster. Changing it to High Performance had no effect, so I set it back to the default.

            • 3. Re: VM CPU Demand metric weirdness
              WallyL Novice

              Another stab in the dark,

              What about the power state on the host itself?

              From the BIOS to the host to the guest, are the settings the same all the way through?

               

              My thought was that the hardware is reporting the ready state incorrectly to the guest.

               

              Regards,
              Wally

              • 4. Re: VM CPU Demand metric weirdness
                Dee006 Hot Shot
                vExpert

                Hi,

                 

                Have you checked disk performance and disk-related metrics? Since your environment uses UCS, are you running on local HDDs or SAN?

                • 5. Re: VM CPU Demand metric weirdness
                  jhboricua Novice

                  WallyL - Yes, power settings have been verified through the entire stack. As I mentioned in my original post, it is unlikely to be power-management related, since only these two VMs are affected regardless of which host they live on, while the rest of the SQL VMs are not experiencing the issue.

                   

                  Dee006 - Hosts boot from, and have their datastores on, XtremIO arrays. Even if I stop SQL, which is the major consumer of CPU and disk resources on the VM, the CPU demand counter won't drop on the idle VM.

                  • 6. Re: VM CPU Demand metric weirdness
                    jhboricua Novice

                    Found it!

                     

                    These VMs had Latency Sensitivity set to High (in the VM advanced options). This is what was causing the CPU demand to be out of whack regardless of utilization: enabling it gives the VM exclusive use of its pCPUs, making them essentially unavailable for anything else on the host (including VMkernel processing threads), no matter the actual utilization. Setting it back to Normal was the fix.
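                    As a toy model of that behavior (this is an illustration of the effect described above, not how the scheduler is actually implemented): with Latency Sensitivity at High, the pCPUs are held exclusively, so the demand counter reports the full allocation regardless of utilization.

```python
# Toy model of the demand counter under the two Latency Sensitivity
# settings. Parameter names and logic are illustrative assumptions.

def reported_demand_mhz(vcpus: int, core_mhz: int, usage_pct: float,
                        latency_sensitivity: str = "normal") -> float:
    """With 'high' latency sensitivity the VM holds its pCPUs
    exclusively, so demand equals the full allocation; otherwise
    demand tracks actual usage."""
    allocation = float(vcpus * core_mhz)
    if latency_sensitivity == "high":
        return allocation                     # pegged, even when idle
    return allocation * usage_pct / 100.0

print(reported_demand_mhz(8, 2500, 1, "high"))    # 20000.0 at 1% usage
print(reported_demand_mhz(8, 2500, 1, "normal"))  # 200.0
```

This matches what I saw: even with SQL stopped and usage at 1%, the demand counter stayed at the full 20 GHz.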

                     

                    Now, why this was set to High is a mystery to everyone where I work; none of the previous VM admins/engineers work here anymore. But considering how over-provisioned these two VMs were versus their actual utilization, I'm not surprised in the least. It's almost as if the person who built them simply enabled anything tagged 'performance' or 'latency' just because they could.

                     

                    Anyway, I thought I should share my findings. The way I found out about this was almost by coincidence. While reclaiming resources from this VM, I could not remove its memory reservation: when I did, the VM would not power on, and the error stated that the memory reservation was required. Having never encountered this error, I started researching, and the VMware KB (kb2002779) indicated that the error I was seeing is common on virtual machines that have FPT (Full Passthrough) devices, since FPT requires a full memory reservation on the VM.

                     

                    However, there were no FPT devices on our two VMs, so that was confusing. A blog post pointed me in the right direction. It turns out that setting Latency Sensitivity to High gives the VM exclusive access to its assigned pCPUs, bypassing the VMkernel (essentially what FPT does), hence the memory reservation requirement.
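                    The power-on check I ran into can be sketched like this (a hypothetical model of the rule described above; the field names are illustrative, not vSphere object properties): a VM that holds resources exclusively must reserve all of its configured memory, or power-on fails.

```python
# Hypothetical model of the power-on constraint: high latency
# sensitivity (like FPT devices) requires a full memory reservation.
from dataclasses import dataclass

@dataclass
class VmConfig:
    memory_mb: int
    memory_reservation_mb: int
    latency_sensitivity: str = "normal"  # "normal" or "high"

def power_on_allowed(vm: VmConfig) -> bool:
    """A VM needing exclusive resources must reserve all its memory."""
    if vm.latency_sensitivity == "high":
        return vm.memory_reservation_mb == vm.memory_mb
    return True

# The situation I hit: 64 GB VM, reservation removed, sensitivity High
vm = VmConfig(memory_mb=65536, memory_reservation_mb=0,
              latency_sensitivity="high")
print(power_on_allowed(vm))  # False -> the power-on error I saw
```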

                     

                    Hope this helps anyone in the future.
