@Titanomachia - I'm not trying to be rude and I do appreciate the replies, but literally everything you've asked for is shown in the very first post of this thread.
From what you posted from the vmware.log it looks like NUMA is okay. You can also double check it from ESXTOP with the metric I posted above. With that said, from the previous screenshot the CPU ready time looks okay as it's still under 10, which is kind of the universal baseline. However, there were a few spikes in the previous screenshot where the CPU ready time went a bit higher than you would like to see.
I hear what you're saying about getting the devs / management to get past that "more CPU is better" mentality, but much like the government just throwing more money at a problem doesn't always fix the problem. I have struggled with this a lot in the past as well, but the proof is in the results.
The only way to get around this is to do some testing, which is unfortunate as it sounds like it is already in production, which means you'll need some quick outage windows etc. One thing you can do, if your Windows Server is 2008 R2 Enterprise or Datacenter, is turn on CPU / Memory Hot Add. This way, when you reduce the CPU / Memory for testing and you don't get the results you're looking for, you can hot add more memory / CPU without a reboot. However, if you're not running an OS version that allows for hot add, it will require more outage windows.
I would say get a baseline of how long your SQL processes take to complete, then compare it to when you change the resources around. So if it takes 2 min to run a popular query on 32 vCPU but 1 min 30 seconds on 16, the proof is there despite the CPU% inside Windows Task Manager.
This may be a silly question, but what is your greatest obstacle? Is it that the VM just isn't getting the performance numbers you would like, or is it just the dev team / management worrying that the server is using 70% CPU consistently?
Here are some PDFs that you can use to help defend your case:
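To make that comparison concrete, here is a minimal sketch of the arithmetic in Python. The timings are the illustrative 2 min / 1 min 30 s figures from above, not measured results, and the function name is just made up for the example. Taking the median of several runs per configuration is my own suggestion so one odd run doesn't skew the result.

```python
from statistics import median


def runtime_change_percent(baseline_runs_s, candidate_runs_s):
    """Percent change in median run time versus the baseline.

    A negative result means the candidate configuration finished faster.
    """
    baseline = median(baseline_runs_s)
    candidate = median(candidate_runs_s)
    return (candidate - baseline) / baseline * 100.0


# Hypothetical timings in seconds, taken from the example figures above.
runs_32_vcpu = [120.0]  # 2 min on 32 vCPU
runs_16_vcpu = [90.0]   # 1 min 30 s on 16 vCPU

change = runtime_change_percent(runs_32_vcpu, runs_16_vcpu)
print(f"16 vCPU vs 32 vCPU: {change:+.1f}% run time")
```

Feed it the real query timings from each test window and you have a number to bring to management that doesn't depend on Task Manager at all.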
Message was edited by: JPM300
Apologies, I missed the last two screenshots. The issue is ready time: its average is over 1600 ms, which is very high and is the result of over-provisioning the vCPUs. Are you able to power the other VMs off to test?
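For comparing a ready time in milliseconds against a percentage threshold, here is a rough Python sketch of the usual conversion from a vCenter "ready" summation value to a ready percentage. It assumes the value comes from the real-time performance chart, which samples every 20 seconds; other chart ranges use longer intervals. Whether the 1600 ms in the screenshot is per vCPU or summed across all vCPUs isn't stated in this thread, so the optional divisor is an assumption you would need to confirm against the chart.

```python
def ready_percent(ready_ms, interval_s=20, vcpus=1):
    """Convert a CPU ready summation (ms) into a percentage of the sample interval.

    interval_s: chart sample interval in seconds (20 s for the real-time chart).
    vcpus: set above 1 only if ready_ms is summed across all vCPUs and you
           want a per-vCPU average (an assumption about how it was reported).
    """
    return ready_ms * 100.0 / (interval_s * 1000.0 * vcpus)


# 1600 ms of ready time within one 20 s real-time sample.
print(f"{ready_percent(1600):.2f}% of the sample interval")

# Same value, if it turned out to be a total across 32 vCPUs.
print(f"{ready_percent(1600, vcpus=32):.2f}% per vCPU")
```

ESXTOP's %RDY is already a percentage, so this only matters when reading the millisecond values off the vSphere Client charts.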
@JPM300 - I haven't had a chance to read the links, but will certainly take a look!
"but much like the government just throwing more money at a problem doesn't always fix the problem" - well actually to make matters even harder, this IS a government machine. **sigh**
About Hot Add, I believe one of the DBAs found info that even with hot add enabled, if we add more CPUs to the VM, the OS WILL take advantage of the added CPUs but the SQL processes will not until restarted, which equals an outage.
The greatest obstacle, as bluntly as possible, is the fact that the developers are not communicating. I have no idea how or what they are testing. From what I've managed to pull from them, they are running some kind of benchmark and seeing metrics of some kind, but believe they should be faster. That really sums up exactly what I know and, as I'm sure you'll see, this doesn't help me and I'm sure not you either. That said, the developers are going to management stating they need more CPU. Management and devs both are seeing the "high" CPU usage in Windows, agreeing, and now wanting to throw more resources at it. Is this in production? Yes, of course it is.
I will get some ESXTOP stats, and if my request to be present when they perform benchmarking again is granted, I'll get ESXTOP then too.
It's convincing management and the developers it likely isn't, because they are stuck looking only at Windows Task Manager and seeing it's using 60-70% CPU. At this time this is our "slow" time and we are expected to be busier in the next month or two. Their train of thought is "if it's at 70% now we're going to need a lot more CPUs when we are busy in a month".
I wholeheartedly believe we don't need 32 CPUs assigned to the VM. But maybe with the help from here I can either prove that point or be shown why I'm wrong.
Educating the management and end users is always the challenge. They need to understand that a VM pulls resources from a shared pool and will only pull what it needs. So in your case the VM is not being constrained by a lack of resources on the ESXi host and is only using ~50% of what is assigned to it. If the VM needs more CPU cycles, they will be delivered up to the limit of what is assigned (i.e. 32 cores of 2.399 GHz). Currently it is receiving 32 cores of ~1.2 GHz and the OS is indicating that it is using all of this.
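The ~50% figure falls straight out of those numbers. A quick sketch in Python, using only the values quoted above (32 vCPUs, 2.399 GHz per core, ~1.2 GHz currently delivered per core); the function name is just for illustration:

```python
def utilization_percent(vcpus, delivered_mhz_per_core, core_mhz):
    """Share of the VM's assigned CPU capacity that is currently being delivered."""
    assigned_mhz = vcpus * core_mhz
    delivered_mhz = vcpus * delivered_mhz_per_core
    return delivered_mhz / assigned_mhz * 100.0


vcpus = 32
core_mhz = 2399.0        # 2.399 GHz per core, the limit of what is assigned
delivered_mhz = 1200.0   # ~1.2 GHz per core currently being received

print(f"Assigned:  {vcpus * core_mhz:.0f} MHz")
print(f"Delivered: {vcpus * delivered_mhz:.0f} MHz")
print(f"Usage:     {utilization_percent(vcpus, delivered_mhz, core_mhz):.1f}%")
```

Showing management the assigned vs. delivered MHz side by side can be easier to digest than a Task Manager percentage.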
Lol well, no worries, I was just using an analogy, however I guess I hit that nail on the head.
Nonetheless, there are still ways to sort this out and get the numbers you want.
Keep in mind there is a very small overhead in virtualizing SQL, and it's about 10%, so if their numbers are about 10% off it could be within the overhead threshold. Also, I believe you are correct about SQL hot add CPU, as the SQL Server won't see the new CPU unless the SQL instance is restarted or the server rebooted. Either way it's an outage to SQL, which isn't great.
I think you will need to sit down with the dev team and management in the same room and get the numbers on what they are testing or what they want to see. Without knowing what they consider "poor performance" to be, it will be nearly impossible for you to try and sort it out.
With that said, post a screenshot of the ESXTOP results and we will see what we can dig out for you. Hopefully we can find the source of the bottleneck, or at the very least give you enough information to take back to management to help with the resource creep issues.
Working in a 5.0 environment (much like the original poster, my guess from the date of the post)... Oh, can't wait to finish the upgrade to 6.0.
But I am seeing the same thing. My theory?
32 vCPU mixed in with other workloads is most likely causing high CPU RDY%.
So the question is, how does Windows Task Manager show CPU RDY/Co-Stop? All Windows can see is that it is waiting for CPU cycles. Would that not cause Windows to report that CPU usage is higher?
To test this we moved the VM that was showing very high usage in Windows to a host with no other VMs. The result: vCenter and Windows Task Manager aligned better.
Why does this matter? As vROPS becomes the tool for performance monitoring, it will be compared to agent-based products that get information right from the guest OS, not vCenter. This brings into question the accuracy of vROPS or vCenter data.
Yeah, this is the problem with Windows: it doesn't report CPU ready/Co-stop well in Task Manager. However, Perfmon will show new counters you can report on once VMware Tools is installed. These can help get a better idea of what is happening at the hypervisor level, as opposed to the OS level, which isn't aware of what is happening under the hood.
In the past, VCOPS has been a really great tool for this, coupled with an application-level monitoring tool like SCOM or others to help with the application level. You can even bring SCOM snap-ins into VCOPS for direct comparisons. With this information you can pinpoint problem areas or go back to business units with proof it isn't the virtual environment that is causing their issues. Many times the networking/virtualization team are guilty until proven innocent.
Hope this helps