VMware Cloud Community
woodycollins
Contributor

Moderate CPU Usage / Ready Time

So I'm sure this has been asked many times over, but being a fairly new ESX administrator at our company I wanted to verify my understanding with you online experts. Our farm has a 5-node cluster (one of a few clusters). Each host in the cluster has 2 sockets with 8-core CPUs (80 cores total across the cluster) and 256GB RAM. Across the cluster we have 225 VMs. I've been monitoring our cluster for a good bit of time now, and from my observations the average host CPU utilization is around 40%, ready times range widely from less than 1% up to 40%, and there is absolutely no memory contention. In fact, about 30% of the VMs in the environment are showing a ready time of 8% or higher on average.

I guess the first question I have is: is there any correlation between average CPU usage and ready time at all? For example, should CPU usage be higher when VM ready times increase, or not? At first it seemed fairly odd that I would have such high ready times but low CPU utilization, but the more I think about it, the more it makes sense: if the VMs are only using 40% of the CPU's overall speed during the time they've been granted CPU time, you could end up in a scenario with low-to-moderate CPU usage and high ready time. Is this correct, or is there something I'm not seeing/understanding about why we would see such high ready times but low average CPU usage?
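(For anyone converting the raw counters: vCenter reports CPU ready as a summation in milliseconds per sampling interval, so turning it into a percentage is just dividing by the interval length. A minimal sketch, assuming the 20-second real-time sample interval:)

```python
def ready_percent(ready_summation_ms, interval_seconds=20):
    """Convert a CPU ready summation (ms) to a percentage of the sample interval.

    vCenter's real-time charts use 20 s samples; historical rollups use
    longer intervals (e.g. 300 s), so pass the interval that matches the chart.
    """
    return ready_summation_ms / (interval_seconds * 1000) * 100

# e.g. 2000 ms of ready time in a 20 s real-time sample is 10% ready:
print(ready_percent(2000))        # 10.0
print(ready_percent(12000, 300))  # 4.0 (same idea on a 5-minute rollup)
```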

Also, am I correct in calculating that we are getting a 3.7:1 virtualization ratio if I add up all the vCPUs and divide by the pCPU count in the cluster? And if that is the case, does that seem low to others? I really don't have anything to compare it to.

8 Replies
rmathis
Contributor

Sounds like you might have done a little over-provisioning of the VMs (not enough info to tell). What type of guest OS are you running? General specs (CPU, memory, function) would make it easier to say. Giving your VMs more resources than they need will create these results. A file server, for example, with 2 lightly accessed shares wouldn't need 2 CPUs and 4GB RAM. Using a smaller VM will take care of some of the ready time. This goes against many years of the traditional method, but less is more in this case. It's a little weird, but just because a Windows box says in Task Manager that it's running at 60% doesn't mean it's the same as a physical box where you'd slap in another CPU. A well-defined VM is a fast VM.

And on the last question: yup, you're not using the hardware close to its max potential. It depends on load and function, which is why I asked what the VMs do. For low-use VMs you can get 20 virtual CPUs per physical core; for heavy use you will get less. I have 4 hosts in a cluster and 130 VMs and still have room to spare. That's general file, print, a few DB boxes, some network management, and a lot of Win XP and 7 boxes, with almost no strain on the hosts. My DBA loves me and yet I limit his butt lol.

If it gets too bad you can set DRS rules to keep the high-usage machines apart, which could help the hosts' ability to service all machines.

Check out Matt's post; he's pretty damn good.

http://communities.vmware.com/thread/130310

Not poking at you with the link; I did the same thing when I first started. There are thousands of things you will learn, and not just VMware-related; you'll see Windows and Linux in a new way. You'll start to see how fast you can make these VMs. It's a new drug, man, watch out, it catches on quick :)

woodycollins
Contributor

Thanks for the reply. To give you a little more information: we have 18 4-vCPU systems, 22 2-vCPU systems, and the rest are 1-vCPU systems. RAM per machine ranges from 2GB to 12GB. Almost all are running Windows 2003/2008. Sadly, this farm was inherited this way a couple weeks ago, and from what I can tell the VMs are WAY over-provisioned for memory, and the 40 or so VMs with multiple vCPUs can be dropped down to a single vCPU (improving ready times). I am working on getting the multi-vCPU systems dropped down to single vCPUs, but the problem I am running into is selling the associated business units on removing resources from their machines. It's hard to tell a developer that's always had a hardware box that he doesn't need 4 CPUs for an application in our company.
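(As a sanity check on the 3.7:1 figure from the first post, the counts in this reply can be plugged straight in. This assumes all of the remaining VMs out of the 225 are single-vCPU, which is an inference from "the rest are 1vCPU systems":)

```python
# vCPU inventory as described in the thread
four_vcpu, two_vcpu, total_vms = 18, 22, 225
one_vcpu = total_vms - four_vcpu - two_vcpu  # remaining VMs assumed 1 vCPU

vcpus = four_vcpu * 4 + two_vcpu * 2 + one_vcpu * 1  # 301
pcpus = 5 * 2 * 8  # 5 hosts x 2 sockets x 8 cores = 80 physical cores

print(f"{vcpus / pcpus:.2f}:1")  # 3.76:1
```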

Also, I guess I need a better understanding of what a "low" use VM is compared to a "high" use VM. Is there any documentation or set of values that defines low/medium/high expectations?

One of the biggest problems we're running into is that in the cluster I described, someone somewhere decided it would be OK to throw desktops on it. And the average desktop user appears to be using around 800-1000 MHz per VM. To me that would be classified as "high" use? I'm just now finishing up a script to pull usage stats so I can get a better idea of average usage and understand the VMs themselves better.
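(There's no official low/medium/high definition, but a first-pass bucketing of exported average-MHz stats might look like the sketch below. The 500/1500 MHz thresholds are arbitrary assumptions for illustration, not VMware guidance:)

```python
def classify_usage(avg_mhz, low=500, high=1500):
    """Bucket a VM by average CPU demand in MHz (thresholds are illustrative)."""
    if avg_mhz < low:
        return "low"
    return "medium" if avg_mhz < high else "high"

# The ~800-1000 MHz desktop VMs mentioned above would land in the middle bucket:
for mhz in (300, 900, 2200):
    print(mhz, classify_usage(mhz))
```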

Any additional information on capacity planning would be helpful. I don't take offense easily, so don't worry about that :)

rmathis
Contributor

Yup, you're over-provisioned lol. Watch the performance charts over the past month on some of the 4-vCPU machines and see about getting those down first; that will be the biggest cutback. If they sit at less than 30%, then it's very easy to justify dropping down, maybe even stepping back to three vCPUs to see if you can push up the percentage used vs. idle.

Are you under an SLA with specific hardware requirements per vendor and application? I have yet to find an app that uses all 4 when it says it will. To be honest, I don't know of any official guidelines on the subject, but I would push the idea of less is more and only using what is required. If you're in a pay-per-resource setup, then look into chargeback and present what they could be saving by cutting back to what they actually use.

On the desktop machines, that usage tells me it was a convert from a physical box to a VM and not a fresh build. It might also have some runaway apps or services, like a SQL desktop engine, a virus scan that got stuck, or even indexing. Might want to look for a VMware View server somewhere in that setup.

It's tricky taking things away when people are used to what they have. If you're the sole admin, or even part of a small group, it should be easy: cough, shut down, and "crap, where did that other CPU go?"... Most users won't notice right away and rely on Task Manager. You could also put a VM in its own resource pool and limit its resources, then see if it pegs and they complain. My guess is they won't for a while.

"Sadly, this farm was inherited this way a couple weeks ago"

Sorry to hear that, man; it's never good following in another's footsteps when it wasn't done clean and smooth the first time around. Makes your job a lot harder than it should be. Afraid you might be stuck with politics more than anything at this point.

woodycollins
Contributor

We are under no SLA requirements that I am aware of, just old mentality. And at the moment we are not doing any type of chargeback; we have pitched that idea before, and I don't think it's going to go anywhere. As for the desktop machines' performance: they are clean builds, but they're based on a deployment image intended for a hardware platform. I know no performance tuning has been done that could improve the performance of the virtual machines in our environment (drives me crazy), and now that I am working on it I hope to actually change that.

I have taken the initial steps toward decreasing the multi-vCPU machines, and that will take time when working with others in our company, but according to all the calculations for workload expectations against our hardware base, I should only expect a 3:1 ratio, not the 20:1 number you're indicating.

Some questions on your environment: what physical hardware are you running? And can you give me an idea of your VM configuration and performance counters, such as average CPU usage and ready times? I would also like to hear from any others willing to chime in, to get a better understanding of typical usage.

rmathis
Contributor

OK, a little background on one of my clusters, the one I'm sitting at: 4 hosts with X5560 CPUs and 49GB RAM in each, attached to an iSCSI SAN ("a little slow; no brand mentioned to avoid bashing"). Currently, as of this moment, 118 VMs with a mix of services and some desktops. All guests have been built fresh on this environment like they should be, and that includes the desktops. CPU usage is 35% on all hosts just about all day, and we've had no delays in services except for the blip when DRS moves machines around. I'm not going to lie, these VMs get hit hard. My problem is storage, and that's where my delays come in: it's first-gen iSCSI on this cluster, and it's showing. The new hardware in 4-6 weeks is far nicer. All my VMs are set to use only what they need, based on at least 1 month of history. It took a while to beat some of my other admins into submission, but they have grown to accept that more isn't always better, and a few can't tell they're on VMs.

I have a lot of headroom left on my hosts, as most of my delay is the first-gen iSCSI, which has another 2 years before retirement in this cluster.

My average ready times are less than 8%, and these things have been hammered all day, every day, for the past 2 years. For wait times I go off the vSphere charts instead of esxtop, as they present stats that line up with my crappy storage, which sits at 70ms. esxtop agrees; it's just that I haven't kept a good running record over the past month. The last time I looked they were the same, and that was 4 months or so ago. A physical server is at 55ms, so it's not just the VMs feeling it. When we first set up, we were sitting at 350 to 400ms wait times, which was painful. It took a lot of tinkering and time to get those down. I'm stuck with an older setup, so your times should be far faster once the VMs are trimmed down. Remember: if it's not used, it goes.

Do yourself a major favor: any VMs that are based on hardware deployment images need to go away... I added a few just to see, and they do add a lot of delay to your stuff. If I remember, I'll repost the times when I get my new hardware in; the storage is much quicker, which will remove some of the delays. Too bad most of my servers aren't going to the new stuff; kind of a bummer.

woodycollins
Contributor

Awesome, thanks for the information. I am slowly making headway in educating people about how things work in the virtual world. As for drastic changes such as deploying VMs from cleaner images, that will involve a lot of red-tape cutting and rework on a couple of teams' parts, but I think it can be done, and it's in the back of my mind as a recommendation to submit.

It seems the more I work with the environment, the more work I have ahead of me, but I enjoy it and will someday get to a better performance level. From the numbers you gave, things seem pretty close for averages. From just the work I have done recently across the 5 hosts in this cluster, we have an average host usage of only 22-25% and ready times averaging 8%, even including the boxes I haven't addressed yet with extremely high vCPU counts and ready times (I've managed to eliminate some already). I think I can get things down to a much lower range after "rightsizing" the environment. After reading some of the best-practice documentation, I was just expecting to see much higher host CPU utilization when approaching 5%+ ready times on VMs. If the documentation says 5% ready time is the point at which you should investigate, and that it shouldn't regularly exceed 10%, I was expecting overall host utilization to be higher at the times ready % is higher.

Just out of curiosity, when you say you have X5560 CPUs in your boxes, how many are there in each box?

rmathis
Contributor

Two of them :) Is there any other way to use them? Well, maybe four... They're nice; even at two years old, they still handle everything that's tossed at them.

Sounds like you've been hard at work getting things cleaned up. I understand the red tape all too well. It's hard to get around, but they will see and be happy in the long run. Sounds like you have a good setup going; when you're done, could you toss up a few before-and-after stats?

woodycollins
Contributor

Oh yes, I have done a lot of work, and I expect to do even more, as I believe the environment can be improved even further. I'll see if I can pull some before-and-after stats. The problem with getting the before data is that we purge statistics logs fairly often, and in fact at the beginning of the month we managed to purge a good amount of data I would have liked to keep. (Just another process I've changed to get a better understanding of what is going on in the environment.)

Let me ask you about something you mentioned earlier. When you said "When we first setup we were sitting at 350 to 400ms wait times which was painful", are you saying you now see wait times of 70ms on the VMs' processors when looking at the performance charts in vCenter?
