VMware Cloud Community
milesmcever
Contributor

High CPU usage since 3.5 upgrade

I have 4 servers in this cluster, and it seems to be happening only on the Dell 2850s; the 2650s and 2950s seem to be fine. I have a couple of multiprocessor 2000 boxes that were P2V'd, and since the upgrade they are using about 4000 MHz of CPU with 18 shares. Memory usage is ugly as well, at 95% of the shares, but with only 205 MB used out of the 2 GB given to the VMs. I have updated the VMware Tools and even tried making them single-processor boxes, but that caused ungodly slowness. Not sure if anyone else has seen this or knows a cure, but I am open to suggestions. Right now the CPU is set to unlimited with normal shares.

35 Replies
tomgo
Contributor

Foundation -> previously Starter.

Thanks for the reply, etieseler. I'll read it.

Regards,

tox

nonarKitten
Enthusiast

We're having the same problems with our three 3.5 hosts. :_|

Two are still the original build, one has been patched - made no difference. We have three 2850's (2x2.8GHz DC Xeon, 8GB, 8xGbE) and all of them show the exact same problem - whenever we vmotion or svmotion a virtual machine, it does so well enough, but once it's done it will hit 100% CPU load whenever anyone writes to disk (either VMDK or MSiSCSI). This is especially evident on our Citrix server.

Completely shutting the VM down, waiting a minute, then turning it back on seems to work so far, but I'd rather be able to do this without kicking everyone off Citrix.

Anyway, I've disabled DRS for the time being and will see if that helps. We're moving our accounting (SQL) server now, so I'll let you know, good or bad, whether disabling DRS helped.

-- nonarKitten

milesmcever
Contributor

I had 3 clusters: one with four 2650s, one with four 2850s, and one with three 2950s. I was hoping this post would spark something, but it looks as though I'm not the only one with this problem. Since the post I have rolled all servers back, and life on the farm is great again. All the ESX servers were using almost all of their CPU power, VMs were at 90 to 100% CPU usage, and users were not happy; now the servers don't break the 50% mark on CPU usage and VM performance is back up. I guess I will wait a while before going back. I looked at the patches out there and none seem to pertain to this. Even then, I will do what I should have done to begin with: upgrade just the 2650 test cluster first and make sure things behave. I did leave VCenter on the latest version, even though they took away the 64-bit client and made it 32-bit only.

nonarKitten
Enthusiast

Sounds like VMware know about this - might be the problems shown in KB 1003638. I'm trying it now (but fighting tooth and nail to get this setting "to stick").

paulo_meireles

Windows 2003 used to be a real pain when it came to upgrading to SMP and then reverting to Uniprocessor. I even wrote a full article describing a simple method to do it.

However, with SP2 we have been blessed with the possibility to revert to Uniprocessor HAL (not the Standard ACPI that used to be a dead end), from which we can again upgrade to Multiprocessor, as many times as we wish.

I'm talking about changing it from the Device Manager, as other methods (some quite convoluted) have always been available.

Regards,

Paulo

nonarKitten
Enthusiast

Hmm - only the Virtual Center fix worked for me - manually changing it always had the server reset it back to zero. Anyway, it's set to 6 for me, and I just vmotioned a machine without incident. I'll try a couple more that were giving me problems to see if that did it - if so, I think this one's fixed. B-)

milesmcever
Contributor

Thanks, nonarKitten, for the KB article. It matches my problem to a T, and it is dated yesterday: they published it a day after I rolled back.

FireDog7881
Contributor

This is more of a confirmation than anything. I have had the same problem since the release of ESX 3.5 and VC 2.5. We like to keep our DRS setting between the middle and aggressive so that servers are vmotioned automatically and the load stays nicely balanced. This hurt our systems so badly that they were almost unworkable. Turning off DRS made the problems vanish, but we have come to rely on DRS. After weeks of talking with VMware, they finally sent me something last night, basically the same thing all of you are saying, and they said it was being worked under 227676. I searched for that and found this post. I then followed the KB article mentioned earlier, 1003638, and it explained my problem exactly.

I am waiting for approval from management to apply the fix, but I am confident, based on their explanation of "why", that this will fix our problem. Hopefully I will be able to get this working over the weekend and keep an eye on it. I will post back with results.

nonarKitten
Enthusiast

No problem, miles :) We've been running it for close to two days now and it looks like the problem has been beaten. It's probably safe for you to reapply 3.5 :) I know - it sucks that they JUST came out with it after you rolled back. I'd apply the fix before you upgrade anything, though, so you can leave DRS on and vmotion without prejudice. Happy upgrading.

Edit: Oh, oh - do I get points for answering this?

Gui
Enthusiast

For me, only the vpxd.cfg change fixed the problem; everything else was rolled back within minutes.

If you install VC 2.5, change the vpxd.cfg, and then install ESX 3.5, you should have no problem.

The real problem lies in VC 2.5, as nonarKitten also points out.

After changing the vpxd.cfg and rebooting the VC server (or restarting the service), it took 1 to 5 minutes to replicate.
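For anyone who prefers to script the vpxd.cfg edit rather than do it by hand, here is a rough Python 3 sketch. The thread does not quote the actual setting from KB 1003638, so the element path and value are left as arguments, and `set_cfg_value` is a made-up helper name, not anything VMware ships. It assumes vpxd.cfg is a well-formed XML file; note that `xml.etree.ElementTree` drops comments when it rewrites a file, which is one reason a backup is taken first.

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path


def set_cfg_value(cfg_path, element_path, value):
    """Back up an XML config file, then set the text of one element.

    element_path is a slash-separated path below the root element,
    e.g. "section/option". Those names are hypothetical placeholders:
    take the real ones from the KB article. Elements missing along
    the path are created.
    """
    cfg = Path(cfg_path)
    # Keep an untouched copy next to the original before changing anything.
    shutil.copy2(cfg, cfg.with_name(cfg.name + ".bak"))

    tree = ET.parse(str(cfg))  # raises ParseError if the file is not well-formed
    node = tree.getroot()
    for tag in element_path.split("/"):
        child = node.find(tag)
        if child is None:
            child = ET.SubElement(node, tag)
        node = child
    node.text = str(value)

    tree.write(str(cfg), encoding="utf-8", xml_declaration=True)
    return node.text
```

Run it against a copy of the file first and compare the result with the backup. The VC service still has to be restarted afterwards, as described above.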

Quigibo
Contributor

We've had the exact same problems since we deployed this month's patches. We were concerned, thinking we would have to wipe and reload back to 3.5. We have a few large Microsoft TS clusters that were running at nearly 95% all the time, regardless of how many users were logged in. Some production web servers shot up to 99% CPU last Monday and then miraculously fixed themselves 30-60 minutes later. I initially looked on the forums and didn't find anything, but tonight I was browsing again and stumbled upon this. I went to VC and disabled DRS, and within 60 seconds all of our TS environments were down to 1%-5%, as expected. I'll do the "workaround" tomorrow evening if tomorrow goes well with DRS disabled. Frustrating. We figured it was this month's patches, but most likely all of the VMotioning we did during the updates caused the issues.

wearmg
Contributor

I've had a similar thing happen to me. After vmotioning some VMs from my ESX 3.0.2 servers to my 3.5 servers, the guest CPU would run near 100%, with random processes eating up the CPU. If I killed one high-CPU process, another process would spin up to 100% to take its place. So far it's happened on 6 of my 350 VMs, and all 6 of those have SQL installed. The only fix I found was to shut down the VM, unregister it from Virtual Center, make a backup of the config file, and then edit its config file to remove all of the CPUID masking entries that were added during the vmotion from 3.0.2 to 3.5. Re-registering the new config file and powering the VM up seemed to fix the issues.
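The config-file cleanup above could be scripted along these lines. This is only a sketch in Python 3, run against the .vmx file while the VM is powered off and unregistered. It assumes the masking entries are lines whose key starts with `cpuid.` (for example `cpuid.1.ecx`); check your own file and adjust the prefix tuple if the entries look different. `strip_cpuid_entries` is an illustrative name, not a VMware tool.

```python
import shutil
from pathlib import Path

# Assumption: CPUID mask entries are .vmx lines whose key starts with "cpuid.".
CPUID_PREFIXES = ("cpuid.",)


def strip_cpuid_entries(vmx_path, prefixes=CPUID_PREFIXES):
    """Back up a .vmx file, then rewrite it without CPUID mask lines.

    Returns the removed lines so they can be reviewed afterwards.
    """
    vmx = Path(vmx_path)
    # Keep an untouched copy (name.vmx.bak) before editing.
    shutil.copy2(vmx, vmx.with_name(vmx.name + ".bak"))

    kept, removed = [], []
    for line in vmx.read_text().splitlines(keepends=True):
        # The key is everything before the first "="; compare case-insensitively.
        key = line.split("=", 1)[0].strip().lower()
        if key.startswith(prefixes):
            removed.append(line)
        else:
            kept.append(line)

    vmx.write_text("".join(kept))
    return removed
```

Printing the returned list shows exactly which entries were dropped, and the `.bak` copy can be put back if the VM misbehaves after it is re-registered.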

I've turned off DRS on my cluster because once this problem starts in a vm it's pretty much unusable to the end users until fixed. Does anyone know how to remove the CPUID entries without having to power off and reregister the vm?

Cody_Page
Enthusiast

We went through the exact same problem this week: 3.0.2 to 3.5 U1, VC 2.5 U1. It appears VMware still hasn't fixed this issue. VMotioning SQL VMs from a 3.0.2 host caused random processes to spool up to 100%. After a VMotion back to 3.0.2, the issue would go away. With the VM on the 3.5 host, unchecking DRS and waiting 60 seconds would cause the vCPU usage to fall. Once we re-checked it, the issue appeared to be gone. We had to repeat this process for every 3.5 host in the cluster. This resulted in some unhappy users and unresponsive or crashing apps. Seems fine now.

JeromeWentink
Contributor

The issue mentioned by Cody_Page sounds like an issue that VMware talks about in this article:

Hope this will help. I ran into this issue a while ago.

unclefab
Contributor

Hi all,

I am planning to migrate a few clusters this weekend from ESX 3.0.2 to 3.5.

Is this issue solved now with the latest version of Virtual Center? According to the KB, it should be with the latest VirtualCenter 2.5.0 Update 1.

Thanks,

Fabien

Cody_Page
Enthusiast

The short answer is no. We did a clean install of ESX 3.5 U1 on our hosts and performed an in-place update of VC 2.5 U1. For us, however, it turned out to be an easy fix: we just needed to uncheck DRS, wait 60 seconds, and the problem resolved itself. This needed to be done for each host. You may want to wait until your maintenance window to vmotion your SQL VMs from your 3.0.2 hosts to your 3.5 hosts.
