I have 4 servers in this cluster, and it seems to only be happening on the Dell 2850s; the 2650s and 2950s seem to be fine. But I have a couple of multi-proc 2000 boxes that were P2V'd, and since the upgrade they are using about 4000 MHz of CPU with 18 shares. Memory usage is ugly as well, at 95% of the shares but only 205 MB used out of the 2 GB given to the VMs. I have updated VMware Tools and even tried making them single-proc boxes, but that caused ungodly slowness. Not sure if anyone else has seen this or knows a cure, but I am open to suggestions. Right now the CPU is set to unlimited with normal shares.
Foundation -> previously Starter.
Thanks for the reply, etieseler. I'll read it.
We're having the same problems with our three 3.5 hosts. :_|
Two are still the original build; one has been patched, which made no difference. We have three 2850s (2x2.8GHz DC Xeon, 8GB, 8xGbE) and all of them show the exact same problem: whenever we VMotion or svMotion a virtual machine, the migration itself goes well enough, but once it's done the VM will hit 100% CPU load whenever anyone writes to disk (either VMDK or MS iSCSI). This is especially evident on our Citrix server.
Completely shutting the VM down, waiting a minute, then turning it back on seems to work so far, but I'd rather be able to do this without kicking everyone off Citrix :smileyblush:
Anyway, I've disabled DRS for the time being and will see if that helps. We're moving our accounting (SQL) server now, so I'll let you know, good or bad, whether disabling DRS helped.
I had 3 clusters: one with 4 2650s, one with 4 2850s, and the other with 3 2950s. I was hoping this post would spark something, but it looks as though I'm not the only one with this problem. Since the post I have rolled all servers back, and life on the farm is great again. Before the rollback, all the ESX servers were almost using all their CPU power, VMs were at 90 to 100% CPU usage, and users were not happy; now the servers don't break the 50% mark on CPU usage and VM performance is back up. I guess I will wait a while before going back. I looked at the patches out there and none seem to pertain to this. Even then, I will do what I should have done to begin with: just upgrade the 2650 test cluster first and make sure things act right. But I left VirtualCenter on the latest version, even though they took away the 64-bit client and made it 32-bit only.
Sounds like VMware knows about this - it might be the problem described in KB 1003638. I'm trying it now (but fighting tooth and nail to get the setting to stick).
Windows 2003 used to be a real pain when it came to upgrading to SMP and then reverting back to uniprocessor. I even wrote a full article describing a simple method to do it.
However, with SP2 we have been blessed with the ability to revert to the Uniprocessor HAL (not the Standard ACPI one that used to be a dead end), from which we can upgrade to Multiprocessor again, as many times as we wish.
I'm talking about changing it from the Device Manager, as other methods (some quite convoluted) have always been available.
Hmm - only the Virtual Center fix worked for me; manually changing it always had the server reset it back to zero. Anyway, it's set to 6 for me, and I just VMotioned a machine without incident. I'll try a couple more that were giving me problems to see if that did it - if so, I think this one's fixed. B-)
Thanks, nonarkitten, for the KB article - it matches my problem to a T. And the date on it is yesterday; they published it a day after I rolled back.
This is more of a confirmation than anything. I have had the same problem since the release of ESX 3.5 and VC 2.5. We like to keep our DRS settings between the middle and aggressive for automatic VMotion of servers, to keep the load balance nice. This hurt our systems SO BAD that we were almost unworkable. We turned off DRS and the problems vanished, except that we have come to rely on DRS. After talking with VMware over the past few weeks, they finally sent me something last night - basically the same thing all of you are saying - and said it was being worked under 227676, which I searched for and found this post. I then followed the KB article mentioned earlier, 1003638, and it explained my problem exactly.
I am waiting for approval from management to apply the fix, but based on their explanation of "why", I am confident this will fix our problem. Hopefully I will be able to get it working over the weekend and keep an eye on it. I will post back with results.
No problem, miles. We've been running it now for close to two days, and it looks like the problem has been beaten. Probably safe for you to reapply 3.5. I know - it sucks that they came out with it JUST after you rolled back. I'd apply the fix before you upgrade anything, though, so you can leave DRS on and VMotion without prejudice. Happy upgrading.
Edit: Oh, oh - do I get points for answering this?
For me, only the vpxd.cfg change fixed the problem; everything else was rolled back in minutes.
If you install VC 2.5, change the vpxd.cfg, and then install ESX 3.5, you should have no problem.
The real problem lies in VC 2.5, as nonarkitten also points out.
After changing the vpxd.cfg and rebooting the VC server (or restarting the service), it took from 1 to 5 minutes to replicate.
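For anyone hunting for the file itself: vpxd.cfg is a plain XML file in the VirtualCenter server's configuration directory, and the fix is adding one setting to it. A rough sketch of the shape of the edit follows - the element name below is a placeholder, since the exact name and location come from KB 1003638; only the value 6 is from this thread. Restart the VMware VirtualCenter Server service afterwards for it to take effect.

```xml
<config>
  <!-- existing vpxd.cfg contents stay as they are -->

  <!-- placeholder element: substitute the exact name and location
       given in KB 1003638; "6" is the value reported in this thread -->
  <settingFromKB1003638>6</settingFromKB1003638>
</config>
```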
We've had the exact same problems since we deployed this month's patches. I was concerned, thinking we had to wipe/load back to 3.5. We have a few large Microsoft TS clusters that were running at nearly 95% all the time, regardless of how many users were logged into them. We had some production web servers shoot their CPU up to 99% last Monday and then miraculously fix themselves 30-60 minutes later. I initially looked on the forums and didn't find anything, but tonight I was browsing again and stumbled upon this. I went to VC and disabled DRS, and within 60 seconds all of our TS environments were down to the expected 1-5%. I'll do the "workaround" tomorrow evening if tomorrow goes well with DRS disabled. Frustrating. We figured it was this month's patches, but most likely all of the VMotioning we did during the updates caused the issues.
I've had a similar thing happen to me. After VMotioning some VMs from my ESX 3.0.2 servers to 3.5 servers, the guest CPU would run near 100% with random processes eating up the CPU. If I killed one high-CPU process, another process would spin up to 100% to take its place. So far it's happened on 6 of my 350 VMs, and all 6 of them have SQL installed. The only fix I found was to shut down the VM, unregister it from VirtualCenter, make a backup of the config file, then edit the config file and remove all of the CPUID masking entries that were added during the VMotion from 3.0.2 to 3.5. Re-registering the new config file and powering it up seemed to fix the issue.
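For reference, here is a rough sketch of that edit as shell commands, run against a throwaway copy of a .vmx file. The file name, path, and mask values are made up for illustration (real files live under /vmfs/volumes/&lt;datastore&gt;/&lt;vm&gt;/), and it assumes GNU sed for in-place editing. As described above, the VM should be powered off and unregistered first.

```shell
# Illustrative stand-in for a real .vmx; the cpuid.* entries mimic the
# CPUID masks added during a 3.0.2 -> 3.5 VMotion.
VMX=./myvm.vmx
cat > "$VMX" <<'EOF'
memsize = "2048"
numvcpus = "2"
cpuid.80000001.edx = "----:----:----:----:----:----:--H-:----"
cpuid.80000001.edx.amd = "----:----:----:----:----:----:--H-:----"
EOF

cp "$VMX" "$VMX.bak"           # keep an untouched backup of the config
sed -i '/^cpuid\./d' "$VMX"    # drop every CPUID masking entry
```

After that, re-register the cleaned file in VirtualCenter and power the VM back on.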
I've turned off DRS on my cluster, because once this problem starts in a VM, it's pretty much unusable for the end users until it's fixed. Does anyone know how to remove the CPUID entries without having to power off and re-register the VM?
We went through the exact same problem this week: 3.0.2 to 3.5 U1, VC 2.5 U1. It appears VMware still hasn't fixed this issue. VMotion from a 3.0.2 host caused random processes on SQL VMs to spike to 100%. VMotion back to 3.0.2 and the issue would go away. When the VM was on the 3.5 host, unchecking DRS and waiting 60 seconds would cause the vCPU usage to fall; once we re-checked it, the issue appeared to be gone. We had to repeat this process for every 3.5 host in the cluster. This resulted in some unhappy users and unresponsive/crashing apps. Seems fine now.
I have planned to migrate a few clusters this weekend from ESX 3.0.2 to 3.5 ...
Is this issue solved now with the latest version of VirtualCenter? According to the KB, it should be with the latest VirtualCenter 2.5.0 Update 1.
The short answer is no. We did a clean install of ESX 3.5 U1 on our hosts and performed an in-place upgrade to VC 2.5 U1. For us, however, it seemed to be an easy fix: we just needed to uncheck DRS, wait 60 seconds, and the problem resolved itself. This needed to be done for each host. You may want to wait until your maintenance window to VMotion your SQL VMs from your 3.0.2 hosts to your 3.5 hosts.