Question regarding CPU utilization

dglick · ‎11-07-2008

We have 4 physical servers running a VMware ESX cluster with about 14 virtualized servers spread across them. Yesterday the CPU in one of the physical servers overheated and the server crashed. All the VMs on that particular server were moved automatically, and that was successful. However, our web VM (one of the VMs on the crashed machine) seemed to have a huge issue after it was transferred. Basically the machine was extremely sluggish, to the point of software failing due to timeouts. On that machine we have a web site hosted by IIS and a SQL Server instance. SQL appeared to be the culprit at first, because its CPU utilization was very high in Task Manager and perfmon. But even after we shut down the SQL service, the server still seemed slow and was not performing as it usually does. All queries run against the databases were taking a very long time to complete, and that's why the web site was failing as well.

We tried troubleshooting everything to determine why the server was running so poorly. We ran all diagnostics against SQL, restarted the virtual machine, restarted the physical VM server, brought back the crashed server and moved the web server back, but nothing worked. About 5 hours later (after we had already started building a new non-VM physical server to transfer our website to) the issue just seemed to vanish.

So while thankfully the issue is gone, I am a bit nervous about what this problem was. I am wondering if the CPU utilization metric in that instance was not a good indicator, and there was a bottleneck somewhere else due to the VM server recovering from the crash. My gut tells me this was not a SQL issue, but if it's not SQL then what was it?

Any ideas? Thanks!

Number774 · ‎12-02-2008

Have a look at the memory use, especially ballooning. I've got a problem where ballooning seems to use inordinate amounts of CPU, and that might be what you are hitting.

Given time, VMware spots virtual pages across the VMs that have the same contents and maps them to the same physical pages. That reduces physical memory consumption, possibly to the point where ballooning is no longer needed.
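
To make the page-sharing idea above concrete, here is a rough Python sketch. It is only a toy model, not VMware's code: the function name `count_physical_pages`, the 4 KB page size, and the sample VM contents are all made up for illustration. It buckets pages by a hash of their contents and then compares the full contents, so identical pages from different VMs end up counted once.

```python
import hashlib
from collections import defaultdict

PAGE_SIZE = 4096  # assumed page size for this toy model


def count_physical_pages(vm_pages):
    """Return (virtual_pages, physical_pages) after sharing identical pages.

    vm_pages maps a VM name to a list of page contents (bytes).
    Hypothetical helper for illustration only.
    """
    buckets = defaultdict(list)  # hash digest -> distinct page contents seen
    virtual = 0
    for pages in vm_pages.values():
        for page in pages:
            virtual += 1
            bucket = buckets[hashlib.sha1(page).digest()]
            # Full comparison, so a hash collision never merges different pages.
            if page not in bucket:
                bucket.append(page)
    physical = sum(len(bucket) for bucket in buckets.values())
    return virtual, physical


# Made-up example: two VMs that share zeroed pages and one common page.
zero = bytes(PAGE_SIZE)
common = b"a" * PAGE_SIZE
example = {
    "web-vm": [zero, zero, common],
    "db-vm": [zero, common, b"b" * PAGE_SIZE],
}
virtual, physical = count_physical_pages(example)
print(f"{virtual} virtual pages backed by {physical} physical pages")
```

The point of the sketch is the timing: right after a restart nothing has been scanned yet, so every virtual page needs its own physical page, and the savings only show up once the scan has had time to run.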

Ken_Cline · ‎12-02-2008

How many VMs were on the host that failed? I'm assuming that the host was in an HA-enabled cluster and that HA did the automatic restart of the failed VMs, yes? When HA restarts the failed VMs, it will start them all on the same host. This can put a tremendous strain on that host and cause performance problems. If your cluster is DRS-enabled, DRS will - over time - move VMs around to help alleviate contention. With only 14 VMs in your four-host cluster, I wouldn't expect this to be a huge problem, but it could explain part of it...
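
A toy Python model of the pile-up and gradual rebalancing described above may help picture it. This is not how DRS actually decides on migrations; the greedy rule, the host names, and the per-VM CPU load numbers are all hypothetical and chosen only for illustration.

```python
def rebalance_step(hosts):
    """Move one VM from the busiest host to the idlest one, if that helps.

    hosts maps a host name to a list of per-VM CPU loads (arbitrary units).
    Returns True if a VM was moved. Hypothetical greedy rule, not DRS.
    """
    busiest = max(hosts, key=lambda h: sum(hosts[h]))
    idlest = min(hosts, key=lambda h: sum(hosts[h]))
    gap = sum(hosts[busiest]) - sum(hosts[idlest])
    # Only a VM smaller than the gap narrows the difference between the two.
    candidates = [load for load in hosts[busiest] if load < gap]
    if not candidates:
        return False
    load = max(candidates)
    hosts[busiest].remove(load)
    hosts[idlest].append(load)
    return True


# Made-up state right after a restart: esx3 received every VM from the
# failed host (the last four entries) on top of its own three.
hosts = {
    "esx1": [20, 15, 10],
    "esx2": [25, 10, 10],
    "esx3": [15, 15, 10, 30, 20, 10, 5],
}
print("before:", {h: sum(v) for h, v in hosts.items()})
while rebalance_step(hosts):
    pass  # each pass stands in for one migration happening over time
print("after: ", {h: sum(v) for h, v in hosts.items()})
```

In this model the overloaded host stays overloaded until enough migrations have happened, which would look like a slowdown that clears up without anyone fixing anything.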

Ken Cline VMware vExpert 2009 VMware Communities User Moderator Blogging at: http://KensVirtualReality.wordpress.com/