We have 2 servers, each running ESX Server 2.5.4 build 32233. CPU utilisation on both systems is running between 95% and 100%, yet all other resources are at normal levels. Some of the guest servers (Windows 2003) will not boot; they blue screen while loading Windows. Of the guests that are up, some are logging the following error in Event Viewer: An error was detected on device \Device\Harddisk1\DR1 during a paging operation. Storage for both is on a SAN, which other devices also access with fine performance. The physical servers are showing no errors. Does anyone have any ideas?
Do you have any third-party agents installed in the COS, such as HP management or backup agents?
Check your running processes, as it sounds as though an app of some kind is kicking the hell out of your boxes.
Update on this: contrary to what I previously wrote, the problem was being caused by a component in the SAN. The partition for the VMs was switching between the 2 controllers on the SAN. In the short term we have taken one of the controllers offline, and performance is looking a lot better.
No Richard, what I mean is that there are 2 multipath options for ESX: MRU (Most Recently Used) or Fixed Path (I want this one...). Depending on your SAN type, the vendor will recommend one or the other.
MRU is good in the type of situation you have, as it will fail over at the first loss of connectivity and then stay put until the next change, whereas Fixed Path will always try to use the path of your choice. What can happen with Fixed Path is that an intermittent connection causes a "flip-flop" effect on the server: the path goes down and fails over, then it comes up and fails back, then it goes down again, and so on.
Check the SAN vendor release notes for the recommended policy.
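To make the flip-flop effect concrete, here is a rough Python sketch of the two policies as described above. It is only a toy model, not how ESX implements multipathing: the function names, the path names "A" and "B", and the tick-by-tick timeline are all made up for illustration.

```python
def simulate(policy, preferred, timeline):
    """Return the active path chosen at each tick (toy model, not ESX code).

    policy    -- "fixed" or "mru" (hypothetical labels for this sketch)
    preferred -- name of the preferred/initial path
    timeline  -- list of dicts mapping path name -> True if the path is up
    """
    active = preferred
    history = []
    for status in timeline:
        if policy == "fixed" and status[preferred]:
            # Fixed Path: always go back to the preferred path when it is up.
            active = preferred
        elif not status[active]:
            # Either policy: the active path is down, fail over to any live path.
            candidates = [name for name, up in status.items() if up]
            if candidates:
                active = candidates[0]
        # MRU with a healthy active path: stay put.
        history.append(active)
    return history


def count_switches(history):
    """Count how many times the active path changed between ticks."""
    return sum(1 for a, b in zip(history, history[1:]) if a != b)


# Path A has an intermittent fault (up, down, up, down, ...); path B is stable.
timeline = [{"A": up, "B": True} for up in [True, False, True, False, True, False]]

fixed_history = simulate("fixed", "A", timeline)
mru_history = simulate("mru", "A", timeline)

print("fixed:", fixed_history, "switches:", count_switches(fixed_history))
print("mru:  ", mru_history, "switches:", count_switches(mru_history))
```

With one flapping path, the Fixed Path model changes path on every tick, while the MRU model fails over once and stays on the stable path.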