We use ESX 3.5 to host several WebSphere servers, a couple of HTTP servers, and a Content Manager server, which together form the front end to an enterprise application backed by an IBM iSeries.
Our four application servers, which are guests on the ESX host, run Windows 2003 Advanced Server, and we have had two separate incidents where these four machines lock up (RDP becomes unresponsive, pings to the guest time out, applications on the guest report CPU starvation, etc.). In both incidents the issue lasted 4–5 days, during which the four servers almost took turns blacking out for a few minutes at a time. Needless to say, during the blackouts the user experience of the application is terrible and users are generally thrown out.
On the ESX host in question, the four main application servers are the worst affected. Our HTTP servers don't seem to have the issue, nor does our Content Manager server (which is also WebSphere based). We also have a couple of 32-bit Win2003 guests on this ESX host which do a tiny bit of work and only rarely report CPU starvation.
I am confident the problem is not the application running on the four 64-bit guests, as it only runs during the day (the app server JVMs are stopped in the evening) and yet the issue still occurs. Load does appear to make the problem worse, though.
I do not have direct access to ESX tools such as esxtop, but our admin guys have given me some esxtop data which I have looked at. The interesting thing I am seeing is periods of high %RDY (ready) time for a VM that is currently suffering a blackout (100+ percent, presumably summed across its vCPUs). During these periods, total physical CPU usage is nowhere near 100%, more like 50–60%. I have concluded that the issue is not a shortage of physical CPU; rather, some shared resource the VM needs seems to be unavailable, so the VM sits in the ready state for long periods.
I am looking for ideas on what questions I should be asking the admin guys to try to resolve this. The first incident happened in late Nov 2010; it has recently occurred again from the end of March into the beginning of April. On both occasions the issue resolved itself with no (authorised) changes being made. Between the two really serious incidents we have had other spells where CPU starvation was reported by the WebSphere JVMs, but only around 5 instances across the 4 VMs, compared to the hundreds a day we were getting when the servers were completely blacking out.
On all of the guest VMs on this ESX host we now run our own Java heartbeat monitor 24/7, which reports CPU starvation whenever it observes a scheduling delay of more than 2 seconds.
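For anyone curious what I mean by a heartbeat monitor, the idea is roughly the following: a thread sleeps for a short fixed interval, then compares the wall-clock time that actually elapsed against the expected interval; if the guest (or its vCPU) was descheduled by the hypervisor, the gap shows up as extra delay. This is a minimal sketch of the technique, not our production code, and the names and thresholds in it are my own:

```java
// Minimal sketch of a guest-side CPU-starvation heartbeat.
// A delay well beyond the sleep interval suggests the VM was
// descheduled (e.g. stuck in the ESX "ready" state).
public class HeartbeatMonitor {

    static final long INTERVAL_MS = 500;    // heartbeat period (illustrative)
    static final long THRESHOLD_MS = 2000;  // report delays over 2 seconds

    // Sleep for intervalMs and return the extra delay actually observed,
    // i.e. how long past the requested interval the thread was held up.
    static long measureDelayMs(long intervalMs) throws InterruptedException {
        long start = System.nanoTime();
        Thread.sleep(intervalMs);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        return Math.max(0, elapsedMs - intervalMs);
    }

    public static void main(String[] args) throws InterruptedException {
        // In production this would loop forever; a few beats for illustration.
        for (int beat = 0; beat < 3; beat++) {
            long delay = measureDelayMs(INTERVAL_MS);
            if (delay > THRESHOLD_MS) {
                System.out.println("CPU starvation suspected: heartbeat delayed "
                        + delay + " ms");
            } else {
                System.out.println("heartbeat ok, delay " + delay + " ms");
            }
        }
    }
}
```

On an idle machine the measured delay is normally a few milliseconds of timer slop; during our blackouts the same measurement jumps into multi-second territory.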
Any help would be greatly appreciated on this one.
PS: the latest VMware Tools are installed on the guest VMs. However, I am told there are updates available for the ESX host itself, but the admin guys are reluctant to apply them for fear of introducing new problems.