Hi,

We've recently migrated approximately 15 hosts into a 4-host ESX 3.5 cluster (HA + DRS enabled), but we've noticed some odd issues with some of the guest OSs. Basically what happens is that a couple of times a day (no correlation sadly - the times appear random) a couple of our guest OSs will stop responding in a timely fashion to our monitoring software (Nagios), and when we go to investigate, we find the guest OS almost completely unresponsive. SSH logins take minutes, network services on the guest time out or send partial responses separated by minutes or more of no activity, and even console logins via the VC console can take a minute or more.

Note, as far as we can tell this issue only affects two guests significantly, and these are both Linux (but with different kernels, neither of which is unique on the ESX cluster). It may affect Windows guests or others too, but we've not seen it affect them enough to bother us yet.

On the occasions we've managed to log on, the only thing we've spotted is that 'top' reports an enormous amount of time spent in the "wa" wait state (see attached "top-guest.txt" for an example from this morning). There's also a rather high "sy" time reported, though we're not sure which of these, if either, is cause or effect. We've had this on and off now for over a month, and we have been trying to nail down exactly what the performance bottleneck might be; we're now fairly sure that there is no performance bottleneck - just something that seems to be clogging up the guests.

Running "top" on the ESX host that the unresponsive guest is located on shows the host almost completely idle (see "top-host.txt" attached). Note, the other guests on that server, along with the ESX host itself, generally all report very low to almost idle CPU, memory, and network usage (see attached "stats-host.png" for an example of what the usage graphs generally show). We're able to log in to them with no delays, and they all feel very responsive.
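In case harder numbers than a top screenshot would help: here's a rough Python sketch we could run on the guest to log its iowait percentage over time (it reads the aggregate "cpu" line of /proc/stat twice; the field layout follows proc(5), with iowait as the fifth counter):

```python
# Rough sketch: sample the guest's iowait percentage from /proc/stat,
# so spikes can be logged with timestamps and correlated with outages.
import os
import time

def cpu_fields(stat_line):
    """Parse the aggregate 'cpu' line of /proc/stat into integer counters."""
    parts = stat_line.split()
    # Order per proc(5): user, nice, system, idle, iowait, irq, softirq, ...
    return [int(x) for x in parts[1:]]

def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0

if __name__ == "__main__" and os.path.exists("/proc/stat"):
    with open("/proc/stat") as f:
        first = cpu_fields(f.readline())
    time.sleep(1)  # sample interval; longer in practice
    with open("/proc/stat") as f:
        second = cpu_fields(f.readline())
    print("iowait: %.1f%%" % iowait_percent(first, second))
```

Run from cron every minute or so, this would give a timestamped record to line up against the Nagios alerts.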
I read the manuals, at first thinking we had VMware virtual memory vs. host virtual memory thrashing going on, but that doesn't appear to be the problem since we've never had a non-zero balloon value (and our hosts are ridiculously undercommitted). I then thought perhaps we had a badly organised resource pool structure, so we completely reorganised it and gave the problematic guest OSs generous reservations etc. That seemed to alleviate the problem for a few days, but it got worse again, and this morning it's just been up and down like crazy.

I looked into esxtop on the ESX host, but unfortunately the column I was most interested in, "%WAIT", shows values in the tens of thousands of percent, so either there really is a massive wait problem, or esxtop isn't giving us very good readings. I read the manual on it, but it's not very helpful, at least to someone like me who isn't already an expert in the concepts (as an aside, if that's the case, what's the manual for?!).

So, here's a brief rundown of our setup:

- 1 PowerEdge 2850 with ~150GB of onboard RAID storage, running OpenSolaris with ZFS shared via NFS as the backend storage for the cluster.
- 1 PowerEdge 2850 running Windows with VirtualCenter, acting as the license and cluster manager.
- 4 PowerEdge 2850s with dual 3.6GHz Xeon processors (+HT) and 4GB of RAM each, running ESX 3.5, with dual gigabit NICs.
- 2 gigabit switches, configured so each ESX host has an active/standby setup with one NIC connected to each switch.

We have HA and DRS enabled, and all hosts have the same network and storage configuration (so guests are free to migrate anywhere DRS wants them). Each VM host has its own NFS storage device/pool exported from the ZFS/NFS host. We've already considered network or disk overutilisation as the cause, and we can confirm that the disks on the ZFS host are well underutilised, and the network links are well below any significant load levels.
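To put hard numbers on the "stops responding in a timely fashion" symptom (rather than relying on Nagios alert timestamps), we could run a small probe from another machine. This is a rough Python sketch - the hostname is a placeholder - that times TCP connects to a guest's SSH port:

```python
# Rough sketch: time TCP connects to a guest's SSH port from outside,
# to record exactly when and for how long the guest goes unresponsive.
import socket
import time

def connect_latency(host, port, timeout=30.0):
    """Seconds taken to complete a TCP connect, or None on failure/timeout."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:  # covers refused, unreachable, DNS failure, timeout
        return None

if __name__ == "__main__":
    for _ in range(3):  # a few samples; loop or cron it in practice
        latency = connect_latency("guest-hostname.example", 22)  # placeholder
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(stamp, "timeout" if latency is None else "%.3fs" % latency)
        time.sleep(1)
```

A log like that would also show whether the slowness ramps up gradually or hits all at once.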
As I said, we've spent a month monitoring the entire system end to end, and at the times the guests have problems, the only thing in the entire system that isn't almost completely idle is the guest OS, and it's sitting in wait and system time almost exclusively, getting no real work done.

Oh, I also did some Google searches and some searching in the forums/communities here, and couldn't find anything that turned out to really help with the problem. On Google I did spot one group of people saying a Linux guest shouldn't get 1GB or more of RAM, so we dropped our guest to 768MB and it got a lot better, though the problem didn't completely go away. The difference was: when we upped the RAM to 1GB to try and improve performance, the wait problem was present almost 100% of the time; when we dropped it to 768MB, it only affects us maybe 2-3 times a day for 10-15 mins at a time.

Anything else anyone needs? Any suggestions?
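One more data point we could collect from inside an affected guest, since the disks are NFS-backed: timing synchronous writes, to see whether storage latency spikes line up with the wait spikes. A rough Python sketch (the target path is just a scratch location, not anything specific to our setup):

```python
# Rough sketch: measure the latency of O_SYNC writes from inside the
# guest, as a proxy for the storage latency the VM actually experiences.
import os
import tempfile
import time

def sync_write_latency(path, size=4096, count=10):
    """Average seconds per synchronous write of `size` bytes to `path`."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o600)
    data = b"\0" * size
    try:
        start = time.monotonic()
        for _ in range(count):
            os.write(fd, data)  # O_SYNC forces each write to stable storage
        return (time.monotonic() - start) / count
    finally:
        os.close(fd)
        os.unlink(path)  # clean up the scratch file

if __name__ == "__main__":
    target = os.path.join(tempfile.gettempdir(), "latency-probe.bin")
    print("avg sync write: %.4fs" % sync_write_latency(target))
```

If the per-write latency jumps from milliseconds to seconds during an incident, that would point at the storage path rather than the guest itself.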