We recently had an event in our VMware cluster that caused a host, and consequently several guests, to go offline. I was on vacation at the time, so the details are a bit sketchy, but it went something like this. One of my team noticed a service was not responding. He went to the VM and tried to open a console but got an error and could not. In addition, several other guests on that host were not responding to ping. He noticed the host had issued an alert for memory usage, and it appeared the host had run out of memory. When he tried to power cycle the guest, all options were grayed out, along with the status of several other guests. After several minutes of this he logged into the console of the host and tried to reboot it. Several more minutes later, the host still appeared to be hung (along with several guests). After rebooting the server via the DRAC, the guests started to migrate over (via HA) to other hosts. It seems that something spiked the load on this host (perhaps several guests doing something at the same time). All this time, other hosts in the cluster had plenty of free memory.
I’ve spent several hours poring over the DRS documentation in an attempt to better understand how it works. My biggest takeaway is that DRS does not really attempt to load balance, but rather to ensure that VMs have access to sufficient resources. Well, in this case it seems to have failed, so I need to make some changes. However, I’m at a loss as to what to change. I’m currently at the middle DPM threshold (3), and I don’t have any reservations or shares configured. I could move to a more aggressive threshold, or even the most aggressive (5), but from what I have read that probably won’t cause the cluster to rebalance. Are reservations and limits really my only choice?
It seems kind of silly to me that there isn’t a setting that says “if host A is using 90% of its memory and host B is using 10%, move stuff from A to B.” In this situation A might theoretically be able to service all its guests and provide sufficient resources, and that is probably more efficient, but it also seems a lot riskier. If an event comes along (a backup, a web crawl, etc.) that suddenly spikes load for a number of VMs, host A might not be able to fulfill all the requirements, while the resources on host B go to waste sitting idle.
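To make the rule I have in mind concrete, here is a minimal Python sketch of that naive "utilization spread" heuristic. The host names, the 90%/10% figures, and the 0.5 spread cutoff are all illustrative assumptions; this is not how DRS actually decides anything, just the behavior I was expecting it to have:

```python
def pick_migration(hosts, spread_threshold=0.5):
    """Return (source_host, target_host) names if the busiest and idlest
    hosts differ in memory utilization by more than spread_threshold,
    otherwise None (cluster considered balanced enough)."""
    util = lambda h: h["mem_used"] / h["mem_total"]
    busiest = max(hosts, key=util)   # candidate source of a migration
    idlest = min(hosts, key=util)    # candidate target
    if util(busiest) - util(idlest) > spread_threshold:
        return busiest["name"], idlest["name"]
    return None

# Illustrative cluster state matching the 90%/10% scenario above
hosts = [
    {"name": "hostA", "mem_used": 90, "mem_total": 100},  # 90% used
    {"name": "hostB", "mem_used": 10, "mem_total": 100},  # 10% used
]
print(pick_migration(hosts))  # ('hostA', 'hostB'), since 0.9 - 0.1 > 0.5
```

Of course, as noted above, always spreading load this way means host B's headroom is consumed in normal operation, so a sudden spike (backup, web crawl) can leave no host with slack.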
Any help with this would be greatly appreciated.