Our environment: two ESX 3.5 U3 servers and vCenter 2.5 U3, configured with HA.
The other day I ran a test on our UPS (which supplies power to all of our servers). The ESX servers run HP UPS agents to enable graceful shutdowns. The issue is that when the ESX servers shut down, HA goes nuts: both servers are shutting down at once, each wants to fail over to the other, and the whole thing is chaos. VMs do not shut down gracefully, etc.
Can somebody suggest how ESX servers with HA should be configured to handle power outages correctly? (Both ESX servers should shut down their VMs gracefully and then shut themselves down gracefully as well.)
HA works on the assumption that at least one of your hosts will be up. You should look at changing the graceful shutdown periods so that one host goes down, the VMs fail over, and then the first has time to come back up before the other shuts down.
Look at getting a third host on a different UPS.
Or move one host to a different UPS.
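If you go the staggered-shutdown route, it helps to check the arithmetic before trusting it. Here is a rough Python sketch of that check. Every name and number in it is made up for illustration (nothing here comes from the HP agent or from VMware), and it pessimistically treats shutdown, failover, and reboot as strictly sequential. Plug in timings you have measured yourself.

```python
from dataclasses import dataclass


@dataclass
class HostTimings:
    """Hypothetical per-host timings, in seconds. Measure your own."""
    shutdown_s: int  # time for the host to shut down once told to
    boot_s: int      # time for the host to boot and rejoin the cluster


def earliest_second_shutdown(first_start_s: int,
                             first: HostTimings,
                             failover_s: int,
                             margin_s: int = 60) -> int:
    """Earliest time (seconds after the UPS event) at which the second
    host may start shutting down, so that the first host has gone down,
    its VMs have failed over, and it has had time to come back up."""
    return (first_start_s + first.shutdown_s + failover_s
            + first.boot_s + margin_s)


def fits_ups_runtime(second_start_s: int,
                     second: HostTimings,
                     ups_runtime_s: int) -> bool:
    """True if the second host finishes shutting down before the
    battery is exhausted."""
    return second_start_s + second.shutdown_s <= ups_runtime_s


if __name__ == "__main__":
    host = HostTimings(shutdown_s=300, boot_s=420)  # illustrative values
    start = earliest_second_shutdown(120, host, failover_s=180)
    print("second host may start shutting down at", start, "s")
    print("fits a 30 min UPS:", fits_ups_runtime(start, host, 1800))
```

If the second check comes back false, the stagger does not fit on one battery, which is another argument for putting the hosts on separate UPSes.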
I had this problem twice in 20 days. The first time, a network change rocked my HA world. A new core switch was put in place and the spanning tree mode was changed, and that change propagated out to the other switches.
I have two clusters, a 2-way and a 6-way. When spanning tree reset there was a 35-second network outage. By default HA waits 15 seconds before a host decides it is isolated, so all of my hosts thought they were isolated and powered off their VMs, which is another HA default. So basically I lost 140+ VMs to a power off, not good at all, especially for Windows servers like Exchange :(. HA freaked out when all of those VMs tried to start back up at the same time, and DRS was going nuts; the VC server was one of those VMs. It was a long outage.
After that event I changed the timeout to 60 seconds and set the HA isolation response to shut down the guests rather than power them off. I will also probably install a dumb switch, not attached to the rest of the network, and put another NIC in each host as a backup service console port group plugged into that dumb switch, in case the main network goes down. I also created some good documentation, which was passed around, about SHUTTING OFF HA when major network changes are going on.
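To make the timeout reasoning concrete, here is a tiny Python sketch of the comparison. The function names are mine, and it models isolation as a simple "outage longer than timeout" test, which simplifies what the HA agent actually does. I believe the HA advanced option behind this timeout on ESX 3.5 is das.failuredetectiontime, given in milliseconds, but treat that as an assumption and check the documentation for your version.

```python
def declares_isolation(outage_s: float, timeout_s: float) -> bool:
    """Simplified model: a host considers itself isolated once the
    network outage outlasts the failure detection timeout."""
    return outage_s > timeout_s


def to_milliseconds(timeout_s: int) -> int:
    """Convert a timeout in seconds to milliseconds, the unit assumed
    here for the das.failuredetectiontime advanced option."""
    return timeout_s * 1000


if __name__ == "__main__":
    outage_s = 35  # the spanning tree outage described above
    for timeout_s in (15, 60):
        print(f"timeout {timeout_s}s ({to_milliseconds(timeout_s)} ms):",
              "isolated" if declares_isolation(outage_s, timeout_s)
              else "rides it out")
```

With the numbers from this thread, a 35-second outage trips the 15-second default but not a 60-second setting. The trade-off is that a real host failure also takes longer to detect.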
The next event was a full-on power outage in that computer room. We brought up some new NetApp SANs, 4 of them, and they pulled too much power; the UPS for the room detected it could not handle the load and shut down, which blew the circuit breaker. That time I knew which host the VC server was on, and I brought it up first when the SAN was ready. I then turned OFF HA, brought all of the ESX hosts up, and once they were up and running turned HA on again.
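That recovery order is worth writing down somewhere people can follow it under stress. As a minimal sketch, here it is as an ordered checklist in Python. The step wording and the helper name are mine, and the "start the VC server VM" step is my reading of "brought it up first", since HA is switched on and off through VC.

```python
from typing import Iterable, Optional

# Recovery order as described above. Step names are illustrative.
RECOVERY_STEPS = [
    "confirm the SAN is ready",
    "boot the ESX host that holds the VC server",
    "start the VC server VM",  # inferred step, see lead-in
    "turn HA off on the cluster",
    "boot the remaining ESX hosts",
    "turn HA back on",
]


def next_step(completed: Iterable[str]) -> Optional[str]:
    """Return the first step not yet completed, or None when done.
    Steps are strictly ordered, so nothing is skipped."""
    done = set(completed)
    for step in RECOVERY_STEPS:
        if step not in done:
            return step
    return None


if __name__ == "__main__":
    print(next_step([]))
    print(next_step(RECOVERY_STEPS[:3]))
```

The point is less the code than the fixed order: storage first, then VC, then HA off before the rest of the hosts come back.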
You've got to have a plan, and you've got to communicate that plan. There is no magic bullet, and these types of events are really not what HA was built for. HA was created for a HOST failure, not a total environment failure. That said, having lived through it a few times, I don't wish to do it again anytime soon.
Best of Luck!