In your situation, HA acted correctly. vCenter seeing a host as not responding is not an HA event. If you were able to ping the service console of the ESX host in question, that means heartbeat was available and no HA event should have been generated. The only way an HA event would have occurred is if that ESX host in question lost service console connectivity. From what you are describing, that didn't happen.
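To make that logic concrete, here is a minimal Python sketch of the decision as described above. It is only a toy model, not VMware's implementation: the names `HostState` and `ha_should_fail_over` are made up for illustration, and real HA has more inputs than this.

```python
from dataclasses import dataclass


@dataclass
class HostState:
    # Hypothetical fields for illustration only; this is not a VMware API.
    vcenter_responding: bool         # does vCenter show the host as responding?
    service_console_reachable: bool  # does the service console still answer heartbeats?


def ha_should_fail_over(host: HostState) -> bool:
    """Toy model: HA reacts only to a lost service console heartbeat.

    vCenter's view of the host is deliberately ignored here.
    """
    return not host.service_console_reachable


# The situation in this thread: vCenter says "not responding",
# but the service console still answers pings.
stuck_host = HostState(vcenter_responding=False, service_console_reachable=True)
print(ha_should_fail_over(stuck_host))  # False
```

In other words, as long as the heartbeat path is up, the model never triggers a failover, no matter what vCenter shows.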
So in that situation you're basically just screwed? Was there any way I might have been able to move the VMs? I wasn't able to open a PuTTY session to the ESX host in question. I was thinking maybe a PowerShell command, but I didn't have it installed on my computer at home when I got called.
Do you have agents installed on your ESX hosts? We have HP hardware with SIM agents, which monitor the hosts and page us on degraded hardware. This gives us a period of time to get the host into maintenance mode to fix the degraded parts.
However, like you said, it's a tough one. If there is network connectivity to your COS, as far as the cluster is concerned there are no problems.
Plus, if vCenter shows it as not responding, you can't vMotion the guests, even with PS. Nothing like getting pushed into a corner, huh?
In short, no, we don't run hardware agents on the hosts. We run IBMs, so I guess they have something out there, but I doubt it's like HP's. I tried to play around with their Director piece a while back, but it wasn't that compatible with ESX at the time. Funny thing is, my director and I had this talk yesterday about hardware monitoring, and I told him we just use the ESX monitoring because it does work; a little slow, but it works.
Definitely backed into a corner, because I have to answer the questions of why this happened and why they did not move off. And you're right, with them not responding I couldn't do anything, not even power the machines off. I tried to change the HA settings on one of the lower VMs to power off on isolation to try to get it shut down, but no luck there either.
I'll check out the logs and see what VMware has to say, maybe go back to looking at what IBM has to offer for agents.
Check into IBM Director again; I believe this is possible. Something within the IMM. We just implemented 16 x3750 M2s, and our IBM guy was saying they have a piece that can detect this type of issue and generate a maintenance mode request to vCenter even if a part is only starting to get errors.
I haven't had time to play with the new gear yet, and I'm not even sure if it's free, but if nothing else, I think you should look into the IMM. By itself, I think it can do some agentless hardware monitoring.