I'm going to open a ticket with VMware, but I'll throw this out to you guys since I usually get better and faster responses here. Last night a host kind of went down, but HA didn't kick in to move the VMs to a different host. The host was grayed out as not connected, and the VMs that were running on it showed as disconnected, so I wasn't able to do much with any of them. I was able to ping the host and the VMs running on the box, and could even initiate an RDP session, but once the RDP session connected it would freeze up.
Our server had a hardware error light on the RAID, so I'm thinking it had something to do with a bad HDD, but we rebooted the host and everything came back up with no hardware errors. Is this one of those cases where the host is "down" but not down enough for HA to kick in?
In your situation, HA acted correctly. vCenter seeing a host as not responding is not an HA event. If you were able to ping the service console of the ESX host in question, that means heartbeat was available and no HA event should have been generated. The only way an HA event would have occurred is if that ESX host in question lost service console connectivity. From what you are describing, that didn't happen.
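To make that reasoning concrete, here is a rough sketch in Python. It is only a simplified model of the rule described above, not VMware's actual HA logic, and the names `HostState` and `ha_should_restart_vms` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class HostState:
    # Hypothetical fields, used only to illustrate the rule in this post.
    vcenter_connected: bool    # does vCenter show the host as connected?
    heartbeat_reachable: bool  # do heartbeats over the service console network still arrive?


def ha_should_restart_vms(host: HostState) -> bool:
    """Simplified rule: HA reacts to lost heartbeats, not to vCenter's view of the host."""
    return not host.heartbeat_reachable


# The host from this thread: grayed out in vCenter, but the service console still pings.
hung_host = HostState(vcenter_connected=False, heartbeat_reachable=True)

# A host that has really dropped off the service console network.
dead_host = HostState(vcenter_connected=False, heartbeat_reachable=False)

print(ha_should_restart_vms(hung_host))  # False: no HA event
print(ha_should_restart_vms(dead_host))  # True: HA restarts the VMs elsewhere
```

Note that `vcenter_connected` never enters the decision, which is the whole point: a host can look dead in vCenter and still be alive as far as HA is concerned.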
So in that situation you're basically just screwed? Was there any way I might have been able to move the VMs? I wasn't able to open a PuTTY session to the ESX host in question. I was thinking maybe a PowerShell command, but I didn't have it installed on my computer at home when I got called.
Do you have agents installed on your ESX hosts? We have HP hardware with SIM agents, which monitor the hosts and page us on degraded hardware. This gives us a period of time to get the host into maintenance mode and fix the degraded parts.
However, like you said, it's a tough one. If there is network connectivity to your COS, as far as the cluster is concerned there are no problems.
Plus, if vCenter shows the host as not responding, you can't vMotion the guests, even with PowerShell. Nothing like getting pushed into a corner, huh?
In short, no, we don't run hardware agents on the hosts. We run IBM servers, so I guess they have something out there, but I doubt it's like HP's. I tried to play around with their Director piece a while back, but it wasn't that compatible with ESX at the time. Funny thing is, my director and I had this talk yesterday about hardware monitoring, and I told him we just use the ESX monitoring because it does work; a little slow, but it works.
Definitely backed into a corner, because I have to answer the questions of why this happened and why the VMs did not move off. And you're right: with them not responding I couldn't do anything, not even power the machines off. I tried to change the HA settings on one of the lower-priority VMs to power off on isolation to try to get it shut down, but no luck there either.
I'll check out the logs and see what VMware has to say, maybe go back to looking at what IBM has to offer for agents.
Check into IBM Director again; I believe this is possible. It's something within the IMM. We just implemented 16 x3750 M2s, and our IBM guy was saying they have a piece that can detect this type of issue and generate a maintenance mode request to vCenter as soon as a part starts to get errors.
I haven't had time to play with the new gear yet, and I'm not even sure if it's free, but if nothing else, I think you should look into the IMM. I think that by itself it can do some agentless hardware monitoring.