The annoying thing is that my dev environment worked fine: the host showed Not Responding and HA worked, whereas production didn't.
The cause of the outage was a power failure which powered down the switch.
The production host is a 3850 M2 and the development hosts are HS21 blades in a BladeChassis.
What would cause the server to show disconnected?
Due to flooding here I am unable to get to work and check power redundancy, or whether the whole server room went down or just the switch.
How is the storage configured on the server that's behind the failed switch? Is it also connected through that failed switch (iSCSI or NFS), or is it Fibre Channel attached?
If it's fibre attached and the VMs were still running on that host, the other hosts could not power on the VMs because the files would still be locked.
What is the isolation response in your cluster's HA settings? Leave powered on or power off?
If they are set to leave powered on, then only a physical drop of the server or of its SAN connection would allow an HA event to be successful (since the file locks would then be released). Otherwise the VMs keep running isolated, and no other host can power them on.
This is probably a good event to have happen, in that it caused an outage but not one serious enough to make your life hell. It should give you more reason to push for redundant switches on the production side, so that a single switch failure doesn't bring down the environment.
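The locking argument above can be sketched in a few lines of Python. This is only a toy model of the reasoning in this post, not VMware's actual HA logic; the `HostState` class, its field names, and the `"leave_powered_on"` / `"power_off"` strings are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class HostState:
    # All fields are hypothetical names used only for this illustration.
    powered_on: bool           # is the physical host still running?
    network_reachable: bool    # can the other hosts reach it over Ethernet?
    storage_reachable: bool    # does it still have its SAN connection?
    isolation_response: str    # "leave_powered_on" or "power_off"


def locks_released(host: HostState) -> bool:
    """Return True if the VM file locks held by this host would be released."""
    if not host.powered_on:
        # Host is physically down, so nothing is holding the locks.
        return True
    if not host.storage_reachable:
        # Host lost its SAN connection, so it can no longer hold the locks.
        return True
    if not host.network_reachable and host.isolation_response == "power_off":
        # Isolated host powers its own VMs off, which frees the files.
        return True
    # Host is up with storage attached and its VMs still running.
    return False


def can_restart_elsewhere(host: HostState) -> bool:
    """Other hosts can only power the VMs on once the locks are gone."""
    return locks_released(host)


# Scenario from this thread: host up, switch down, fibre SAN fine,
# isolation response set to leave powered on.
isolated = HostState(True, False, True, "leave_powered_on")
print(can_restart_elsewhere(isolated))
```

With the values from this thread the function returns `False`: the host keeps its locks, so the other hosts have nothing they can restart. Changing `isolation_response` to `"power_off"` in the same scenario flips the result.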
A host showing disconnected/not responding in vCenter does not constitute an HA event. vCenter is not needed for HA, except for the initial configuration. Are you sure there was an HA event? Did the ESX(i) host in question actually go down, unplanned, and were there guests running on it?
There were around 30 VMs on the host, and neither the host nor the VMs were pingable or accessible. Even if the host was still powered on, the switch connecting it to the outside world was down, so to all intents and purposes the system was down.
Regards,
David
I have been able to get in and check. The host did not go down but the Ethernet switch did. Neither of the other hosts in the cluster could contact the host, so doesn't that mean that HA should have kicked in?
No.
HA will kick in if your host fails.
If the switch goes down, all of your hosts are affected, so moving your guests will not help.
Sorry, I should have clarified. The switch outage only affected one host. The other hosts are in a different datacentre and were not affected. The 3850 host did not power down but had no ethernet connectivity.
The entire datacentre lost ethernet connectivity for over 2 hours.
I don't think redundant switches would have helped. Looks like they lost all comms.
Thanks for that Rumple. The SAN was fibre and not affected and the VMs were set to leave powered on.
Just for reference, I prefer leave powered on, in case I accidentally reset the stupid VMware management service and that old bug comes back that thinks the VMs were isolated and shuts the damn things down in the middle of the day (again).
That wasn't a good day, let me tell you...
But if you are going to leave them set to leave powered on, you need a good redundant switch architecture or it will bite you just like it did (which is almost preferable to causing the outage yourself, I suppose).