VMware Cloud Community
Dave_McD
Contributor

I had an outage and even though my host was powered off, it only showed disconnected and HA didn't kick in.

The annoying thing is, my Dev environment worked fine, showing Not Responding and HA worked, whereas my production didn't.

The cause of the outage was a power failure which powered down the switch.

The production host is a 3850 M2 and the development hosts are HS21 blades in a BladeChassis.

What would cause the server to show disconnected?

Due to flooding here I am unable to get to work and check power redundancy, or whether the whole server room went down or just the switch.


Accepted Solutions
Rumple
Virtuoso

How is the storage configured on the server that's behind the failed switch? Is it also configured off that failed switch (iSCSI or NFS), or is it fiber attached?

If it's fiber attached and the VMs were still running, the other hosts could not power on the VMs because the files would still be locked.

What is the configuration of your HA settings on the cluster?  Leave powered on or power off?

If they are set to leave powered on, then only a physical drop of the server or SAN connection would allow an HA event to be successful (since the file locks would then be released). Otherwise the VMs are running isolated, but no other host can power them on.

This is probably a good event to have happened, in that it caused an outage but not one serious enough to make your life hell. It should give you more reason to push for redundant switches on the production side, so that a single switch failure doesn't bring down the environment.
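
To make the reasoning above easier to follow, here is a rough sketch of it as a decision function. It only models the logic described in this post (isolation response plus file locks); it is not VMware's actual HA algorithm, and the names `IsolationResponse` and `ha_outcome` are made up for illustration.

```python
from enum import Enum


class IsolationResponse(Enum):
    # Hypothetical names for the two cluster settings discussed in this post.
    LEAVE_POWERED_ON = "leave powered on"
    POWER_OFF = "power off"


def ha_outcome(host_up: bool, network_up: bool, storage_up: bool,
               response: IsolationResponse) -> str:
    """Simplified model of the reasoning in this post, not VMware's algorithm."""
    if not host_up:
        # The host is really down: its file locks are released,
        # so another host can power the VMs on.
        return "restarted on another host"
    if network_up:
        # Nothing is isolated, so there is nothing for HA to do.
        return "no HA event"
    # From here on the host is alive but cut off from the network.
    if response is IsolationResponse.POWER_OFF:
        # The isolated host powers its VMs off, which releases the locks.
        return "restarted on another host"
    if storage_up:
        # The VMs keep running and keep their locks; no other host can start them.
        return "running isolated, not restarted"
    # Storage is gone as well, so the locks are no longer held.
    return "restarted on another host"


# Host still up, switch down, fiber storage still reachable, leave powered on.
print(ha_outcome(True, False, True, IsolationResponse.LEAVE_POWERED_ON))
```

With the host still up, the network down, fiber storage reachable and the response set to leave powered on, the sketch lands in the "running isolated" branch, which is the case where no other host can take the VMs over.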

9 Replies
Troy_Clavell
Immortal

A disconnected/not responding host in vCenter does not constitute an HA event. vCenter is not needed for HA, except for the initial configuration. Are you sure there was an HA event? Did the ESX(i) host in question actually go down, unplanned, with guests running on it?

Dave_McD
Contributor

There were around 30 VMs on the host, and neither the host nor the VMs were pingable or accessible. Even if the host was still powered on, the switch connecting it to the outside world was down, so to all intents and purposes the system was down.

Regards,

David

Dave_McD
Contributor

I have been able to get in and check. The host did not go down but the Ethernet switch did. Neither of the other hosts in the cluster could contact the host, so doesn't that mean that HA should have kicked in?

AureusStone
Expert

No.

HA will kick in if your host fails.

If the switch goes down, all of your hosts are affected, so moving your guests will not help.

Dave_McD
Contributor

Sorry, I should have clarified. The switch outage only affected one host. The other hosts are in a different datacentre and were not affected. The 3850 host did not power down but had no ethernet connectivity.

The entire datacentre lost ethernet connectivity for over 2 hours.

AureusStone
Expert

I don't think redundant switches would have helped.  Looks like they lost all comms.

Dave_McD
Contributor

Thanks for that Rumple. The SAN was fibre and not affected and the VMs were set to leave powered on.

Rumple
Virtuoso

Just for reference, I prefer leave powered on, in case I accidentally reset the stupid VMware management service and that old bug comes back that thinks the VMs were isolated and shuts the damn things down in the middle of the day (again).

That wasn't a good day let me tell you...

But if you are going to leave them set to leave powered on, you need a good redundant switch architecture or it will bite you just like it did (which is almost preferable to causing the outage yourself, I suppose).
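
If you want a quick way to spot hosts that would be cut off by a single switch failure, something along these lines might help. It is only a sketch: the inventory dictionary, the host, NIC and switch names, and the function `hosts_with_single_switch` are all made up for illustration, and you would have to fill in the uplink mapping from your own documentation or tooling.

```python
def hosts_with_single_switch(uplinks: dict[str, dict[str, str]]) -> list[str]:
    """Return hosts whose uplinks all land on one physical switch.

    `uplinks` maps host name -> {NIC name: switch name}. This layout is a
    hypothetical inventory format, not something exported by vCenter.
    """
    at_risk = []
    for host, nic_to_switch in uplinks.items():
        # Fewer than two distinct switches means one switch failure isolates the host.
        if len(set(nic_to_switch.values())) < 2:
            at_risk.append(host)
    return sorted(at_risk)


# Made-up example inventory.
example_uplinks = {
    "host-a": {"vmnic0": "switch-1", "vmnic1": "switch-1"},
    "host-b": {"vmnic0": "switch-1", "vmnic1": "switch-2"},
}

print(hosts_with_single_switch(example_uplinks))  # prints ['host-a']
```

Here host-a has two NICs but both go to the same switch, so it is flagged, while host-b is spread over two switches and is not.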
