scooter_
Contributor

Testing DAS/NIC Connection Failure in HA Cluster

I have two hosts in a cluster and I'm testing HA failures. I've successfully tested a machine failure by unplugging a host: all of the VMs from the dead host transfer to the remaining host and boot up. I can then migrate the VMs back to the original server.

Now I'd like to test a failure of the DAS connection. I can successfully fail a single SAS cable connection to the DAS, but when I yank both SAS cables I don't get the results I would expect: no warnings on the cluster, no warnings on the host. Going into the host and looking at Storage Adapters, everything still looks connected. If I hit Refresh, though, everything changes to "Loading..." and stays there indefinitely. If I switch away and go back, the status shows all good for the HBA, though Refresh is no longer available to click.

The VMs on the host are dead, as I cannot connect to them.

I plug one SAS cable back in and things come back alive. The VMs didn't get rebooted; they are just available again.

The Host Summary page on BOTH hosts now has a warning: the number of vSphere HA Heartbeat Datastores for this host is 0, which is less than the required: 2

At some point the Warnings go away and all is good in the world.

This is my first cluster, so I'm new at this. I would have expected that, after several minutes with no DAS connection, the other host in the cluster would have picked up the VMs from the failed host.

I have yet to test NIC failure, though I have a feeling it will go the same way...

What is going wrong here?

Scott<-

vmroyale
Immortal

Note: Discussion successfully moved from VMware ESXi 5 to Availability: HA & FT

Brian Atkinson | vExpert | VMTN Moderator | Author of "VCP5-DCV VMware Certified Professional-Data Center Virtualization on vSphere 5.5 Study Guide: VCP-550" | @vmroyale | http://vmroyale.com
aravinds3107
Virtuoso

You are seeing exactly what is expected. HA restarts the VMs only in case of a host failure, NOT when a host loses its connection to the storage.

Take a look at the HA Deepdive page to learn more about HA.

If you find this or any other answer useful please consider awarding points by marking the answer correct or helpful |Blog: http://aravindsivaraman.com/ | Twitter : ss_aravind
scooter_
Contributor

Thank you for your reply.

I've read through the HA Deepdive. So you are saying that a host that gets disconnected from its datastore won't fail over if it still has a network connection to the cluster?

I'm also unclear as to what happens when a host is isolated. It seems that if a host is isolated (cannot reach the gateway), it would not restart the VMs of the hosts in the cluster, though if a host is partitioned, it would restart the VMs if it could not reach the other hosts?

I'm trying to come up with a Virtual Environment Redundancy Failover Test Plan outlining different failures. I was thinking the common failures were Host/Power, SAS Partial, SAS Complete, NIC Partial, and NIC Complete. I'm not sure how I would document the datastore access failure, as nothing failed over.

2  Host Power/Hardware Failure
  2.1  How to initiate the failure
  2.2  What to expect from the failure
  2.3  How to recover from the failure

3  DAS Single SAS Cable Failure
  3.1  How to initiate the failure
  3.2  What to expect from the failure
  3.3  How to recover from the failure

4  DAS Dual SAS Cable Failure
  4.1  How to initiate the failure
  4.2  What to expect from the failure
  4.3  How to recover from the failure

5  Single NIC Cable Failure
  5.1  How to initiate the failure
  5.2  What to expect from the failure
  5.3  How to recover from the failure

6  Multiple NIC Cable Failure
  6.1  How to initiate the failure
  6.2  What to expect from the failure
  6.3  How to recover from the failure
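
One way to keep a plan like this consistent is to record each scenario as data. The following is only a minimal sketch, not a VMware tool: the `Scenario` type and its field names are made up for illustration, and the expected and recovery entries contain only what has been reported in this thread. Anything not yet tested is left as `None`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Scenario:
    """One row of the failover test plan (hypothetical structure)."""
    section: int
    name: str
    initiate: str
    expected: Optional[str]  # None means "not yet observed in testing"
    recover: Optional[str]   # None means "not yet documented"


PLAN: List[Scenario] = [
    Scenario(2, "Host Power/Hardware Failure",
             "Unplug the host",
             "VMs restart on the remaining host",
             "Migrate the VMs back to the original server"),
    Scenario(3, "DAS Single SAS Cable Failure",
             "Pull one SAS cable",
             "No failover; storage stays reachable over the other cable",
             None),
    Scenario(4, "DAS Dual SAS Cable Failure",
             "Pull both SAS cables",
             "VMs become unreachable; HA does not restart them elsewhere",
             "Reconnect a SAS cable; VMs resume without a reboot"),
    Scenario(5, "Single NIC Cable Failure", "Pull one NIC cable", None, None),
    Scenario(6, "Multiple NIC Cable Failure", "Pull all NIC cables", None, None),
]


def untested(plan: List[Scenario]) -> List[str]:
    """Return the names of scenarios with no observed result yet."""
    return [s.name for s in plan if s.expected is None]


if __name__ == "__main__":
    for name in untested(PLAN):
        print("Still to test:", name)
```

Recording "nothing failed over" as the expected result for the dual SAS case is a legitimate test outcome, so it can be documented like any other row.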

Scott,<-

depping
Leadership

Indeed, if there is still a network heartbeat, the workloads will not fail over.
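
To make that concrete, here is a rough sketch of the master's liveness check as the vSphere Availability documentation describes it: a slave is declared failed only when network heartbeats, ICMP ping replies, and datastore heartbeats are all missing. The function and state names are invented for illustration, and this is a simplified model rather than the actual HA agent logic. Host isolation is a separate path and is not modelled here.

```python
def classify_slave(network_heartbeat: bool,
                   responds_to_ping: bool,
                   datastore_heartbeat: bool) -> str:
    """Simplified view of how a master might classify a slave host."""
    if network_heartbeat:
        # The agent is still talking to the master, so the host counts as
        # alive regardless of what has happened to its storage.
        return "live"
    if datastore_heartbeat:
        # No network heartbeat, but the host still updates its heartbeat
        # datastore: treated as partitioned or isolated, not failed.
        return "partitioned_or_isolated"
    if responds_to_ping:
        # Label used by this sketch only: the host answers pings but the
        # agent is silent on both network and datastore.
        return "agent_unreachable"
    return "failed"


def master_restarts_vms(state: str) -> bool:
    """In this model, only a failed host triggers a restart by the master."""
    return state == "failed"


if __name__ == "__main__":
    # Dual SAS cable pull: network is fine, datastore heartbeats are gone.
    state = classify_slave(network_heartbeat=True,
                           responds_to_ping=True,
                           datastore_heartbeat=False)
    print(state, master_restarts_vms(state))
```

In the dual SAS cable test the network heartbeat never stops, so the first branch is taken and nothing is restarted.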

aravinds3107
Virtuoso

HA will not fail over VMs to the other host in the cluster if a host loses its connection to the datastore.

I hope you are clear on network partitions. In case of a partition, an election process is initiated to identify whether the host has failed or is isolated. If the host is not communicating through datastore heartbeating, it will be declared failed and its VMs will be restarted.

depping
Leadership

Aravind Sivaraman wrote:

I hope you are clear on network partitions. In case of a partition, an election process is initiated to identify whether the host has failed or is isolated. If the host is not communicating through datastore heartbeating, it will be declared failed and its VMs will be restarted.

Huh? I think you are mixing up several concepts... Even if datastore heartbeats are still being issued, VMs can still be restarted if and when the host has declared itself isolated and reports this.
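
A sketch of the host-side view may help separate the concepts. It assumes the behaviour described in the HA deep dive: a host declares itself isolated only when it sees no traffic from other HA agents and cannot ping its isolation address, and its VMs are restarted elsewhere only if the configured isolation response powers them off or shuts them down. All names below are illustrative, not a VMware API.

```python
from enum import Enum


class IsolationResponse(Enum):
    LEAVE_POWERED_ON = "leave powered on"
    POWER_OFF = "power off"
    SHUT_DOWN = "shut down"


def host_self_state(hears_master: bool,
                    sees_other_agents: bool,
                    pings_isolation_address: bool) -> str:
    """Simplified view of how a host might judge its own network state."""
    if hears_master:
        return "connected"
    if sees_other_agents or pings_isolation_address:
        # Cut off from the master but not from the network as a whole.
        # A host that reaches only the isolation address is treated here
        # as a partition of one (an assumption of this sketch).
        return "partitioned"
    return "isolated"


def vms_restarted_elsewhere(state: str, response: IsolationResponse) -> bool:
    """An isolated host's VMs move only if the response stops them first."""
    return (state == "isolated"
            and response is not IsolationResponse.LEAVE_POWERED_ON)


if __name__ == "__main__":
    state = host_self_state(False, False, False)
    for response in IsolationResponse:
        print(state, response.value, vms_restarted_elsewhere(state, response))
```

Datastore heartbeats do not appear in this function at all, which is the point: they help the master tell a dead host from an unreachable one, while isolation is something the host decides and reports itself.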
