I'm at the tail end of setting up a new cluster for a customer, and while planning some cable-pull testing I ran into some unexpected behaviour, so I'm hoping to get some insight into it.
Before I explain the issue, I feel it's best to understand the cluster layout (it's a standard VxRail setup), which is as follows:
Now, I decided to test the vSphere HA behaviour when losing vSAN networking, as I know that HA heartbeat is over the vSAN network, but wasn't 100% sure what the VM shutdown/failover actions would be.
To simulate this remotely, I logged into vCenter, ensured a single test VM was on Host 1, and then removed both physical adapters from dvSwitch B on that host. I then observed the following (let it run for 30 minutes to be sure):
When I re-added the physical adapters to dvSwitch B, everything was fine (except for needing to reset VM alarms to green). I then performed the same test with the same VM on the other hosts one at a time, and observed the same behaviour.
So I was expecting to see an HA event on the single host I isolated, since the HA heartbeats go over the vSAN network, bearing in mind that the VM may not necessarily be on the exact host I'm "failing" at any given point in time.
Is this behaviour normal, and I'm misunderstanding how HA should work in this scenario, or is something up with my config? I'm stumped!
I think you might have it, depping.
I attended the DC yesterday for the physical cable pull tests, and was wondering if perhaps it was happening because of the method I was using to simulate the outage vs. actual physical NIC pull.
Updated the isolation response to "Power Off" and gave it a go, but still the same result.
When I checked, however, the das.usedefaultisolationaddress and das.isolationaddress0 parameters were not set. Looks like they didn't get set when Dell set up the VxRail for us.
We'll be adding it shortly, and will let you know the results.
If those are not configured, the default gateway is used as the isolation address. If that can somehow still be reached (over any network), the isolation response is not triggered. So you need to fill out those two parameters for sure!
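To make the logic above concrete, here is a rough Python sketch of the isolation decision as described in this thread. It is a simplified model for illustration only, not VMware's actual implementation: the function names and the IP addresses are hypothetical, and the only real identifiers are the two advanced option names already mentioned (das.usedefaultisolationaddress and the das.isolationaddressN family).

```python
def isolation_addresses(advanced_options, default_gateway):
    """Build the list of addresses a host would ping to check for isolation.

    advanced_options: dict of HA advanced option name -> value (strings).
    default_gateway:  management network default gateway (hypothetical value).
    """
    addresses = []
    # Assumption: when the option is unset, the default gateway is used.
    use_default = str(
        advanced_options.get("das.usedefaultisolationaddress", "true")
    ).lower() == "true"
    if use_default:
        addresses.append(default_gateway)
    # das.isolationaddress0, das.isolationaddress1, ... add extra addresses.
    for index in range(10):
        key = f"das.isolationaddress{index}"
        if key in advanced_options:
            addresses.append(advanced_options[key])
    return addresses


def is_isolated(sees_ha_traffic, addresses, reachable):
    """Simplified rule: a host is isolated only if it has lost HA traffic
    AND cannot reach any of its isolation addresses."""
    if sees_ha_traffic:
        return False
    return not any(address in reachable for address in addresses)


# Hypothetical addresses for the example.
MGMT_GATEWAY = "192.168.10.1"   # still reachable after the vSAN uplinks are removed
VSAN_ADDRESS = "172.16.20.1"    # only reachable over the vSAN network

# vSAN uplinks removed: HA traffic is lost, but management is still up.
reachable_now = {MGMT_GATEWAY}

# Nothing configured: the default gateway still answers, so no isolation.
unset = isolation_addresses({}, MGMT_GATEWAY)
print(is_isolated(False, unset, reachable_now))

# Both parameters set: only the vSAN address is checked, so isolation triggers.
configured = isolation_addresses(
    {
        "das.usedefaultisolationaddress": "false",
        "das.isolationaddress0": VSAN_ADDRESS,
    },
    MGMT_GATEWAY,
)
print(is_isolated(False, configured, reachable_now))
```

With nothing configured the model reports the host as not isolated, which matches the behaviour seen in the test; with both parameters set it reports isolation, which is when the isolation response would be expected to fire.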
Yep, that's my understanding. Waiting on the client to get an IP for me to use for isolation pinging, and hopefully we'll be good to go. A third-party provider manages their network, so I'm waiting on a change to go through the pipeline (we're looking at an IP on the router for the subnet/VLAN, with an ACL in place allowing only pings from inside the subnet and no other traffic).