VMware Cloud Community
dstuehb
Contributor

vSAN & vMotion Network Down and vSphere HA Behaviour

Hi All,

I'm at the tail end of setting up a new cluster for a customer, and while planning some cable-pull testing I ran into some unexpected behaviour, so I'm hoping to get some insight into it.

Before I explain the issue, I feel it's best to understand the cluster layout (it's a standard VxRail setup), which is as follows:

  • Versions (older due to compatibility matrices at the time of deployment, not updated yet):
    • vCenter 7 U3d
    • ESXi 7 U3d
  • 5 x VxRail hosts
    • 4 x 25GbE NICs
    • Flash & Spindle for vSAN
  • 1 x Cluster
    • vSAN configured
      • Using a storage policy for R5 storage, 1 x stripe, thin, dedupe & compression enabled
    • DRS enabled, fully automated, level 3, default settings
    • vSphere HA enabled with
      • Host Failure Response: Restart VMs
      • Response for Host Isolation: Shut down and restart VMs
      • Datastore with PDL: Power off & restart VMs
      • Datastore with APD: Power off & restart VMs - Conservative
      • Admission control, 1 x host failure tolerated, cluster resource percentage (auto), auto heartbeat datastores
  • 6 x datastores
    • 1 x vSAN in cluster
    • 5 x local to hosts and unused
  • 2 x dvSwitch
    • dvSwitch A - 2 x uplinks, 4 x vDistributed Port Group
      • vDPG A = Management VLAN & VMkernel
      • vDPG B = VxRail VLAN & VMkernel
      • vDPG C = VCSA VM (same VLAN as mgmt)
      • vDPG D = Production VMs VLAN
    • dvSwitch B - 2 x uplinks, 2 x vDPG
      • vDPG E = vMotion VLAN & VMkernel (uplink failover order has Uplink 1 active, 2 standby)
      • vDPG F = vSAN VLAN & VMkernel (uplink failover order has Uplink 2 active, 1 standby)
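As a quick aside on the admission-control setting above: my understanding is that the auto "cluster resource percentage" mode derives the reserved failover capacity from the number of tolerated host failures, roughly like this (illustrative arithmetic only, assuming identically sized hosts):

```python
# Illustrative: how HA's auto "cluster resource percentage" admission
# control derives reserved capacity from the tolerated host failures
# (assuming homogeneous hosts; not vCenter's actual code).

def reserved_capacity_percent(hosts_in_cluster: int,
                              host_failures_tolerated: int) -> float:
    """Percentage of cluster CPU/memory HA sets aside for failover."""
    return 100.0 * host_failures_tolerated / hosts_in_cluster

# This 5-host cluster tolerating 1 host failure reserves 20% of
# cluster CPU and 20% of cluster memory for HA failover capacity.
print(reserved_capacity_percent(5, 1))   # -> 20.0
```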

Now, I decided to test the vSphere HA behaviour when losing vSAN networking, as I know that HA heartbeating runs over the vSAN network, but I wasn't 100% sure what the VM shutdown/failover actions would be.

To simulate this remotely, I logged into vCenter, ensured a single test VM was on Host 1, and then removed both physical adapters from dvSwitch B on that host. I then observed the following (let it run for 30 minutes to be sure):

  • Expected:
    • Host 1 was still accessible for management, and went into vSAN cluster partition 2, others remained in 1
    • vSAN resync timer kicked in
    • Various alarms in vCenter for the cluster and hosts regarding loss of connectivity to a vSAN host, HA, etc.
    • Various alarms on Host 1 regarding loss of network redundancy etc.
  • Unexpected:
    • The VM on Host 1 remained running, and did not power off and failover to any other host
      • Strangely, it was accessible the entire time and responded to pings, but vCenter constantly cycled the VMware Tools status between running and not running
      • Also strangely, when looking at the VMs tab for Host 1, the VM occasionally disappeared, but it reappeared immediately with a refresh
    • All of the VMs which were not on Host 1 threw an alarm stating "vSphere HA virtual machine failover failed"

When I re-added the physical adapters to dvSwitch B, everything was fine (except for needing to reset VM alarms to green). I then performed the same test with the same VM on the other hosts one at a time, and observed the same behaviour.

So I was expecting to see an HA event on the single host I isolated, given that the HA heartbeats go over the vSAN network; after all, in a real failure the VM won't necessarily be on the exact host that goes down at that point in time.

Is this behaviour normal, and I'm misunderstanding how HA should work in this scenario, or is something up with my config? I'm stumped!

Cheers

6 Replies
muakhtar
Enthusiast

Hi,

Kindly check the issue with the VxRail support team.

Munib Akhtar
VCP-DCV/VCP-DTM/VXRAIL
Please mark as helpful or correct if my answer was useful to you
depping
Leadership

What is the Isolation Address you configured? And did you disable the default isolation address?

depping
Leadership

Also, the isolation response should be "Power off", as "Shut down" will not work anyway when access to storage is lost.
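To illustrate the point, here's a simplified sketch of the two outcomes (this is a model, not VMware's implementation; the 300-second default for das.isolationshutdowntimeout is from memory):

```python
# Illustrative sketch (not VMware code) of why "Shut down" is a poor
# isolation response when access to storage is lost.

DAS_ISOLATION_SHUTDOWN_TIMEOUT = 300  # seconds; HA's default guest-shutdown timeout

def apply_isolation_response(response: str, storage_accessible: bool) -> str:
    """Model the outcome of the two isolation responses for one VM."""
    if response == "power_off":
        # A hard power off needs no guest OS or storage cooperation.
        return "powered off immediately"
    if response == "shutdown":
        if storage_accessible:
            return "guest shut down cleanly"
        # The guest OS cannot flush I/O to its inaccessible vSAN objects,
        # so the clean shutdown stalls until HA's timeout forces a power off.
        return (f"shutdown stalled; hard power off after "
                f"{DAS_ISOLATION_SHUTDOWN_TIMEOUT}s timeout")
    raise ValueError(f"unknown response: {response}")

print(apply_isolation_response("power_off", storage_accessible=False))
print(apply_isolation_response("shutdown", storage_accessible=False))
```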

dstuehb
Contributor

I think you might have it, depping.

I attended the DC yesterday for the physical cable-pull tests, and wondered whether the behaviour was down to the method I was using to simulate the outage rather than an actual physical NIC pull.

Updated the isolation response to "Power Off" and gave it a go, but still the same result.

When I checked, however, the das.usedefaultisolationaddress and das.isolationaddress0 parameters are not set. It looks like they don't get set when Dell deploys the VxRail for us.

We'll be adding it shortly, and will let you know the results.
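For reference, the two advanced options we'll be setting look like this (the address is a placeholder until the client provides one; my understanding is it should be a pingable IP on the vSAN VLAN, since that's the network HA heartbeats over here):

```
das.usedefaultisolationaddress = false
das.isolationaddress0 = <pingable IP on the vSAN VLAN>
```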

depping
Leadership

If those are not configured, then the default gateway is used; if that can somehow still be reached (over any network), the isolation response is not triggered. So you need to fill out those two parameters for sure!
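A simplified sketch of that selection logic (illustrative only, not FDM's actual code; the IPs are made-up examples):

```python
# Illustrative model (not FDM source) of how a host assembles its
# isolation addresses and decides whether it is isolated.

def candidate_isolation_addresses(advanced_options: dict,
                                  default_gateway: str) -> list:
    """Collect the addresses an unreachable host would try to ping."""
    addrs = [v for k, v in sorted(advanced_options.items())
             if k.startswith("das.isolationaddress")]
    # Unless explicitly disabled, the management default gateway is
    # always included as an isolation address.
    if advanced_options.get("das.usedefaultisolationaddress", "true") != "false":
        addrs.append(default_gateway)
    return addrs

def is_isolated(addresses: list, reachable: set) -> bool:
    """A host declares isolation only if NO candidate address responds."""
    return not any(a in reachable for a in addresses)

# The scenario in this thread: no custom addresses configured, so only
# the mgmt gateway is tried -- and it is still reachable over the intact
# management network, so the isolation response never fires.
print(is_isolated(candidate_isolation_addresses({}, "10.0.0.1"),
                  reachable={"10.0.0.1"}))   # -> False (not isolated)

# With an isolation address on the vSAN VLAN and the default disabled,
# pulling the vSAN uplinks makes that address unreachable -> isolated.
opts = {"das.usedefaultisolationaddress": "false",
        "das.isolationaddress0": "192.168.50.1"}
print(is_isolated(candidate_isolation_addresses(opts, "10.0.0.1"),
                  reachable={"10.0.0.1"}))   # -> True (isolated)
```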

dstuehb
Contributor

Yep, that's my understanding. I'm waiting on the client to get me an IP to use for isolation pinging, and hopefully we'll then be good to go. A third-party provider manages their network, so I'm waiting on a change to go through the pipeline (we're looking at an IP on the router for the subnet/VLAN, with an ACL allowing only pings from inside the subnet and no other traffic).
