kmcd03
Contributor

vSAN stretched cluster preferred/secondary fault domain designation out of sync with storage policy's preferred/secondary affinity rule

Has anyone seen a problem where changing the Preferred Fault Domain designation in a stretched vSAN cluster causes the Affinity rule for Preferred/Secondary in a Storage Policy to become out of sync?

I have a new vSAN 6.6 stretched cluster with 6 hosts in one data center (DC-North) and 6 hosts in another (DC-South).  During the initial install I designated DC-South as the Preferred Fault Domain. 

I created a storage policy for each DC with Primary Failures To Tolerate (PFTT) = 0.  The DC-South policy had PFTT=0 and the Affinity rule set to "Preferred Fault Domain", and the DC-North policy had PFTT=0 and Affinity=Secondary.
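For anyone less familiar with the PFTT=0 + site affinity combination, the intent can be sketched in a few lines of Python. This is purely illustrative (the policy names and dictionary layout are made up, not a vSAN or SPBM API):

```python
# Hypothetical sketch of site-local placement with PFTT=0 + affinity.
# Names and structures are illustrative only, not a vSAN/SPBM API.

policies = {
    "DC-South-Local": {"pftt": 0, "affinity": "Preferred"},
    "DC-North-Local": {"pftt": 0, "affinity": "Secondary"},
}

def target_fault_domain(policy, preferred_fd, secondary_fd):
    """Resolve which fault domain a PFTT=0 policy keeps all objects in."""
    if policy["pftt"] != 0:
        raise ValueError("site-local placement only applies with PFTT=0")
    return preferred_fd if policy["affinity"] == "Preferred" else secondary_fd

# With DC-South designated Preferred at install time:
print(target_fault_domain(policies["DC-South-Local"], "DC-South", "DC-North"))  # -> DC-South
```

The key point is that the policy only says "Preferred" or "Secondary"; which physical site that resolves to depends entirely on the cluster-level designation.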

Last week I changed the preferred designation in the "Fault Domain & Stretched Cluster" section so hosts in the DC-North fault domain are now Preferred.

I also updated the Affinity rule in the storage policies.  So DC-North storage policy is Affinity=Preferred and DC-South Affinity=Secondary.

I re-applied the policies to the VMs, but the VMs are storing their objects in the wrong fault domain.  E.g. a VM runs on hosts in the Preferred fault domain (DC-North), but its objects are stored on disks at the Secondary (DC-South), and vice versa for VMs running at Secondary but stored at Preferred.

I've created new storage policies, cloned and updated the originals, built new VMs with both new and existing policies, etc.  But the Affinity rule for Preferred/Secondary in the Storage Policy does not match the vSAN cluster setting for Preferred/Secondary.

I have a ticket open with GSS.  Thought I would ask communities if this has been seen before.

Thanks!

5 Replies
TheBobkin
VMware Employee (Accepted Solution)

Hello kmcd03

Had mixed luck reproducing this in a 3+3+1 HOL cluster, but it looks like you may be onto something.

I have a sneaky 5-second workaround that does appear to resolve it where I tested it (as it does with other SPBM scenarios where vSAN states an Object is compliant with the SP when it clearly is not):

Edit the SP and add a rule that doesn't change the structure of the LSOM components (e.g. IOPS Limit=0). You should be prompted to apply the change now or manually later: apply it now if you want to hit all VMs at once (careful), or apply it manually VM by VM (or a few at a time in batches from Cluster > VMs > select a VM, shift-click to select more, then Edit VM Storage Policies).

I can only call this 'non-disruptive' if you are aware of, and expecting, data being copied over to the other site and are prepared for it (e.g. don't do it all at once during cluster peak hours while running back-ups with inadequate space per site). You can of course remove the rule (e.g. IOPS Limit=0) afterwards.

Bob

kmcd03
Contributor

Thanks for the reply.  Unfortunately the workaround didn't fix the problem.  I changed the storage policy by adding IOPS Limit = 0 and applied the update, but the VM objects' location does not move/change.

If I change the Affinity Preferred/Secondary back and forth, the objects move between hosts in the fault domains.  But the Preferred/Secondary designation in the Storage Policy Affinity is the opposite of the setting observed in vCenter.

I've also run the command "esxcli vsan cluster preferredfaultdomain get" on the hosts and witness, and all report the same preferred fault domain as the vCenter Server Web Client.
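That cross-host check can be scripted if the cluster is large. The sketch below only parses stubbed text approximating what "esxcli vsan cluster preferredfaultdomain get" prints (host names and values here are made up for illustration; collecting the real output over SSH is left out):

```python
# Sketch: verify every host (and the witness) reports the same preferred
# fault domain. The sample text approximates esxcli output; the host
# names and fault domain values are made up for illustration.

def parse_preferred_fd(output):
    """Pull the fault domain name out of the command output."""
    for line in output.splitlines():
        if line.strip().startswith("Preferred Fault Domain Name:"):
            return line.split(":", 1)[1].strip()
    return None

outputs = {
    "esx-north-01": "   Preferred Fault Domain Name: DC-North",
    "esx-south-01": "   Preferred Fault Domain Name: DC-North",
    "witness-01":   "   Preferred Fault Domain Name: DC-North",
}

names = {parse_preferred_fd(out) for out in outputs.values()}
assert len(names) == 1, f"hosts disagree on preferred fault domain: {names}"
print(f"all hosts agree on preferred fault domain: {sorted(names)[0]}")
```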

This is a new environment with just a few test and non-prod VMs running, so I have an opportunity to try things.  I've been doing some failure testing of this new cluster prior to moving legacy production VMs.  We simulated a cut of the 10Gb connection between data centers used by vSAN.  This test did work previously, but last week I changed the Preferred Fault Domain designation and discovered this problem.

Thanks!

TheBobkin
VMware Employee

Hello kmcd03,

"But the VM objects location does not move/change.

If I change the Affinity Preferred/Secondary back and forth, the objects move between hosts in the fault domain.  But the Preferred/Secondary designation in the Storage Policy Affinity is opposite than the setting observed in vCenter."

Okay, so just to clarify: are ALL Objects being stored on the diametrically opposite (from what you expect) Fault Domains? If so, then set it the opposite way. Naming Storage Policies after Fault Domains with affinity to Preferred/Secondary doesn't make much sense if you are going to change the Preferred designation at the cluster level. E.g. I could make a Fault Domain called Site-A, make it Preferred, call the Storage Policy 'Site-A' with affinity to the (currently Preferred) Site-A (hosts 1-6), then change Site-B to Preferred, and now all the data is on Site-B with a Storage Policy called 'Site-A'.

Or do you have a mix and match, e.g. VM-A with the same Storage Policy applied to all disks, but half/some stored on Site-A and half/some stored on Site-B?

I ask because what you described kept sounding like a double negative, e.g. changing Preferred from Site-A to Site-B (data moves as the preferred site has changed), but then changing the site affinity of the Preferred/Secondary Storage Policies (and reapplying) should change where the data is stored.

"If I change the Affinity Preferred/Secondary back and forth, the objects move between hosts in the fault domain. "

Do you mean the cluster Preferred or the one in the Storage Policy? And are you seeing a different effect between these, e.g. changing Preferred (cluster-level) moves the data, but changing the Storage Policy's Preferred/Secondary rule does not move the data (when reapplied)?

"This is a new environment with just few test and non-prod VMs running, so do have an opportunity to try things.  I've been doing some failure testing of this new cluster prior to moving legacy production VMs.

As with any infrastructure, do test the hell out of it before it is Production :)

Bob

kmcd03
Contributor

Sorry, I'm probably over-describing the problem.

The Affinity setting in the Storage Policy for the Preferred/Secondary fault domain is diametrically opposite to the designated Preferred fault domain configured in the cluster's Configure | "Fault Domain & Stretched Cluster" section of the vCenter Server Web Client.

I could leave it as is and accept that the site in the Affinity rule is just the opposite.  But my concern is the risk that a future change, like a patch or upgrade, corrects this problem and the VMs' objects are moved unintentionally.

I'm guessing I will have to disable the stretched cluster setting and re-configure to designate the fault domain I want to be preferred.

The goal was to have the ability to pin VMs to specific sites using Storage Policies and DRS VM-to-Host rules.  There would be a storage policy with PFTT=0 and Affinity=Preferred for Site-A.  And a second storage policy with PFTT=0 and Affinity=Secondary for Site-B.  The storage policies would allow me to keep VM (all objects) local to a site.  And the DRS rules would pin the VMs to hosts at a site.
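The scheme above pairs a storage policy with a DRS VM-to-Host rule per site, so a quick consistency check is whether each VM's policy site matches its DRS host-group site. A minimal sketch of that check (all VM and site names are hypothetical):

```python
# Sketch: each VM's storage-policy site should match its DRS VM-to-Host
# group site, so compute and storage stay local together.
# All VM, policy and group names are hypothetical.

vms = [
    {"name": "app01", "policy_site": "Site-A", "drs_host_group_site": "Site-A"},
    {"name": "app02", "policy_site": "Site-B", "drs_host_group_site": "Site-B"},
]

mismatched = [v["name"] for v in vms
              if v["policy_site"] != v["drs_host_group_site"]]
assert not mismatched, f"VMs pinned inconsistently: {mismatched}"
print("all VMs pinned consistently to their sites")
```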

I will be migrating existing production VMs into this new vSAN cluster.  These VMs do not have a stretched layer-3 network, so they must be pinned to a specific site.  There are also Test/Dev VMs that do not require replication between sites.  The next phase will be to deploy NSX to stretch the network for VMs; we can then fully take advantage of a storage policy with PFTT=1 (where Affinity is not relevant).

Eventually there is a use case in our environment for changing the Preferred fault domain designation.  We are required by a vendor to demonstrate DR every six months, and having the workloads move between sites will satisfy that requirement.  Being able to change the Preferred designation would also help us balance the workload across sites and still maintain continuity in a network outage.

Thanks.

kmcd03
Contributor

Follow-up if anyone else encounters a similar problem of the Affinity rule for Preferred/Secondary in a storage policy not following the Preferred fault domain designation in a stretched cluster.  With the help of GSS we were able to identify a duplicate entry for the preferred fault domain in CMMDS.  The object was removed, and Affinity now aligns with the cluster setting for Preferred/Secondary.

In my case the duplicate entry referenced a non-existent witness host that was used for the initial, failed configuration of the stretched cluster.  The witness was able to communicate with the ESXi hosts at layer 3, but the ESXi hosts failed to communicate with the witness because of an asymmetric route.  The routing problem was corrected and the stretched cluster was configured with a new witness.  I suspect removing the witness used in the first attempt did not successfully remove the object from CMMDS, so the Affinity in the Storage Policy was detached from the cluster designation for Preferred.
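To make the failure mode concrete: the symptom amounts to more than one preferred-fault-domain record existing, one owned by a stale witness. The sketch below models that condition on made-up data (the record layout and type name are invented for illustration; the actual CMMDS entry was identified and removed by GSS, not by a script like this):

```python
# Sketch of the condition GSS found: duplicate preferred-fault-domain
# records, one owned by a stale witness. Record layout and type name
# are made up for illustration only.

from collections import Counter

entries = [
    {"type": "PREFERRED_FAULT_DOMAIN", "owner": "witness-old", "fd": "DC-South"},
    {"type": "PREFERRED_FAULT_DOMAIN", "owner": "witness-new", "fd": "DC-North"},
    {"type": "NODE", "owner": "esx-north-01", "fd": "DC-North"},
]

pfd_entries = [e for e in entries if e["type"] == "PREFERRED_FAULT_DOMAIN"]
if len(pfd_entries) > 1:
    # Two records claiming to define the preferred fault domain: one is stale.
    print("stale entry suspected:", Counter(e["fd"] for e in pfd_entries))
```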
