maslow
Enthusiast

vSphere HA Should Run in vSphere 7 (formerly das.respectvmvmantiaffinityrules)

Hi colleagues 🙂

I have the following problem: we have a 3-node ESXi 7 cluster running 3 VMs that need to be separated onto 3 individual hosts. In case of a host failure or maintenance, though, the third VM should be allowed to run on one of the two remaining hosts, and once the host is back in production it should get separated again.

So far so easy 🙂 Setting up a VM-VM anti-affinity rule containing those 3 VMs balances them across the 3 hosts. Check!
But if a host now fails, the VM stays powered off because no sufficient resources are available.
From what I have learned, HA interprets these separation rules as "must run" by default, which explains the resource warning and why the VM stays powered off.
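In case it helps, here is a minimal pyVmomi (Python) sketch of how that rule setup could be scripted instead of clicked together; the vCenter address, credentials and the "app-vm-0x" VM names are placeholders for illustration.

# Minimal pyVmomi sketch: create a VM-VM anti-affinity rule for the 3 VMs.
# vCenter address, credentials and VM names below are placeholders.
import ssl
from pyVim.connect import SmartConnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only; validate certificates in production
si = SmartConnect(host="vcenter.example.local", user="administrator@vsphere.local",
                  pwd="***", sslContext=ctx)
content = si.RetrieveContent()

def find_obj(vimtype, name):
    # Return the first managed object of the given type with the given name.
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    try:
        return next(o for o in view.view if o.name == name)
    finally:
        view.Destroy()

cluster = find_obj(vim.ClusterComputeResource, "CGI_Test_Cluster")
vms = [find_obj(vim.VirtualMachine, n) for n in ("app-vm-01", "app-vm-02", "app-vm-03")]

rule = vim.cluster.AntiAffinityRuleSpec(name="separate-app-vms", enabled=True, vm=vms)
spec = vim.cluster.ConfigSpecEx(rulesSpec=[vim.cluster.RuleSpec(operation="add", info=rule)])
cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)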

Now vSphere documentation to the rescue:
https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.avail.doc/GUID-E0161CB5-BD3F-425F-A...
HA advanced setting: das.respectvmvmantiaffinityrules

The default is true, so I set it to false to achieve my goal, but it still doesn't work. So I contacted VMware support and they sent me this KB article:
https://kb.vmware.com/s/article/2033250?lang=en_US
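For anyone scripting it, such an HA advanced option can also be pushed through the same cluster reconfigure call; the sketch below reuses the connection and the cluster object from the snippet above. Whether HA then actually honours the option in 7.x is of course exactly the open question here.

# Sketch: set an HA advanced option on the cluster (reuses `si`, `find_obj`
# and `cluster` from the previous snippet). Depending on your environment you
# may want to merge with the existing cluster.configurationEx.dasConfig.option
# list rather than pushing a single value.
das_cfg = vim.cluster.DasConfigInfo(
    option=[vim.option.OptionValue(key="das.respectvmvmantiaffinityrules",
                                   value="false")])
spec = vim.cluster.ConfigSpecEx(dasConfig=das_cfg)
cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)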

So according to them, this HA advanced setting is no longer valid in vSphere 7 😞 I asked for an alternative solution but have not received a reply yet. The only workaround that came to my mind (sketched below) is:
- Setting up 3 host groups, each group containing one host
- Setting up 3 VM groups, each group containing one VM
- Creating a VM-Host affinity rule as "should run"
This may solve my problem, but we need this for a customer with various 3-node VM clusters that need this setup, so I would end up with a huge number of host and VM groups plus additional VM-Host affinity rules ...
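At least the bulk creation could be scripted so the per-cluster toil stays small. Here is a pyVmomi sketch (reusing the connection, find_obj, cluster and vms objects from the first snippet; group and rule names are made up) that creates one host group, one VM group and one non-mandatory "should run" rule per VM:

# Sketch: per VM, create a one-host host group, a one-VM VM group and a
# non-mandatory ("should run") VM-Host rule. Reuses `si`, `find_obj`,
# `cluster` and `vms` from the first snippet; the VM/host pairing is illustrative.
hosts = cluster.host  # the 3 ESXi hosts in the cluster
group_specs, rule_specs = [], []

for i, (vm, host) in enumerate(zip(vms, hosts), start=1):
    hg_name, vg_name = f"hg-app-{i}", f"vg-app-{i}"
    group_specs.append(vim.cluster.GroupSpec(
        operation="add",
        info=vim.cluster.HostGroup(name=hg_name, host=[host])))
    group_specs.append(vim.cluster.GroupSpec(
        operation="add",
        info=vim.cluster.VmGroup(name=vg_name, vm=[vm])))
    rule_specs.append(vim.cluster.RuleSpec(
        operation="add",
        info=vim.cluster.VmHostRuleInfo(
            name=f"should-run-app-{i}",
            enabled=True,
            mandatory=False,          # "should run", not "must run"
            vmGroupName=vg_name,
            affineHostGroupName=hg_name)))

spec = vim.cluster.ConfigSpecEx(groupSpec=group_specs, rulesSpec=rule_specs)
cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)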
Hope you understand what I want to achieve and maybe have an idea ❤️

13 Replies
depping
Leadership

Can you give this setting a try:

  • das.treatVmVmAntiAffinityAsSoft = true
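(If anyone wants to set this programmatically, it would presumably go through the same dasConfig.option mechanism sketched earlier in the thread, e.g.:)

# Presumably the same mechanism as the earlier dasConfig.option sketch;
# `cluster` is the ClusterComputeResource object looked up there.
from pyVmomi import vim

spec = vim.cluster.ConfigSpecEx(dasConfig=vim.cluster.DasConfigInfo(
    option=[vim.option.OptionValue(key="das.treatVmVmAntiAffinityAsSoft",
                                   value="true")]))
cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)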
maslow
Enthusiast

Hi Duncan, thanks for the reply. It doesn't seem to work. If I reset the host running the 3rd VM, after some time the host goes to "not responding" and the VM shows "disconnected"; in the cluster events it is logged:

"vSphere HA failover operation in progress in cluster CGI_Test_Cluster in datacenter Bremen: 1 VMs being restarted, 1 VMs waiting for a retry, 0 VMs waiting for resources, 0 inaccessible vSAN VMs".

The failed VM stays disconnected until the host is rebooted and back online 😕 It does not get reconnected or restarted on another host. On the VM there is a red event: "vSphere HA virtual machine failover failed".

After the host is back online, the VM stays offline and does not get powered on again!

 

If you like, you can have a look at SR# 23430235805 for the vCenter and ESXi logs.

We are running vCenter 7.0.3 Build 21477706 and ESXi 7.0.3 Build 21686933 with vSAN.

StephenMoll
Expert

We had issues with the das.* advanced affinity rule settings after moving our system from vSphere 6 to vSphere 7.

I believe all the settings now default to the hardest interpretation, and provided you have a vCLS appliance active in the cluster, the rules will be respected. If, however, you somehow manage to knock out all the vCLS appliances, which I believe is somewhat too easy**, then the old-fashioned HA placement engine inside ESXi kicks in. I believe this engine unfortunately can no longer be influenced by the das.* advanced settings and treats the rules as soft. So in the time it takes for the vCLS appliances to recover, the old HA functionality can place restarted VMs wherever it likes.

 

** Human nature tends to dictate a methodical approach to how systems are started up. Like reading a book, a person will start with the left-most rack and power servers on from left to right, top to bottom.

The first three hosts in a cluster get the vCLS appliances, which often leads to them being physically clumped together, i.e. the top-most three servers in the first rack to be powered on. So if that single rack suffers a failure, you could conceivably lose all vCLS appliances simultaneously. I have suggested to our TAM that VMware should think about some mechanism to let users disperse vCLS automatically, for example by defining host groups based on geographic/physical location and having a setting that tells vCenter to maintain vCLS dispersion so that no more than one vCLS appliance sits in any one of these host groups at a time. I would even go as far as to suggest that, for operators with very stringent resilience requirements, the ability to specify how many vCLS appliances to run would be nice, even to the point of having one on every host if required. That is a setting I would use.
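For what it's worth, it is easy to check how clumped the vCLS appliances currently are; a quick pyVmomi sketch (connection and content object as in the earlier snippet) that prints which host each vCLS VM runs on:

# Sketch: list which host each vCLS appliance currently runs on, to spot
# physical clumping. Reuses `content` and `vim` from the earlier connection snippet.
view = content.viewManager.CreateContainerView(content.rootFolder,
                                               [vim.VirtualMachine], True)
for vm in view.view:
    if vm.name.startswith("vCLS"):
        host = vm.runtime.host.name if vm.runtime.host else "n/a"
        print(f"{vm.name} -> {host}")
view.Destroy()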

depping
Leadership

Weird, I am testing it in my lab (8.0 though), and the VM with anti-affinity gets restarted next to one of the VMs it isn't supposed to be next to.

[Screenshot attached: Screenshot 2023-05-15 at 16.33.06.png]

maslow
Enthusiast

Could vSAN (with the default storage policy) be the problem here? The VM stays disconnected until the 3rd host is back online and available, and only then gets restarted on it, even with the advanced setting you mentioned in place.

I tested it today by creating 3 host groups and 3 VM groups, then 3 VM-Host should-run rules; this works fine, but it is a mess when looking at the DRS cluster config 🙂 Just setting up a simple rule together with the advanced setting would be nicer.

depping
Leadership

I use vSAN as well, and for me the anti-affinity works as expected. I am not sure what causes it, maybe support knows; it could be that you are hitting a bug.

maslow
Enthusiast

Hm ok... will wait for their log analysis and will post the outcome here 🙂

StephenMoll
Expert

I have an email thread record between myself and our TAM on a bug around das.respectvmvmantiaffinityrules.

When we had problems I asked if the 'problem' alluded to in the topic below was going to be fixed:

vSphere HA ignores das.respectvmvmantiaffinityrule... - VMware Technology Network VMTN

In my queries about progress on the promised 2022 Q3 fix, I have not seen any release notes stating the problem has been resolved, nor received any confirmation that the work was even being progressed. The last email I had on the topic said that our TAM was not very happy with the response to his query, which I assume means it wasn't progressing.

depping
Leadership

I will set up a lab with the version you run and try to figure out what the issue is here, weird.

maslow
Enthusiast

Thanks, much appreciated 🙂

StephenMoll
Expert

For reference, our specific scenario involved a two-host cluster.

Host-1 and Host-2.

Each of these had a set of Windows WSFC clustered VMs. This was the bulk of the workload, and the WSFC VMs for Host-1 had anti-affinity with all the WSFC VMs on Host-2, and vice versa.

Then there were a handful of 'floating' VMs, i.e. single instances that would be recovered by HA if one of these hosts failed. This worked brilliantly under vSphere 6.x, but it all went to pot when we moved to vSphere 7. It came down quite simply to the advanced HA das.* settings not having any effect.

When one of the hosts failed, a few of the WSFC VMs would be recovered to the surviving host, using up the resources that should have been used by the floating VMs and stopping them from starting. We worked around the issue using other features, which has restored the behaviour we want, but it does mean this cluster configuration is now different from our other clusters. But meh! No biggie.

 

depping
Leadership (Accepted Solution)

I just reproduced it in the lab with 7.0 U3, and this is indeed a known issue. None of those settings will work with the current architecture, unfortunately. The placement engine doesn't take those advanced settings into account.

maslow
Enthusiast

Oh... damnit, so is the only workaround using DRS host and VM groups with should-run rules, or will there be another workaround or fix? Or is one forced to upgrade to vSphere 8? ;>
