We have one of several systems that is not behaving as expected. All the systems are configured in the same way, but this one in particular is misbehaving...
We are required to pin host workloads into pre-defined groups. We have done this with affinity rules that keep each group of VMs together, plus an anti-affinity rule containing one VM from each group, which keeps the groups apart on separate hosts. This works well and is proven to work on all systems except one.
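For reference, this is roughly how the rule membership can be checked with pyVmomi; a minimal sketch, assuming a vCenter at vcsa.example.local and a cluster called Cluster-A (both placeholder names):

    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    # Placeholder connection details; certificate verification disabled for brevity.
    si = SmartConnect(host="vcsa.example.local", user="administrator@vsphere.local",
                      pwd="***", sslContext=ssl._create_unverified_context())
    content = si.RetrieveContent()

    # Locate the cluster by its (placeholder) name.
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    cluster = next(c for c in view.view if c.name == "Cluster-A")

    # Print every affinity/anti-affinity rule and its member VMs.
    for rule in cluster.configurationEx.rule:
        kind = type(rule).__name__  # ClusterAffinityRuleSpec or ClusterAntiAffinityRuleSpec
        members = [vm.name for vm in (getattr(rule, "vm", None) or [])]
        print(f"{rule.name} [{kind}] enabled={rule.enabled}: {', '.join(members)}")

    Disconnect(si)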
This system goes into a migration frenzy every 5 minutes (probably triggered by the periodic DRS checks).
The VMs in each group have their memory requirements reserved, but the totals for each set fall well within the host resource capacity. The hosts have 64GB RAM each, and none of the groups has a total VM memory reservation of more than 57GB. What we are seeing on this one particular system is that even a group with a total reservation of 47GB refuses to power on its last VM, claiming there is "not enough memory". This is despite the host summary showing between 50% and 75% memory utilisation and the final VM only trying to reserve 8GB!
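As a sanity check on those numbers, this is the sort of pyVmomi sketch we can use to sum the powered-on VM reservations per host against each host's capacity (again with placeholder connection and cluster names); note it does not account for the extra per-VM memory overhead the host also reserves:

    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    # Placeholder connection details, as in the sketch above.
    si = SmartConnect(host="vcsa.example.local", user="administrator@vsphere.local",
                      pwd="***", sslContext=ssl._create_unverified_context())
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    cluster = next(c for c in view.view if c.name == "Cluster-A")

    # For each host: total RAM, consumed RAM, and the sum of powered-on VM reservations.
    for host in cluster.host:
        capacity_mb = host.summary.hardware.memorySize // (1024 * 1024)
        consumed_mb = host.summary.quickStats.overallMemoryUsage
        reserved_mb = sum(
            (vm.summary.config.memoryReservation or 0)
            for vm in host.vm
            if vm.runtime.powerState == vim.VirtualMachinePowerState.poweredOn)
        print(f"{host.name}: capacity {capacity_mb} MB, "
              f"consumed {consumed_mb} MB, VM reservations {reserved_mb} MB")

    Disconnect(si)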
Some of the groups will power on OK, but during one of the migration frenzies one of a group's VMs will be found on a different host from the rest, causing an affinity rule violation. Every 5 minutes the whole group then moves to another host, and this repeats.
When we set DRS to partially automated, expecting to see what recommendations were being raised, we got nothing. We waited half an hour and there was not a peep out of the system. As soon as we set it back to fully automated, the migration frenzies kicked off again.
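In case it helps anyone reproduce this, the pending DRS recommendations (and any faults blocking them) can also be read via the API; a minimal sketch under the same placeholder assumptions as above:

    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    # Placeholder connection details, as in the earlier sketches.
    si = SmartConnect(host="vcsa.example.local", user="administrator@vsphere.local",
                      pwd="***", sslContext=ssl._create_unverified_context())
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    cluster = next(c for c in view.view if c.name == "Cluster-A")

    # Recommendations DRS is currently proposing (empty in our partially automated test).
    for rec in cluster.recommendation:
        print(f"recommendation {rec.key}: rating {rec.rating}, reason: {rec.reasonText}")
        for action in rec.action:
            print(f"  action: {type(action).__name__}")

    # Faults raised when DRS could not satisfy a placement or rule.
    for fault in cluster.drsFault:
        print(f"DRS fault: {fault.reason} ({len(fault.faultsByVm)} VM entries)")

    Disconnect(si)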
Comparing this system with others, we haven't been able to identify any significant configuration differences.
My feeling is that something is messed up inside vCenter (VCSA 6.7u2), and I would be tempted to deploy a new one. I would, however, like to understand, if possible, how and why this one has got into the state it is in, because I want to know whether something we have started doing could have caused it.
Has anyone out there seen similar problems?
Not sure why you face this strange behavior. I’ve seen it only once, and it was because one of the hosts in the cluster had five active uplinks, while other hosts in the cluster had only four.
Each DRS recommendation is based on complex algorithms that try to achieve the optimum level of VM "happiness". The recommendation details are stored in vCenter Server as "DRM" files. There's a website that you can use to analyze these DRM files:
Hope you find it useful.
Thanks, the DRS dump analysis looks like it would be really useful; however, I think our information assurance people would have kittens if I threatened to upload files to that site from this system. It operates at a level of classification that precludes direct transfer of files to another network, especially the Internet. Is there any way to examine these DRM files locally?
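What I have in mind for staying offline is something like the sketch below, which just decompresses and searches the dumps in place on the appliance. I'm assuming the dumps are gzip-compressed text and that they live under /var/log/vmware/vpxd/drmdump on a 6.7 VCSA; both are assumptions on my part, so treat the path as a placeholder:

    import gzip
    import pathlib
    import sys

    # Assumed location of the vpxd DRM dumps on a 6.7 VCSA; treat as a placeholder.
    DUMP_DIR = pathlib.Path("/var/log/vmware/vpxd/drmdump")
    # Search string, e.g. a VM or rule name, passed on the command line.
    needle = sys.argv[1] if len(sys.argv) > 1 else "group"

    for dump in sorted(DUMP_DIR.rglob("*.gz")):
        try:
            with gzip.open(dump, "rt", errors="replace") as fh:
                for lineno, line in enumerate(fh, start=1):
                    if needle in line:
                        print(f"{dump}:{lineno}: {line.rstrip()}")
        except OSError as exc:
            print(f"could not read {dump}: {exc}", file=sys.stderr)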
I will get the local engineer for this system to examine the uplink configuration. As far as I know, all the hosts on this system have four connections only and are all identical, each having two 10GbE connections for networking and two 10GbE connections for iSCSI.
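To save him logging on to every host, I am thinking of comparing the uplinks with something along these lines (same placeholder vCenter and cluster names as in the sketches above):

    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    # Placeholder vCenter and cluster names.
    si = SmartConnect(host="vcsa.example.local", user="administrator@vsphere.local",
                      pwd="***", sslContext=ssl._create_unverified_context())
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    cluster = next(c for c in view.view if c.name == "Cluster-A")

    # Count the physical NICs on each host and report their link speeds.
    for host in cluster.host:
        pnics = host.config.network.pnic
        speeds = [p.linkSpeed.speedMb if p.linkSpeed else 0 for p in pnics]
        print(f"{host.name}: {len(pnics)} uplinks, link speeds (Mb/s): {speeds}")

    Disconnect(si)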
Hi Stephen, VMware engineers have developed an offline tool and released it on the "VMware Flings" website. I haven't tried it, but it should give you an insight into why DRS decided to do a migration:
Hope it helps.
It might have been there but seems to be gone now.
The "DRS Dump Insight" fling download link takes me to "DRS Dump Insight ", which is the same place as your first link. So I presume any offline/standalone version is no longer available. It's a shame.