I have a vSphere 5 cluster with 10 ESXi nodes, and vCenter is 5.1. There was a network outage a few days ago, and right after that I noticed DRS became super 'smart'.
1) Whenever I migrate a VM to a different host, it gets migrated back when the next automatic DRS check runs (every 5 minutes by default).
2) One VM keeps migrating between two hosts. The first DRS run migrates it from HOST-A to HOST-B, and the next DRS run migrates it back to HOST-A.
* DRS Automation Level = Fully Automated, with the threshold in the middle (level 3: Apply priority 1, 2 and 3 recommendations)
Is this because vCenter ended up with corrupted DRS balancing data during the network outage? How can I fix this?
Do you have any DRS affinity or anti-affinity rules that could be causing this or conflicting with each other?
I guess what could help is resetting the Fault Domain Manager (FDM for short; it is the vSphere HA agent that runs on each host) on each of your ESXi hosts inside that cluster. Be sure to do it one host at a time.
If that does not work, perhaps the cluster table in vCenter is corrupt (the DRS calculations themselves are done in vCenter), so try changing your HA admission control settings, set a different DRS automation level or aggressiveness (try manual) and save the options, then apply the default settings again.
If all else fails, I guess all you can do is create a brand new cluster and move your ESXi hosts there, or delete and re-create the current one. If the problem is inside the vCenter DB, these are the options you have without VMware support, in my opinion.
For your first option, I have actually already rebooted all ESXi nodes one by one, and it didn't solve the problem. Should I still try your option 1, or did the reboot already do the same thing?
During the reboot process, I also changed the DRS setting from Fully Automated to Partially Automated and then switched it back. That didn't solve my problem either.
Putting a host into maintenance mode and taking it back out reinstalls and reinitializes the FDM on that ESXi host, so if you put each host into maintenance mode before rebooting it, I guess this was done already.
Can you please post vmkernel.log, fdm.log and vmkwarning.log from your ESXi host here after a problematic DRS migration has taken place? In the worst case something could have become corrupt in the database and messed up the DRS calculations, which could be solved by establishing a new cluster, but let's see if we can find something in the logs.
Hi WeiShen, and thank you for the logs.
Can you please state which VM is getting migrated back and forth between the ESXi hosts? I'll take a look in the meantime...
edit: I can see quite a few migrations of SYDSQL05 - every 10 minutes from 04:24 to 05:16 GMT.
Can you please post the DRS logs found at the location given in the following KB? VMware KB: Location of vCenter Server log files. No reason is given in the vmkernel and fdm logs, and I guess we will find the real reason for the crazy DRS behavior there.
Yes, it's SQL05, and I have attached the vpxd log.
I saw many warnings like the one below. Is the timestamp in UTC? If so, why does it show +11:00? That is confusing.
2015-04-02T07:48:11.251+11:00 [05756 warning 'vpxdoverheadMemory' opID=2A8DE6BF-00005DF4-74-95-20-85-ea-e3] [VmMo::GetMemoryOverheadInt] VM SYDSQL05: Use overheadMax stat (235) from host 37 while the VM is on host 147
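On the +11:00 question: the timestamp is in ISO 8601 form, so the trailing +11:00 is a UTC offset, meaning the stamp is local time 11 hours ahead of UTC rather than UTC itself. A minimal Python sketch of the conversion, useful when lining the vpxd entries up against times quoted in GMT (the helper name to_utc is just an illustration, not part of any VMware tooling):

```python
from datetime import datetime, timezone

def to_utc(stamp: str) -> datetime:
    """Convert an ISO 8601 timestamp carrying a UTC offset to UTC."""
    local = datetime.fromisoformat(stamp)  # keeps the +11:00 offset
    return local.astimezone(timezone.utc)

# Timestamp copied from the warning quoted above.
utc = to_utc("2015-04-02T07:48:11.251+11:00")
print(utc.isoformat(timespec="milliseconds"))
# 2015-04-01T20:48:11.251+00:00
```

So this particular warning was logged at 20:48 UTC on 1 April.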
I also saw about 386 "cluster is imbalanced" messages.
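For anyone who wants to tally these from the vpxd log instead of counting by eye, here is a rough Python sketch. It assumes the overheadMemory warning always has the shape of the line quoted above and that the imbalance messages contain the literal phrase "cluster is imbalanced"; the function name summarize and the second sample line are made up for illustration and are not real vpxd output.

```python
import re

# Pattern modelled on the single warning quoted in this thread.
OVERHEAD_RE = re.compile(
    r"VM (?P<vm>[^:\s]+): Use overheadMax stat \((?P<stat>\d+)\) "
    r"from host (?P<stat_host>\d+) while the VM is on host (?P<vm_host>\d+)"
)

def summarize(lines, phrase="cluster is imbalanced"):
    """Count imbalance messages and collect (vm, stat_host, vm_host) mismatches."""
    imbalanced = 0
    mismatches = []
    for line in lines:
        if phrase in line.lower():
            imbalanced += 1
        match = OVERHEAD_RE.search(line)
        if match:
            mismatches.append(
                (match["vm"], int(match["stat_host"]), int(match["vm_host"]))
            )
    return imbalanced, mismatches

sample = [
    # Real line from this thread:
    "2015-04-02T07:48:11.251+11:00 [05756 warning 'vpxdoverheadMemory' "
    "opID=2A8DE6BF-00005DF4-74-95-20-85-ea-e3] [VmMo::GetMemoryOverheadInt] "
    "VM SYDSQL05: Use overheadMax stat (235) from host 37 while the VM is on host 147",
    # Hypothetical line, only to exercise the phrase counter:
    "placeholder entry: cluster is imbalanced",
]
print(summarize(sample))
```

To run it against a real file, pass the open file object in place of sample, since iterating over a file yields its lines.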