WeiShen2020
Contributor

DRS PROBLEM: VM keeps being migrated between two ESXi hosts after a network outage

Hi Guys,

I have a vSphere 5 cluster with 10 ESXi nodes, and vCenter is 5.1. There was a network outage a few days ago, and right after that I noticed DRS became super 'smart'.

1) Whenever I migrate a VM to a different host, it gets migrated back when the next automatic DRS check runs (every 5 minutes by default).

2) One VM keeps migrating between two hosts. The first DRS run migrates it from Host-A to Host-B, and the next DRS run migrates it back to Host-A.

* DRS Automation Level = Fully Automated, with the migration threshold in the middle (level 3: apply priority 1, 2 and 3 recommendations)

Is this because vCenter ended up with corrupted DRS balancer data during the network outage? How can I fix this?

Thanks.

9 Replies
tedg_vCrumbs
Enthusiast

Do you have any DRS affinity or anti-affinity rules that could be causing this or conflicting with each other?

------ tedg Don't forget to mark posts as helpful or correct if they deserve it!
WeiShen2020
Contributor

Thanks Ted. I don't.

And actually, an affinity rule wouldn't make that VM keep migrating between two hosts all the time.

WeiShen2020
Contributor

Could anyone help?

Thanks.

Alistar
Expert

Hi There,

I guess what could help is resetting the Fault Domain Manager (fdm for short; strictly speaking it is the vSphere HA agent rather than a DRS daemon) on each of the ESXi hosts in that cluster. Be sure to do it one host at a time:

  • Right-Click the ESXi host and select "Reconfigure for vSphere HA" and wait until the fdm reinstalls
  • After that is done try running DRS again

If that does not work, perhaps vCenter has a corrupt cluster table (the DRS calculations themselves are done in vCenter). Try changing your HA admission control settings and setting a different DRS automation level or migration threshold (try Manual), save the options, and then apply the default settings again.

If all else fails, I guess all you can do is create a brand new cluster and move your ESXi hosts there, or delete and re-create the current one. If the problem is inside the vCenter database, these are the options you have without VMware support, in my opinion.

Good luck!

Stop by my blog if you'd like 🙂 I dabble in vSphere troubleshooting, PowerCLI scripting and NetApp storage - and I share my journeys at http://vmxp.wordpress.com/
WeiShen2020
Contributor

Hi Alistar,

For your first option, I have actually already rebooted all ESXi nodes one by one, and it didn't solve the problem. Should I still try your option 1, or did the reboot already do the same thing?

Also, during the reboot process I changed the DRS setting from Fully Automated to Partially Automated and switched it back. But this didn't solve my problem either.

Thanks.

Alistar
Expert

Hi WeiShen,

Putting a host into maintenance mode and taking it back out reinstalls and reinitializes the fdm on that ESXi host, so if you put each host into maintenance mode for the reboot, I guess this was done already.

Can you please post vmkernel.log, fdm.log and vmkwarning.log from your ESXi host here after a problematic DRS migration has taken place? In the worst case, something could have gone corrupt in the database and messed up the DRS calculations, which could be solved by establishing a new cluster. But let's see if we can find something in the logs first.

WeiShen2020
Contributor

Actually, it was fine for less than a day after switching the DRS mode to Partially Automated (and back) and rebooting the ESXi hosts. But after less than 20 hours, the problem came back.

I have logs attached. Thanks for your help.

Alistar
Expert

Hi WeiShen, and thank you for the logs.

Can you please state which VM is getting migrated back and forth between the ESXi hosts? I'll take a look in the meantime...

Edit: I can see quite a few migrations of SYDSQL05, every 10 minutes from 04:24 to 05:16 GMT.

Can you please post the DRS logs from the location given in the following KB: "VMware KB: Location of vCenter Server log files"? No reason is given in the vmkernel and fdm logs, and I guess we will find the real reason for the crazy DRS behaviour there.

WeiShen2020
Contributor

Hi Alistar,

Yes, it's SQL05, and I have attached the vpxd log.

I saw many warnings like the one below. Is this timestamp in UTC? If so, why does it show +11:00? Confusing.

2015-04-02T07:48:11.251+11:00 [05756 warning 'vpxdoverheadMemory' opID=2A8DE6BF-00005DF4-74-95-20-85-ea-e3] [VmMo::GetMemoryOverheadInt] VM SYDSQL05: Use overheadMax stat (235) from host 37 while the VM is on host 147

And I saw ~386 occurrences of "cluster is imbalanced".
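A note on the timestamp question: the `+11:00` suffix is an ISO 8601 UTC offset, so the stamp is local time eleven hours ahead of UTC rather than UTC itself. A minimal Python sketch of converting such a stamp to UTC and counting a phrase in the log is shown here. The helper names `to_utc` and `count_phrase` are made up for illustration, and the sketch assumes every line of interest starts with a timestamp in the same layout as the warning quoted above.

```python
import re
from datetime import datetime, timezone

# Sample line copied from the vpxd warning quoted above.
SAMPLE = (
    "2015-04-02T07:48:11.251+11:00 [05756 warning 'vpxdoverheadMemory' "
    "opID=2A8DE6BF-00005DF4-74-95-20-85-ea-e3] [VmMo::GetMemoryOverheadInt] "
    "VM SYDSQL05: Use overheadMax stat (235) from host 37 while the VM is on host 147"
)

# Leading timestamp: date, time with milliseconds, then a +HH:MM or -HH:MM offset.
TS_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})([+-]\d{2}):(\d{2})"
)


def to_utc(line):
    """Return the line's leading timestamp converted to UTC, or None if absent."""
    match = TS_RE.match(line)
    if match is None:
        return None
    stamp, off_hours, off_minutes = match.groups()
    # Drop the colon from the offset so %z parses it on any Python 3 version.
    local = datetime.strptime(
        f"{stamp}{off_hours}{off_minutes}", "%Y-%m-%dT%H:%M:%S.%f%z"
    )
    return local.astimezone(timezone.utc)


def count_phrase(lines, phrase="cluster is imbalanced"):
    """Count the lines that contain the phrase, ignoring case."""
    needle = phrase.lower()
    return sum(1 for line in lines if needle in line.lower())


if __name__ == "__main__":
    print(to_utc(SAMPLE).isoformat())  # 2015-04-01T20:48:11.251000+00:00
```

So 07:48 on 2 April in the log corresponds to 20:48 UTC on 1 April. To count the imbalance messages, `count_phrase(open(path))` can be run against the vpxd log, where `path` is wherever the log file was saved.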

Thanks.
