VMware Cloud Community
qmnj
Contributor

Rebalance Virtual SAN Cluster task stuck at 5%

I'm currently running vSAN 6.2 in our environment, which comprises 6 hosts. All vSAN health checks have passed except for the Cluster check. It was giving me a Virtual SAN Disk Balance warning with the option to rebalance the disks. I ran the rebalance option and now the task is stuck at 5%. Any ideas on how to kill this task? It's been running for over 24 hours.

14 Replies
MBrownWFP
Enthusiast

In the same section where you triggered the rebalance (Cluster > Monitor tab > Virtual SAN > Health > Cluster > Virtual SAN Disk Balance) there will be a button to stop the process.

FYI I've had to rebalance a couple of times recently. As far as I remember the Task entry never moves past 5%. It stays there until rebalance is finished and then the task gets marked as Complete and eventually disappears from Recent Tasks.

I monitored rebalance progress by exporting data from the "Disk Balance" tab to Excel and calculating the total amount of data left to move. A bit clunky but it worked for me.

A "Total data to move" field in this view would be helpful.
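
The spreadsheet calculation described above can also be scripted. This is only a minimal sketch, assuming the "Disk Balance" tab has been exported to CSV and that the per-disk amount sits in a column named "Data to move (GB)". That column name, the file layout, and the sample values are assumptions for illustration, so adjust them to match the actual export.

```python
import csv
import io

def total_data_to_move(csv_text, column="Data to move (GB)"):
    """Sum the per-disk 'data to move' values from an exported CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    total = 0.0
    for row in reader:
        value = (row.get(column) or "").strip()
        if value:  # skip disks with an empty cell
            total += float(value)
    return total

# Hypothetical export; real column names and values will differ.
sample = """Host,Disk,Data to move (GB)
esx01,naa.0001,120.5
esx02,naa.0002,0
esx03,naa.0003,79.5
"""
print(total_data_to_move(sample))
```

Re-exporting and re-running this every so often gives a rough view of how much data is still left to move.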

zdickinson
Expert

Good morning, that's good info.  I believe you should be able to monitor the process through RVC as well.  VSAN 6.0 Part 9 - Proactive Re-balance - CormacHogan.com  Thank you, Zach.

qmnj
Contributor

Yep, I did that already and the rebalance seems to have completed, but the task is still sitting in Recent Tasks. See the attached screenshot.

srodenburg
Expert

The task stops automatically after 24 hours. Only then will it go to 100% / Complete. Alternatively, stop it manually with the "stop rebalance" button. But the 5% thingy stays on 5% until the job is stopped either by you or after 24 hours. It's a bug.

By the way, the slightest imbalance (even 1%) triggers the "imbalanced health alert". Duncan Epping already reported it to dev. Hopefully, these sorts of "bugs" get squashed in an upcoming release.

qmnj
Contributor

Yea it must be a bug since I stopped the rebalance task a few days ago and the rebalance task is still showing at 5%.

vpradeep01
VMware Employee

Hello,

Good day!

It seems this issue has so far only been reported on vCenter 6.0 U2.

Cause:

  • The task is set to 1 percent completed when the task is created.
  • The task is set to 5 percent completed upon issuing the command to rebalance the cluster.
  • It then waits for the rebalance to complete before setting the percent done to 100.
  • During the waiting period, it checks whether the rebalance is done (via a clom-tool command). If not, it sleeps for 100 seconds and checks again.
  • The logic to update the percentage completed is not implemented yet. Therefore, the task stays stuck at 5% until it completes, at which point it is set to 100%.

By default when triggered from the VC UI, the task will run for 24 hours or whenever the rebalance effort is done, whichever comes first.
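
The behaviour described in the Cause list can be sketched roughly as follows. This is not the actual health service code, just an illustration of why the task only ever reports 1%, 5% and 100%. The function name and the injectable `is_rebalance_done`, `sleep` and `clock` parameters are made up for the example; the real check goes through clom-tool.

```python
import time

POLL_INTERVAL_S = 100          # sleep between checks, per the description above
MAX_RUNTIME_S = 24 * 60 * 60   # default 24-hour window when triggered from the UI

def run_rebalance_task(is_rebalance_done, sleep=time.sleep, clock=time.monotonic):
    """Return the list of progress values the task would report."""
    progress = [1]             # set when the task is created
    progress.append(5)         # set once the rebalance command is issued
    start = clock()
    # No intermediate updates: progress stays at 5 for the whole wait.
    while not is_rebalance_done() and clock() - start < MAX_RUNTIME_S:
        sleep(POLL_INTERVAL_S)
    progress.append(100)       # rebalance finished or the window expired
    return progress

# Simulated run with a fake clock so nothing actually sleeps.
fake_now = [0]

def fake_sleep(seconds):
    fake_now[0] += seconds

checks = iter([False, False, True])  # rebalance reports done on the third check
print(run_rebalance_task(lambda: next(checks), fake_sleep, lambda: fake_now[0]))
```

Nothing between the 5 and the 100 is ever written, which matches what everyone is seeing in Recent Tasks.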

Workaround:

  • In order to kill this task when it is stuck, you need to restart vpxd and the health service on all the hosts ( /etc/init.d/vmware-vsan-health restart )
  • Restarting the vpxd service will clear the rebalance task that is stuck in the UI, and restarting the vsan-health service after the vpxd restart will prevent future rebalance tasks from being stuck for days (UI side).
  • Use the RVC command vsan.disks_stats to check current disk usage.

Resolution:

There is no resolution for this issue as of now.

The fix will be made on the vCenter Server side, mostly involving the health service plugin. Hopefully in 6.0 U3.

gustavocpw
Contributor

Same problem here... synchronization has apparently ended (no more warning in the Monitor tab), but the task is stuck at 5%.

I'll try the vpradeep01 workaround

Thanks!

vpradeep01
VMware Employee

Sure.


Correction:

In order to kill this task when it is stuck, you need to restart vpxd and the health service on the vCenter Server, or else reboot the vCenter Server.

MBrownWFP
Enthusiast

The task will clear on its own. Not ideal and hopefully this is streamlined in future releases.

I would much rather let the task entry run its course than restart services on servers running a production workload. (This of course assumes there is a production workload running.)

Thanks,

Matt

GreatWhiteTec_x
Contributor

This should help clarify what's going on. https://greatwhitetec.com/2016/10/12/vsan-proactive-rebalance/

FM19999999
Enthusiast

I happened to try putting one host in maintenance mode and then taking it back out. It looks like that cleared it and the task completed.

I'm running all 6.0 and 6.2

GreatWhiteTec
VMware Employee

Thanks for the reference. Yes... when you kick off the proactive rebalance you are opening a 24-hour window for the rebalance to take place. Use RVC to track progress. The UI does not refresh this status. Automatic rebalance kicks in when you hit 80% of drive capacity. This can be changed under advanced settings for each host.
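
To make the 80% figure concrete, here is a tiny sketch that flags disks at or above that level. It only illustrates the threshold mentioned above and is not how vSAN itself decides to rebalance; the function name, disk names and usage figures are invented for the example.

```python
def disks_over_threshold(usage_percent, threshold=80.0):
    """Return disk names whose used capacity meets or exceeds the threshold."""
    return sorted(name for name, used in usage_percent.items() if used >= threshold)

# Hypothetical per-disk usage in percent.
usage = {"naa.0001": 62.0, "naa.0002": 81.5, "naa.0003": 80.0}
print(disks_over_threshold(usage))
```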

ironman13
Contributor

You can view the state with the RVC command vsan.proactive_rebalance_info 0

hkg2581
VMware Employee

Please upgrade your vCenter to the latest available patch for 6.0 Update 3 and your hosts to the latest 6.0 Update 3 patch 06, which addresses this problem along with a few other critical fixes.

KB2146345 - ESXi host experiences a PSOD due to a vSAN race condition

KB2145347 - Component metadata health check fails with invalid state error

KB2150189 - vSAN de-staging may cause a brief PCPU lockup during heavy client I/O

KB2150395 - Bytes to sync values for RAID5/6 objects appear incorrectly in vCenter and RVC

KB2150396 - Using objtool on a vSAN witness node may result in a PSOD

KB2150390 - Health check for vSAN vmknic configuration may display a false positive

KB2150389 - SSD congestion may cause multiple virtual machines to become unresponsive

KB2150387 - vSAN Datastores may become inaccessible during log or memory congestion

KB2151127 - vSAN and VMware boot bank critical fix

KB2151132 - vSAN and VMware boot bank critical fix

Thanks, Hareesh K G Personal Blog : http://virtuallysensei.com