vSAN Health check - Disk balance

Brainbugg · ‎07-24-2019

Hi There

When the vSAN Health check recommends a proactive disk rebalance, how do you troubleshoot the rebalance task? I've run it on several occasions, but it never progresses beyond 1% completion. I'm running a vSAN 6.6.1 cluster.

How long should this task take to run, and what are the recommended disk thresholds?

What should the following metrics be for the cluster to be in a healthy state?

Average Disk Usage

Maximum Disk Usage

Maximum Variance

LM Balance Index

Thanks

Regards

TheBobkin · ‎07-24-2019

How long have you tried leaving the task running?

This task remains at 1% until it has actually started rebalancing, at which point it moves to 5% and stays there until it is done (100%). The percentage is not linear and does not change as data moves. If you want to see what it is doing, monitor it from RVC using vsan.proactive_rebalance_info <pathToCluster>.

https://www.virten.net/2017/05/vsan-6-6-rvc-guide-part-2-cluster-administration/#vsan-proactive_reba...

By default this task runs for 24 hours or until it is done rebalancing.

Regarding usage - ideally you should have all disks with approximately the same usage (e.g. ~10% max variance) but this isn't always going to be possible due to factors such as: size of capacity disks, storage per node, large objects in the cluster, number of nodes in cluster available for component placement.
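As a rough illustration of how the metrics in the question relate to each other, here is a minimal Python sketch. It assumes "variance" means the gap, in percentage points, between the fullest and the least-full capacity disk, and it uses the 30% figure only as an example trigger. The function name and the usage figures are hypothetical, not taken from vSAN or from a real cluster.

```python
def disk_balance_summary(usage_pct, variance_threshold=30.0):
    """Summarise per-disk usage percentages.

    Assumption: variance is the difference between the fullest and the
    least-full disk, in percentage points. The threshold is an example value.
    """
    if not usage_pct:
        raise ValueError("need at least one disk")
    average = sum(usage_pct) / len(usage_pct)
    maximum = max(usage_pct)
    max_variance = maximum - min(usage_pct)
    return {
        "average_disk_usage": round(average, 2),
        "maximum_disk_usage": maximum,
        "maximum_variance": round(max_variance, 2),
        "exceeds_threshold": max_variance > variance_threshold,
    }


if __name__ == "__main__":
    # Hypothetical usage figures for four capacity disks.
    print(disk_balance_summary([42.0, 55.0, 61.0, 78.0]))
```

With these made-up figures the fullest disk sits 36 points above the emptiest one, so the sketch would flag the cluster as unbalanced even though average usage looks moderate.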

VMware advise leaving 30% unused space on vsanDatastore for rebalancing and re-protecting data:

Planning Capacity in vSAN

Bob

Brainbugg · ‎07-30-2019

Thanks for the feedback, I'll monitor the process more closely and see what happens. Currently I kick off the rebalance at night, so I'm not monitoring it. The only reason I asked is that whenever I check the Health checks, this warning keeps appearing.

TheBobkin · ‎07-30-2019

This is likely because the current Proactive Rebalance mechanism (with default settings) has a relatively low max data transfer target rate (so as not to potentially cause storage contention) and can be lazy with regard to the % variance reached, e.g. it may rebalance just past the variance threshold, and then the next day the thin components grow on the other disks, pushing it back past the 30% variance health check trigger. Thankfully the mechanisms and UX of this look set to be improved in the future, and it should become simpler to tune than it is now.

Then again, other things could be causing it to flip between yellow and green over time: data migrations or deletions, changes of Storage Policies (especially in smaller clusters with Objects that are large relative to their disk/Disk-Group size), relatively fast growth in some but not all vmdk or snapshot Objects, random administrators putting hosts in MM (Maintenance Mode) for longer than an hour (with default settings) while not being aware of vSAN, and so forth.

Check how much it is moving and where. Consider increasing the rate and/or lowering the threshold if you want it to move more data (in less time) and reach a tighter balance, but do understand that moving 50% of the data around the cluster to have disks as near balanced as possible is probably overkill (and obviously bear in mind other storage traffic; the vSAN vSphere Performance graphs are your friend here).

Follow the link I posted in the previous comment; it is basically the man page for vSAN RVC commands and shows all the configurable variables regarding proactive rebalance.

(bonus tip: the default rate is 51200MB, and the -v switch takes a decimal, e.g. for a 20% max variance target use -v .20 )
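To make the arithmetic behind the rate and the -v switch concrete, here is a small Python sketch. It assumes the 51200MB rate is a per-hour, per-node cap and that 1 GB = 1024 MB; neither is stated in this thread, so check the RVC help output before relying on it. Both helper names are hypothetical and only build a string or a number; they do not talk to RVC.

```python
def variance_flag(percent):
    """Convert a percentage such as 20 into the decimal form passed to -v."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    return f"-v {percent / 100:.2f}"


def hours_to_move(data_gb, rate_mb=51200, nodes=1):
    """Rough lower bound on the hours needed to move data_gb.

    Assumptions: rate_mb is MB per hour per node, every node moves data at
    the full rate, and 1 GB = 1024 MB.
    """
    if rate_mb <= 0 or nodes <= 0:
        raise ValueError("rate_mb and nodes must be positive")
    return (data_gb * 1024) / (rate_mb * nodes)


if __name__ == "__main__":
    print(variance_flag(20))
    print(hours_to_move(500))
    print(hours_to_move(500, nodes=4))
```

Under those assumptions, a single node at the default rate would need about ten hours to shift 500 GB, which helps explain why a default run can end with the cluster only just under the variance threshold.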

Bob
