VMware Cloud Community
BenjaminHinkle
Contributor

vSAN 6.6 Reactive rebalance slow

I'm working on an environment that is exhibiting a troubling issue. The cluster is made up of the following:

vSAN 6.6 (upgraded from 6.5)

5 hosts w/

12 x 4TB spindles

2 x 2TB NVMe PCIe cards

10Gb vSAN network

Standard vSAN storage policy with FTT=1 and very little reserved cache

Datacenter virtualization only

There are three significant (30TB+ each) workloads in this cluster along with various other typically sized workloads.

The cluster performs fantastically from a day-to-day perspective. However, when the workloads were initially populated, things became VERY unbalanced and a significant amount of congestion was introduced (though the end users did not notice an impact). CLOMD additionally had issues on one of the hosts, which compounded the balance trouble. Most of the CLOMD issues have been resolved, and we ended up removing a host and allowing the resync to rebuild the redundancy (at the request of Support). The removed host was then re-added and allowed to sync. Proactive rebalance was initiated to start balancing the added host.

However, one of the other hosts remains unbalanced to the point of initiating Reactive Balancing. My first theory was that CLOMD was causing issues on that host too, but it's definitely running. It seems that the Reactive Balancing is extremely slow and unable to keep up with the rate of change on the disks (whereas Proactive seems to make headway). In fact, the Data To Be Moved seems to be increasing on those disks. It's becoming a bit of a concern, as one of those disks has only 2% free space now and 8 of the 12 disks on that host show capacity warnings.

So, here are my questions: is there a way to prioritize the rebalancing (particularly the reactive version)? Does anyone have suggestions for freeing up space on that one unbalanced host? It seems as though when a disk is reactive balancing, it does not participate in proactive rebalancing.

Any suggestions would be appreciated. Thanks all!

6 Replies
TheBobkin
Champion

Hello Benjamin,

Apologies, but what do you mean by "reactive balancing", are you referring to resync?

Ideally, disks shouldn't be getting near full while other disks are lightly used, but a few full disks is less of a concern than a single host or disk-group filling up would be.

Would you mind attaching the output of a few RVC commands so we can better gauge the issue here?

> vsan.disks_stats <pathToCluster>

> vsan.proactive_rebalance_info <pathToCluster>

> vsan.resync_dashboard <pathToCluster>   (just need the last line "Total" of this)

Bob

-o- If you found this comment useful please click the 'Helpful' button and/or select as 'Answer' if you consider it so, please ask follow-up questions if you have any -o-

BenjaminHinkle
Contributor

Reactive Rebalancing was added in vSAN 6.6 to react to disks reaching 80% of capacity or more. See here: Storage and Availability Technical Documents

It shows up in the vSAN health checks under Physical Disk Capacity. See the attached screenshot.

That said, I've attached the RVC outputs. As you can see, the 3rd host is the culprit.

TheBobkin
Champion

Hello Benjamin,

Thanks for clarifying.

This is not a new feature: automatic rebalancing once a disk reaches 81% utilization has been part of vSAN for a long time, but it was only added to the GUI in 6.6 (along with some changes to the actual process).

Rebalance (of any kind) is a slow process; it deliberately minimizes its transfer rate so as not to impact normal cluster operations.

You can however specify how much data you would *like* it to transfer per hour per node using the '--rate-threshold' (or '-r') option on vsan.proactive_rebalance, followed by a number in MB.
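For example, from RVC it would look something like this (the 512 MB/hour value is purely illustrative; pick a rate appropriate to your environment):

> vsan.proactive_rebalance --start --rate-threshold 512 <pathToCluster>

You can then confirm the configured rate in the output of vsan.proactive_rebalance_info.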

Bob

BenjaminHinkle
Contributor

Thanks Bob! I didn't realize the reactive rebalance existed prior to 6.6. Was/is there an RVC command to view its progress? At first glance, it doesn't appear in the vsan.proactive_rebalance_info output.

It's funny you bring up the transfer rate switch. While fiddling around last night, I saw the -r switch and wondered if it affected the reactive rebalance process, particularly since the RR transfers are not listed in the "info" output. Is this switch different from the new GUI throttle slider, and does it allow you to increase speed?

At any rate, I ended up cheating to solve the problem last night. By adding an additional disk to the known offending VM, transferring the required files, and then deleting the old disk, I was able to free up enough space on the overloaded physical disks. This allowed at least SOME control over those disks and removed the impending threat of 100% saturation.

I would request that more information be made available for this "emergency rebalance", as that's what I consider it. It doesn't seem to be able to keep up with even a moderate rate of change. All told, in 24 hours, the offending disks only moved 50MB each. I would expect this rebalance to work to get things moved "at all costs", particularly when you're 2% from capacity.

I appreciate the help, Bob. Things are under control in this environment. However, I have concerns about what happens when this occurs in the future with this customer. They like moving very, very large datasets for an environment this size, and I think vSAN is having trouble balancing 1/3 of capacity on the fly.

Thanks again!

furryhamster
Enthusiast

Pretty sure the original question is right. There is both "REACTIVE" and "PROACTIVE" balancing in 6.6. Proactive kicks in at 80%, and from what I gather, reactive kicks in much earlier, when the difference between two objects is 30%. I'm actually trying to find out more information myself.

TheBobkin
Champion

Hello furryhamster,

Just to clarify what I said before (only 6 months ago - feels like a life-time!) and with regard to your comment:

- "reactive" rebalance occurs when any capacity-tier drive passes 80% utilized space (IIRC coded as 81% utilized); this aims to redistribute components onto disks that are comparatively under-utilized, targeting the lowest utilized as priority.

A caveat to this, however, is that respecting fault domains (FD) as they apply to storage policy (SP) compliance takes precedence. For instance, an 8TB vmdk with FTT=1 (FTM=RAID1) on a 30TB-storage 3-node cluster may end up very imbalanced with no means of redistributing the components, as splitting a mirror over multiple FDs would breach SP compliance.

- "proactive" rebalance also works on a disk-to-disk storage utilization comparison (not objects, as you suggested). It rebalances disks when initiated (via the Health check button or RVC), with the default being to act on a utilization difference of 30% or more between disks; this process does not start automatically.
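Started from RVC, that looks something like the following (the 0.3 variance value shown is the default; lower it to act on smaller imbalances):

> vsan.proactive_rebalance --start --variance-threshold 0.3 <pathToCluster>

> vsan.proactive_rebalance_info <pathToCluster>

The second command lets you confirm the run is active and what thresholds it is using.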

More information on vsan.proactive_rebalance, its switches, and their functions can be found here:

virten.net/2017/05/vsan-6-6-rvc-guide-part-2-cluster-administration/

Bob