VMware Cloud Community
sandroalvesbras
Enthusiast

Virtual SAN Disk Balance

Hi,

I have a vSAN environment where this "Virtual SAN Disk Balance" alarm appeared when I put a host into maintenance mode. At the time, I ran the rebalance manually and it completed successfully.

Recently the alarm appeared again, but this time I had done nothing, and the environment had no shutdown or anything similar.

While researching this error, I found that I can run the RVC command vsan.check_limits and check whether any of the percentages are near or above 80%.

I want to see these percentages to understand why this occurred, given that our environment has not undergone any changes.
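
For reference, this is how I have been running it from RVC (the vCenter address, datacenter, and cluster names below are just placeholders for my environment, not real values):

rvc administrator@vsphere.local@vcenter.example.local
> cd /localhost/MyDatacenter/computers/
> vsan.check_limits MyCluster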

This led me to a few questions:

1 - Is there any other possible cause besides the ones I found (hardware failures/removals, a host being put into maintenance mode, or an incorrect policy change)?

For example, a failure to place a VM on a host that lacked hardware resources, or some other action that is not a physical failure?

The real question is: how do I find out what is causing this problem?

2 - Is there any way I can check these percentages in the vSphere Web Client?

3 - If we have never run the rebalance, can I lose data? Can my environment go down?

4 - The VMware documentation says I can disable the alarm. I checked, and the alert we received was yellow, that is, not critical. Even without proactive rebalance enabled, will rebalancing occur when a disk reaches 80% (per vsan.check_limits)?

5 - Is proactive rebalance supposed to start whenever the variance between disks (per vsan.check_limits) reaches 30%?

Thank you.

TheBobkin
Champion

Hello sandroalvesbrasil,

"I found the information that I can execute the RVC command (vsan.check_limits) and check if any of the percentages are next or above 80%."

vsan.disks_stats <pathToCluster> is the best summary view in my opinion - Cluster > Monitor > Disk Management in the GUI is a close second.
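
For example, something along these lines (the datacenter/cluster path is just a placeholder - adjust it to your environment):

> vsan.disks_stats /localhost/MyDatacenter/computers/MyCluster

That lists every disk on every host with its Used% and Reserved%, so an imbalance stands out immediately.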

"I want to see these percentages to understand why this occurred, given that our environment has not undergone any changes."

Sure there were changes - you stated just two lines above this that you put a node into Maintenance Mode. Which option you used, how long it was in MM, and what else you did can all affect how data is distributed. For example, in a 4-node cluster with the default RAID-1 FTM, if you place a node in MM with the 'Ensure Accessibility' option and the default CLOM repair delay timer, all the data on it will start rebuilding on the remaining 3 nodes after 60 minutes. So if this node is not taken out of MM until hours later, all the data on it will be removed (as it is stale and has been rebuilt in its absence), and its disks will all sit at 0% used compared to the now higher-than-before % used on the other nodes. Additionally, data on a cluster is rarely static and can grow over time, causing other knock-on changes.
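
If you want to confirm what the repair delay timer is set to on a host (the default is 60 minutes), you can query the advanced setting from the ESXi shell, e.g.:

esxcli system settings advanced list -o /VSAN/ClomRepairDelay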

"1 - Is there any other possibility besides what I found as (HW Failures / Removals, Host being put into maintenance mode or Incorrect policy change)?"

Have a look at which disks are imbalanced and by how much via the Disk Balance health check - the cause should be fairly obvious from what % used the disks are at and on which hosts (e.g. everything got rebuilt while a host was in MM). But yes, other causes are possible - e.g. a disk being marked as failed or a controller freaking out, the data being rebuilt elsewhere, and then the disk/controller starting to function normally again so that data gets placed there once more (whether from VM/vmdk creation, proactive rebalance, or reactive rebalance of disks over 80% used).

"3 - If we have never done the rebalancing, can I lose data? Can my environment stop?"

No, that shouldn't cause issues, as reactive rebalancing occurs without user intervention: it starts moving data off disks at 80% used (the default value) or more onto lower-%-used devices where possible. As with any system though, space management is important and you should be sizing your clusters adequately.
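
When a rebalance (reactive or proactive) is moving data, you can watch its progress with the resync dashboard in RVC (again, the cluster path is illustrative):

> vsan.resync_dashboard /localhost/MyDatacenter/computers/MyCluster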

"4 - In the VMware documentation says that I can disable the alarm. I looked and the error we received was yellow, that is, it is not critical. Even without proactive rebalance enabled, rebalancing will occur when it reaches 80% (vsan.check_limits)?"

There is no red condition for Disk Balance AFAIK, as it is not going to negatively impact anything other than potentially missing out on the performance gains of having more devices actively and equally used in the cluster. If your disk usage becomes imbalanced due to changes, then simply rebalance it.

"5 - Does proactive rebalance say it will start whenever% (vsan.check_limits) reaches 30% of disks?"

Proactive rebalance never starts automatically (as the name implies!) - it has to be initiated via the health check in the GUI or via RVC.
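
From RVC it would look something like this (cluster path illustrative, adjust to your environment):

> vsan.proactive_rebalance --start /localhost/MyDatacenter/computers/MyCluster
> vsan.proactive_rebalance_info /localhost/MyDatacenter/computers/MyCluster

The first command starts a proactive rebalance run; the second shows whether one is running and how imbalanced the disks currently are.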

Bob

sandroalvesbras
Enthusiast

Hi,

You really understand the subject, so I will ask for your patience, as I am just getting started with the product - it will take me a while to absorb your knowledge, heh.

Here we go...

I put one of the nodes into maintenance mode and chose the default option in the dialog - I did not choose the middle option; the default was already selected and seemed the safest. However, it sat at 19% for about 20 minutes, and we concluded it was stuck and worried whether we had chosen the right option, so we canceled it. Soon after, this message appeared - I mean, days after.

So I researched and found that I could run the rebalance manually. I did, and the Disk Balance status turned green - that is, problem solved.

What I mean is that none of this was done again, yet the same message returned. That is, if I have done nothing NOW, why does the same message that occurred before - and was resolved - appear again?

Do you understand now what I mean?

Looking at your answer to the first question, I understand that 20 minutes really is a short time and I could have waited longer. An operational mistake on our part, okay!

Now that you know I did not put the host into maintenance mode this time, I am left wondering: why did it happen again? Where can I look to find out what my problem may be?

I looked through the events for hardware failures or servers shutting down, but I did not find logs for this in vCenter. I then found out that vRealize also keeps logs, but I did not find any failures in vRealize either for the day on which VxRail Manager started reporting the cluster issues.

Well, I ran the rebalance again, acknowledged the alarm on the cluster, and now no errors appear.

My concern is that this error will appear again, so I want to monitor the disk balance to see whether at some point it gets close to the edge again. That way I can see what the end users are doing that is causing this failure to recur.

You told me that at 80% the rebalance will happen automatically. I had assumed it would not happen automatically if proactive rebalance was not enabled. Got it!

Now I come back to the question ... why did this happen again?

Where should I look to evaluate whether it is getting close to generating this rebalance warning again?

At that point, as I said, I've already rebalanced.
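
For now, my plan is to periodically watch the Used% spread between disks with the command you suggested, something like (path adjusted to our environment, of course):

> vsan.disks_stats /localhost/MyDatacenter/computers/MyCluster

That way I can see whether the gap between the fullest and emptiest disks starts growing again.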

[Screenshot attached: Capturar.PNG]

Thank you.
