VMware Cloud Community
srodenburg
Expert

2-node has massive imbalance, won't stop re-balancing

For some strange reason, a 2-node cluster has a massive imbalance of about 1TB. So the "pro-active rebalance" button was pressed and re-balancing began.

It is still rebalancing, days later...

Normally it stops automatically after 24 hours, and the button is indeed greyed out now, but it is still going at it. The values of "number of objects to re-sync" and "bytes left to re-sync" keep going up and down. It's at 332 GB at the moment, 6 objects of about 64 GB each. ETA is 8 days....

Policy is FTT=1, Stripe=3  (both nodes have 1 diskgroup with 3 HDD's).

As it's bouncing up and down (normally, the "bytes left to resync" steadily declines), I suspect it has no idea what to do: it moves some data somewhere and then discovers that was a bad idea. I don't know what that cluster has been smoking, but I want some too.

So I'm trying to stop it via RVC. When I run "vsan.proactive_rebalance --stop 0", the command barfs with this error on both nodes: "RbVmomi::Fault: SystemError: A general system error occurred: Runtime fault"

Are there any other ways to stop the re-balancing? Or do I need to shut down the entire cluster to bring both nodes back to normal?
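
For the record, this is roughly what the attempt looks like from within RVC (the "0" is simply how I referenced the cluster from my current directory):

> vsan.proactive_rebalance --stop 0
RbVmomi::Fault: SystemError: A general system error occurred: Runtime fault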

I also don't understand how a 2-node cluster, which mirrors anything and everything, can have around 1 TB more data on one node than the other in the first place...

0 Kudos
7 Replies
srodenburg
Expert

Update: after 3 days of moving data about, it finally stopped re-balancing. The number of resyncing components is 0 and there are 0 bytes left to move. It is done. Finally.

So I ran a health-check again and...

...Imbalance is at 34% and there are 1039.34 GB to move. On one HDD on node 1. The same % and amount of data as 3 days ago. WTF has it been doing the past 3 days? This is nuts !!

It is a 4.6 TB VMDK, so it gets chopped up into many 256 GB chunks.

I went to take a look under cluster -> monitor -> vsan -> virtual objects and selected the vmdk's. I see that the applied policy is the one with stripe=3, so that's correct, but in reality, per node, most components lie on the same capacity-disk, with a few others being on the other two HDD's. So it is indeed stripe=3 but completely crooked (70% of the 256 GB chunks on one HDD, 15% on the second HDD and the remaining 15% on the third HDD of the disk-group).

The HDD's are 6 TB in size (SAS Nearline).

I noticed that smaller VMDK's are evenly divided over all 3 HDD's per node. It's only the really big ones that are unevenly divided over the 3 HDD's.

As the datastore is 61% full, I have a feeling that vSAN does not know how to re-balance this 4.6 TB VMDK. Maybe it does not have enough free space to really spread the chunks out evenly over all 3 HDD's, and after 3 days of re-balancing, this is the best it could do under the circumstances?

0 Kudos
TheBobkin
Champion

Hello Steven,

"WTF has it been doing the past 3 days?"

You may have answered this already :smileygrin:

"I don't what that cluster has been smoking but I want some too."

If rvc ever "barfs", try restarting it and/or logging in as another 'user' (127.0.0.1 != localhost, so you can use this), and as a last resort, restart the vC services or VM.
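
Something along these lines, assuming the vCenter is the appliance (VCSA) and the default SSO administrator account is in use:

# open a second RVC session against the loopback IP instead of 'localhost'
rvc administrator@vsphere.local@127.0.0.1

# last resort on a VCSA: bounce the vCenter service from the appliance shell
service-control --stop vmware-vpxd
service-control --start vmware-vpxd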

What build version are the hosts and vCenter?

What does the vsan.disks_stats output look like?

What is the disk layout and capacity per node?

"I see that the applied policy is the one with stripe=3, so that's correct, but in reality, it actually is stripe=1 because in the preferred fault domain, all raid-0 components are on the same disk on Node A and in the secondary domain, all the components there also lie on one disk only."

Is this for all VMs or just one Object or VM and do they show as noncompliant with their Storage Policy?

What kind of sizes are we talking for said Objects?

Bob

0 Kudos
srodenburg
Expert

"but in reality, it actually is stripe=1"

Sorry I got that wrong. I was looking at the cache disk. I corrected my mistake in my post above.

vCenter is 6.5 U1d and ESXi is 6.5 U1 current patch-level.

"What does the vsan.disks_stats output look like?

What is the disk layout and capacity per node?"

See below (I anonymised the host and domain names):
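
For reference, the output below came from the cluster-level disks_stats command in RVC, roughly like this (the path is a placeholder since I anonymised the names):

> vsan.disks_stats /localhost/<datacenter>/computers/<cluster>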

0 Kudos
srodenburg
Expert

2018-03-06 23:19:55 +0100: Fetching vSAN disk info from esx01.domain.local (may take a moment) ...
2018-03-06 23:19:55 +0100: Fetching vSAN disk info from esx02.domain.local (may take a moment) ...
2018-03-06 23:19:55 +0100: Fetching vSAN disk info from vsanwitness02.domain.local (may take a moment) ...
2018-03-06 23:19:56 +0100: Done fetching vSAN disk infos

+----------------------+----------------------------+-------+------+------------+---------+----------+----------+----------+----------+----------+---------+----------+---------+
|                      |                            |       | Num  | Capacity   |         |          | Physical | Physical | Physical | Logical  | Logical | Logical  | Status  |
| DisplayName          | Host                       | isSSD | Comp | Total      | Used    | Reserved | Capacity | Used     | Reserved | Capacity | Used    | Reserved | Health  |
+----------------------+----------------------------+-------+------+------------+---------+----------+----------+----------+----------+----------+---------+----------+---------+
| naa.5000cca0131fb2a8 | esx01.domain.local         | SSD   | 0    | 186.31 GB  | 27.66 % | 27.66 %  | N/A      | N/A      | N/A      | N/A      | N/A     | N/A      | OK (v5) |
| naa.5000c50083e29e83 | esx01.domain.local         | MD    | 24   | 5589.02 GB | 27.43 % | 0.23 %   | N/A      | N/A      | N/A      | N/A      | N/A     | N/A      | OK (v5) |
| naa.5000c50084163f3b | esx01.domain.local         | MD    | 24   | 5589.02 GB | 39.66 % | 0.23 %   | N/A      | N/A      | N/A      | N/A      | N/A     | N/A      | OK (v5) |
| naa.5000c500850935ff | esx01.domain.local         | MD    | 31   | 5589.02 GB | 61.53 % | 0.23 %   | N/A      | N/A      | N/A      | N/A      | N/A     | N/A      | OK (v5) |
+----------------------+----------------------------+-------+------+------------+---------+----------+----------+----------+----------+----------+---------+----------+---------+
| naa.5000cca0131fcdf8 | esx02.domain.local         | SSD   | 0    | 186.31 GB  | 27.66 % | 27.66 %  | N/A      | N/A      | N/A      | N/A      | N/A     | N/A      | OK (v5) |
| naa.5000c50083e6d93b | esx02.domain.local         | MD    | 24   | 5589.02 GB | 36.65 % | 0.23 %   | N/A      | N/A      | N/A      | N/A      | N/A     | N/A      | OK (v5) |
| naa.5000c50085093623 | esx02.domain.local         | MD    | 27   | 5589.02 GB | 55.22 % | 0.23 %   | N/A      | N/A      | N/A      | N/A      | N/A     | N/A      | OK (v5) |
| naa.5000c50084ea856b | esx02.domain.local         | MD    | 28   | 5589.02 GB | 36.77 % | 0.23 %   | N/A      | N/A      | N/A      | N/A      | N/A     | N/A      | OK (v5) |
+----------------------+----------------------------+-------+------+------------+---------+----------+----------+----------+----------+----------+---------+----------+---------+
| mpx.vmhba1:C0:T2:L0  | vsanwitness02.domain.local | SSD   | 0    | 10.00 GB   | 0.00 %  | 0.00 %   | N/A      | N/A      | N/A      | N/A      | N/A     | N/A      | OK (v5) |
| mpx.vmhba1:C0:T1:L0  | vsanwitness02.domain.local | MD    | 29   | 14.99 GB   | 2.76 %  | 1.51 %   | N/A      | N/A      | N/A      | N/A      | N/A     | N/A      | OK (v5) |
+----------------------+----------------------------+-------+------+------------+---------+----------+----------+----------+----------+----------+---------+----------+---------+

0 Kudos
srodenburg
Expert

I made a screenshot of the "problem" VM. Its disks 1 and 2 are relatively small and are evenly divided over all 3 HDD's per node, so that's as it should be.

This screenshot shows disk 3 (which is 4.6 TB in size), and if you look closely you will see all the components (256 GB chunks): most are on the same HDD and only 2 components sit on the other disks. It was like this before the re-balancing started, and after 3 days of re-balancing, this is still the result.

Only the preferred FD is shown, but it's the same story in the other FD (the other node).

imbalance.png

0 Kudos
TheBobkin
Champion

Hello srodenburg​,

200GB:18TB cache:capacity :smileyshocked: ?

Is this Hybrid? I am assuming it is by the capacity-tier drive size and I really hope you tried your best to advise against this design choice :smileygrin:

Have you tried re-applying the Storage Policy for this Object?

Proactive Rebalance is least least least priority for storage resources and potentially this thing is too busy keeping up with everything else to do this (maybe given the fact that it was born with a gimpy leg!) - can you try setting this going via RVC with a reasonable GB per hr move to see if this will actually shift these components?

Maybe consider running this during off-peak hours and stopping it if the cluster cannot handle the additional workload (use RVC, and don't reference the cluster with '0' like before; indicate it by path or with '.' when in the cluster directory).

> vsan.proactive_rebalance -r <AsManyMB/HrAsYouThinkThisCanHandleWithoutFallingDown> -s ~cluster

http://www.virten.net/2017/05/vsan-6-6-rvc-guide-part-2-cluster-administration/#vsan-proactive_rebal...

You can even just set this to run at ~200GB/hr for an hour or two to verify that it does shift some components/data.
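
To make that concrete, something along these lines (the rate is in MB/hr, so ~200GB/hr works out to roughly 204800; ~cluster is an RVC mark pointing at your cluster, adjust both to taste):

> vsan.proactive_rebalance -r 204800 -s ~cluster    (start with a ~200GB/hr rate threshold)
> vsan.resync_dashboard ~cluster                    (check whether components are actually being moved)
> vsan.proactive_rebalance --stop ~cluster          (stop it again if the cluster struggles)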

I tried reproducing this a couple of times (HOL only, as I'm currently treating my home-lab as stable-only) and yes, the vC-initiated rebalance process appears to be super-zombie level of refusing to die, but that kind of stands to reason as it is not just a cluster-level host-to-host process really - one of the times I only managed to stop it by disconnecting the hosts (stopping hostd) and restarting the vC.

Relatively large components are no problem if they are the first thing written to the cluster, but if there is other data on there, you can see the problem it likely had: trying to move other stuff around (while keeping everything under 80% used and within 30% disparity!) so that it could relocate those components, and then giving up on that idea due to having to serve IOs for actual workloads etc.

Bob

0 Kudos
srodenburg
Expert

Hi Bob,

"Is this Hybrid?"

Yes. 200 GB SAS SSD with 3x 6TB Nearline SAS HDD

"I really hope you tried your best to advise against this design choice"

Yes. But I can understand their thinking back then. This cluster serves as an archive and there are two such larger VM's on it, to which data is written very slowly all day long. It trickles in, so to speak. The write-cache on the SSD's is very much under-utilized.

Every now and then, the data is read sequentially at a high rate, which explains the stripe=3, as it speeds up the reading. The cluster works very well for them actually (their application-dev person is happy with it).

It's just this weird imbalance of one of those two larger VM's that's a bit of a stain on the carpet. It was only noticed recently.

"Have you tried re-applying the Storage Policy for this Object?"

Yeah, has no effect.

I've decided to hook up a NAS on site, storage-vMotion the large VM with the "eternal imbalance" off the vSAN datastore, make sure everything is clean and that all disks on both nodes are filled to the same degree (all other VM's look good in that respect), and then storage-vMotion the VM back in again. As all HDD's on both nodes will have the same starting point (I want to see it with my own eyes before I press the button), it should then split the ingress data evenly over all capacity disks. That should put this curious case to rest 🙂
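
Before I vMotion it back in, I'll double-check the per-disk fill levels and make sure nothing is still resyncing, with something like this from within the cluster directory in RVC:

> vsan.disks_stats .
> vsan.resync_dashboard .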

0 Kudos