johandijkstra
Enthusiast

vSAN 6.2: rebalancing not working, and failover after a failure also not working

Hi all,

We are experiencing the following in a production environment.
Rebalancing has never worked here, but since it was not the main concern, it sat at the bottom of the to-do list.


Last Friday we had an issue with the Collection Data Provider for Performance Measurement, which we resolved together with VMware.

We also had a short discussion about vSAN health and ran some checks together. Everything seemed OK.

I mentioned that rebalancing was not working. VMware said it should be working fine now and suggested I run the rebalance that evening.
So I did. The following morning I checked vSAN: everything was stable, but the rebalance had not been performed. So I ran it again.

Sunday morning, same story. I started it and monitored it. After 10 to 15 minutes it appeared to be "done" (there was no completion message; the status simply stopped saying "Rebalancing").
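For reference, this is roughly how the rebalance can be started and checked via RVC (a sketch; ~cluster is a placeholder for your cluster path in the RVC namespace):

  # Start proactive rebalance (by default it runs for 24 hours)
  vsan.proactive_rebalance --start ~cluster

  # Verify it is actually running and see what it intends to move
  vsan.proactive_rebalance_info ~cluster

If proactive_rebalance_info reports the rebalance as not running right after you start it, the task never really kicked off.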

So I left it that way. The whole day there were no issues, but at 10:30 PM a cache disk failed (we discovered that this morning).

The issue is that a "failover" (we have failures to tolerate = 1) did not happen: the failed components were never rebuilt elsewhere. VMware Composer still tries to create systems on the failed disk group, resulting in multiple issues: broken VMs, and so on.

My main concern is that a rebalance does not start after hitting the button, and a rebuild does not start after a disk failure.
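For what it's worth, whether a rebuild is actually happening can be checked from RVC (a sketch; ~cluster is a placeholder):

  # Objects currently resyncing/rebuilding and bytes left to sync
  vsan.resync_dashboard ~cluster

  # Per-object health summary across the cluster
  vsan.obj_status_report ~cluster

If the resync dashboard stays empty well after a disk failure (and past the repair delay), no rebuild has been initiated.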


I thought the rebalance might have caused this issue, but rebalancing takes place on the capacity drives, not on the cache flash.

So it was just bad timing.
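Whether a rebalance is needed at all shows up in the per-disk usage; a sketch, again via RVC:

  # Per capacity-disk usage and health; a large spread in the Usage
  # column between disks is what proactive rebalance evens out
  vsan.disks_stats ~cluster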


But it is very interesting that both operations behave the same way: they are initiated, but never actually performed.


At this moment we are troubleshooting with VMware, but has anyone experienced the same thing? Or any suggestions?

4 Replies
mhampto
VMware Employee

Hello,

Please let us know if there is a resolution to this from the support call so others can know the answer.

Pattonville
Contributor

I am experiencing the same issue on my cluster. The rebalance task starts, but there is no resync activity on the dashboard. Trying to put a vSAN host into maintenance mode also fails at the vSAN maintenance step after 15 minutes. All health checks look good on the objects, but it seems stuck. Any ideas or follow-up on how to resolve this would be appreciated.
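For reference, these are the checks that should show whether anything is actually moving (a sketch via RVC; ~cluster is a placeholder):

  # Can the cluster tolerate taking a host down?
  vsan.whatif_host_failures ~cluster

  # Watch for resync traffic while the rebalance or MM task runs
  vsan.resync_dashboard ~cluster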

TheBobkin
Champion

Hello Pattonville

More context might help here:

How many nodes in the cluster?

Stretched-cluster or standard?

What Fault Tolerance Method(s) do you have applied via Storage Policies? (e.g. RAID1, RAID5/6)

Where are you noting that 'rebalance' may be necessary (e.g. via the Health check), and have you checked via RVC using vsan.proactive_rebalance_info? (See the sketch after this list.)

How much space (GB/TB) and what % is free in the vSAN cluster's vsanDatastore?

What option for Maintenance Mode was selected? (e.g. 'No Action', 'Ensure Accessibility' or 'Full Data Migration')

What % did the enter-MM job fail at, and what was the error message? (This can fail at the pre-check at 2%, or when trying to vMotion VMs at 19%, etc.)
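Most of the above can be pulled quickly from RVC; a sketch, with ~cluster as a placeholder for your cluster path:

  vsan.cluster_info ~cluster              # node count, disk groups, network state
  vsan.check_limits ~cluster              # per-host component counts and disk usage
  vsan.proactive_rebalance_info ~cluster  # whether a rebalance is running or needed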

Bob

Pattonville
Contributor

So I checked on the cluster this morning, retried starting a rebalance, and lo and behold, it worked! It is worth rehashing the circumstances in case someone else runs into this, as it was not covered in any vSAN troubleshooting docs I could find.

How many nodes in the cluster?

4

Stretched-cluster or standard?

Standard

What Fault Tolerance Method(s) do you have applied via Storage Policies? (e.g. RAID1, RAID5/6)

Default RAID 1

Where are you noting that 'rebalance' may be necessary (e.g. via the Health check), and have you checked via RVC using vsan.proactive_rebalance_info?

A host's disk group was removed via full migration to replace a 3700 SATA SSD with a 4800X PCIe Optane NVMe SSD, and a 4-port 1G NIC with a 2-port 10G NIC. The disk group was re-added, and a rebalance to repopulate it was needed. There were other hardware upgrades scheduled for other hosts in the cluster, and that's why MM was attempted on a different vSAN member host, hoping an 'Ensure Accessibility' migration might also trigger a rebalance onto the new disk group. It failed instead.
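For anyone repeating this, the re-added disk group can be verified from the host itself; a sketch, assuming shell access on the ESXi host:

  # Lists the vSAN-claimed disks; the new cache and capacity devices
  # should appear with "In CMMDS: true" once the disk group is back
  esxcli vsan storage list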

How much space (GB/TB) and what % is free in the vSAN cluster's vsanDatastore?

What option for Maintenance Mode was selected? (e.g. 'No Action', 'Ensure Accessibility' or 'Full Data Migration')

Ensure Accessibility

What % did the enter-MM job fail at, and what was the error message? (This can fail at the pre-check at 2%, or when trying to vMotion VMs at 19%, etc.)

It failed after vMotioning the VMs, at 34%, on "Entering vSAN maintenance mode". The error, IIRC, was "canceled by user", which I had not done.
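The actual reason is usually visible in the host logs; a sketch of where to look, assuming shell access on the ESXi host:

  # Task history for the maintenance mode attempt
  grep -i "maintenance" /var/log/hostd.log

  # vSAN-level events around the same timestamp
  grep -i "vsan" /var/log/vobd.log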

The one thing that may have happened is that one host had its vSAN NIC port briefly interrupted, for about 5 seconds, during a move from a 1G switch to a 10G switch. This may have triggered the rebuild timer, which was set to 240 minutes instead of the default 60; I have been burned before by host patches triggering rebuilds when a host did not exit MM within an hour, and have since added more time to work on hosts.

Even though no rebuild was required and the vSAN stats showed objects in sync, the timer may have been running in the background, preventing rebalance or MM. I am not sure if you can even check whether the timer is active when no VMs are flagged for rebuild, but that would have been good information, and it would explain why, after a night to settle, the cluster is working again as expected.
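For reference, the repair delay timer mentioned above is the VSAN.ClomRepairDelay advanced setting; a sketch of checking it per host, assuming shell access:

  # Shows the current value (the default is 60 minutes)
  esxcli system settings advanced list -o /VSAN/ClomRepairDelay

As far as I know there is no direct indicator that the timer is counting down; the resync dashboard only shows objects once the delay expires and a rebuild actually starts.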
