VMware Cloud Community
dgreebe
Contributor

vsan resync stuck

Hi All,

I'm doing a migration from vSAN 6.1 to 6.2. On 2 hosts it all went fine, but the 3rd is giving me trouble.

We are doing a full data migration, so a resync is started, and it has been running for about 10 hours. I'm keeping an eye on the resync and noticed that 2 vmdks (on 2 separate VMs) are having trouble migrating.

The total GB to migrate goes from 15 GB to 1 GB, back up to 15 GB or 17 GB, and keeps shifting up and down like that.

When I take a closer look at what is going on, I see that those objects are on separate disk groups, and vsan.disk_object_info shows that a vmdk has the state "RECONFIGURING".

dataToSync is currently at 5 GB of a total of 7.6 GB.
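
For reference, this is roughly how I'm checking the object from RVC (the UUIDs below are just placeholders for my actual disk and object UUIDs):

    vsan.disk_object_info ~cluster <disk-uuid>
    vsan.object_info ~cluster <object-uuid>

Both list the components of the object with their state (ACTIVE / RECONFIGURING) and the dataToSync figures.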

After waiting for over an hour, I cancelled maintenance mode and put the host back into maintenance, but this time with the default maintenance option.

The host went into maintenance successfully and I performed my upgrade. The host has also rebooted and is currently up and running, but my resync is still "working".

I now want to know what I can do about this. I do have the option to run vsan.exit_evacuation, but is that the best way, or are there other ways to stop this with minimal risk of losing data?
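
In the meantime, this is how I'm watching the resync from RVC (assuming the cluster object is marked as ~cluster in the RVC session):

    vsan.resync_dashboard ~cluster

That lists the objects still syncing and the GB left for each one, which is where I see the numbers shifting up and down.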

Thanks in advance,

Dave Greebe

4 Replies
elerium
Hot Shot

Waiting an hour for the status to change is nothing for this migration; I probably would have waited longer to see if that pattern persisted over a longer period. My 6.1 to 6.2 migration took 6 or 7 days across 4 hosts to fully migrate/resync. My first host took a lot less time than the remaining ones (I was using Ensure accessibility rather than full data migration). Also, the migration step has 2 phases.

Resync after hosts boot up is pretty normal too. Another possibility is that you have capacity disks over 80% utilized; in that case the cluster will rebalance to even out the data distribution.

At this point, you probably need to restart the migration step, but I wouldn't do so until you see your current resync operation stop.

dgreebe
Contributor

Hi Elerium,

Updating each host also took me more than 10 hours with full data migration (each host has 10 TB of HDDs), but that is not the issue.

I was doing my update on the 3rd host, and after 10 hours, 2 vmdks kept looping. The data left kept shifting from 0 GB to 15 GB to 5 GB and then 12 GB. I watched that for almost 2 more hours, but nothing more happened.

I checked my disk stats, but no rebalancing was in progress. It wasn't even necessary, because my capacity disks are only about 70% full, and automatic rebalancing only kicks in at 80% or more.
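
For completeness, this is the check I did (again assuming ~cluster points at my cluster in RVC):

    vsan.disks_stats ~cluster

The used-capacity column showed all capacity disks at around 70%, so no rebalancing should be triggered.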

The resync I'm referring to is the resync of the full data migration.

After waiting several more hours, I decided to stop the maintenance and selected Ensure accessibility instead. My host went into maintenance mode without any issues and I performed my update.

Update went fine and my host is running smoothly.

The day after, I put my 4th host into maintenance and a full data migration started. After 12 hours, the resync was still stuck on the same vmdks as the day before. I waited 2 more hours, but nothing more happened. Again I decided to stop the maintenance and selected Ensure accessibility, but that went less smoothly: maintenance mode was stuck at 70% for an hour, an hour later it was at 72%, and another hour later it was still at 72%.

I checked which objects were the issue, and again it was those same vmdks.

Then I made the drastic decision to put the host into maintenance without data migration. My cluster has FTT=1, so I was 100% sure my VMs wouldn't be killed, because the replica of each vmdk was on another host that was already up to date.

Now the host went into maintenance and I could perform my last update.

That went smoothly, and now my cluster is running 6.2.

When I now check the 2 VMs whose vmdk objects were "stuck", the VMs show 4 objects: 2 active, 2 reconfiguring.
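
For reference, this is how I'm looking at that (the VM path below is just a placeholder for the actual path in my RVC session):

    vsan.vm_object_info /localhost/<datacenter>/vms/<vm-name>

That lists every object of the VM with its component layout and state, and 2 of the 4 objects keep showing RECONFIGURING.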

How can I solve that? The reconfiguring has been running for more than 12 hours, and I know for sure that those VMs are not heavily used.

admin
Immortal

It looks like you have inconsistent objects in your vSAN. I would suggest opening a support ticket with the VMware Global Support team. They will most probably check it through the CLI and recreate the problematic objects.

Cheers!

-Shivam

piotrowski89p
Enthusiast

Hi,

Did you try the command vsan.check_state ~cluster --refresh-state ?

This scans all objects, and for me, in a similar situation, it moved the resynchronization forward.
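
For example (assuming the cluster is marked as ~cluster in your RVC session):

    vsan.check_state ~cluster --refresh-state

You can also run it without --refresh-state first, to only list the inconsistent objects before forcing the refresh.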

BR,

Pawel
