alienjoker
Enthusiast

vSAN Upgrade from 5.5 to 6.0 - Timeout to complete the operation

Hi all,

I've been running a 5.5 vSAN setup identical to William Lam's at Virtual Ghetto (A killer custom Apple Mac Mini setup running VSAN | virtuallyGhetto) for quite a while, and the only reason I hadn't upgraded further was that I wanted to maintain a legacy Horizon View 5.x configuration, which needs vCenter 5.5. I've now decided this needn't dictate my entire environment, so after a successful move to 6.0 on both the VCSA and the hosts, I figured I'd upgrade to vSAN 6 to take advantage of the improvements it brings, not to mention remove the nagging alerts about the vSAN upgrade being available.

Having run through the necessary prerequisites, which included clearing down a number of inaccessible vswp files, vSAN reported a clean bill of health, so I commenced with the online upgrade, making sure to include --allow-reduced-redundancy as I have a 3-node cluster.
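
For reference, this is roughly what I ran from RVC (the exact command name seems to depend on the RVC build; mine exposed it as vsan.ondisk_upgrade, though I've also seen it listed as vsan.v2_ondisk_upgrade on 6.0 GA, and the cluster path below is just a placeholder for my own inventory):

vsan.ondisk_upgrade /localhost/Datacenter/computers/VSAN-Cluster --allow-reduced-redundancy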

Upon execution, and after about 20 minutes of nail biting (accepting that, as it's only hosting lab machines, their fate could be determined by the success or failure of the upgrade), I was presented with the following within RVC:

:Failed to remove this disk group from VSAN

:A general system error occurred: Failed to evacuate data for disk uuid 52da00fa-c8f2-fdb8-924c-3007d480ac4d with error: Timeout to complete the operation

:Failed to remove disk group from VSAN, aborting.

:Upgrade tool stopped due to error, please address reported issue and re-run the tool again to finish upgrade
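
The only extra digging I've thought of so far is to cross-reference the disk UUID from the error against the objects sitting on it. From what I understand of RVC, that would look something like the following (the cluster path is a placeholder for my inventory, and I'm going from the command reference rather than certainty here):

vsan.check_state /localhost/Datacenter/computers/VSAN-Cluster   # check for inaccessible or out-of-sync objects before retrying
vsan.disk_object_info /localhost/Datacenter/computers/VSAN-Cluster 52da00fa-c8f2-fdb8-924c-3007d480ac4d   # list the objects with components on the disk that could not be evacuated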

Does anyone have any pointers on how to address the timeout, or what it could allude to? I'd rather understand the root cause than blindly blow away the entire configuration and start again.

Thanks for your help!

All the best


Andrew

3 Replies
depping
Leadership

Do you have sufficient resources in the cluster to go through the motions?
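
For example, something along these lines in RVC should show whether the remaining hosts have enough free capacity to take a full disk group evacuation (the cluster path is a placeholder):

vsan.whatif_host_failures /localhost/Datacenter/computers/VSAN-Cluster   # simulates losing a host and reports whether the free capacity could absorb it
vsan.disks_stats /localhost/Datacenter/computers/VSAN-Cluster   # per-disk used and reserved capacity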

alienjoker
Enthusiast

Having read through some of Cormac's documentation, I tried running vsan.disks_stats and received the following response:

2015-11-26 20:18:22 +0000: Failed to gather from 192.168.1.52: NoMethodError: undefined method `name' for nil:NilClass
2015-11-26 20:18:22 +0000: Failed to gather from 192.168.1.50: NoMethodError: undefined method `name' for nil:NilClass
2015-11-26 20:18:23 +0000: Failed to gather from 192.168.1.51: NoMethodError: undefined method `name' for nil:NilClass
2015-11-26 20:18:23 +0000: Done fetching VSAN disk infos
+-------------+------+-------+------+-----------+------+----------+---------+
|             |      |       | Num  | Capacity  |      |          | Status  |
| DisplayName | Host | isSSD | Comp | Total     | Used | Reserved | Health  |
+-------------+------+-------+------+-----------+------+----------+---------+
| N/A         | N/A  | SSD   | 0    | 953.87 GB | 0 %  | 0 %      | OK (v1) |
| N/A         | N/A  | SSD   | 0    | 953.87 GB | 3 %  | 0 %      | OK (v1) |
| N/A         | N/A  | SSD   | 0    | 953.87 GB | 0 %  | 0 %      | OK (v1) |
| N/A         | N/A  | MD    | 44   | 931.25 GB | 49 % | 34 %     | OK (v1) |
| N/A         | N/A  | MD    | 12   | 931.25 GB | 19 % | 19 %     | OK (v1) |
| N/A         | N/A  | MD    | 37   | 931.25 GB | 51 % | 36 %     | OK (v1) |
+-------------+------+-------+------+-----------+------+----------+---------+

Whilst the response returned positive health on all counts, the NoMethodError: undefined method `name' for nil:NilClass is possibly where I'm getting stuck?
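
For completeness, this is roughly how I invoked it (the cluster path is a placeholder for my own inventory):

vsan.disks_stats /localhost/Datacenter/computers/VSAN-Cluster
vsan.cluster_info /localhost/Datacenter/computers/VSAN-Cluster   # to double-check that all three hosts still report as healthy cluster members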

alienjoker
Enthusiast

Hi Duncan,

I managed to work around the problem in the end, in a somewhat bizarre way. I changed the vSAN default storage policy FTT from 0 to 1 and reapplied it, so the data/witness components would be redistributed evenly across the hosts, as required to report compliance. After a few hours, I modified the vSAN default policy back to an FTT of 0 and reapplied it.

Running the RVC command with the --allow_reduced_redundancy parameter still failed, but a quick check on the vSAN showed that the majority of the data consumption was now on hosts 2 and 3, with only a tiny 3 GB of usage reported against the disk group of the first host. At that point I manually dropped the disk group on host 1, choosing not to evacuate the small amount of data that was left behind. After recreating the disk group on the host, it automatically came back in as v2.

I then repeated the process across the remaining two hosts, effectively shuffling the vSAN contents between hosts and dropping what I can only believe was corrupted/invalid data on each host's disk group that couldn't be evacuated by the upgrade process.
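
I did the disk group drop and recreate through the Web Client (choosing not to evacuate the data), but for anyone wanting to do the same from the host, I believe the equivalent per-host sequence with esxcli looks roughly like this (the naa.* device names are placeholders; check esxcli vsan storage list for your own cache SSD and capacity disk, and note that removing via esxcli does not, as far as I'm aware, evacuate any data first):

esxcli vsan storage list   # note the cache SSD and capacity (MD) device names in the disk group
esxcli vsan storage remove -s naa.CACHE_SSD_DEVICE   # drops the whole disk group fronted by that cache SSD
esxcli vsan storage add -s naa.CACHE_SSD_DEVICE -d naa.CAPACITY_MD_DEVICE   # recreates the disk group, which should come back at the new on-disk format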

At the end of it, all my VMs survived the vSAN upgrade. :)

Cheers

Andrew
