VMware Cloud Community
alainrussell
Enthusiast

On-disk upgrade stalled at 5%

We upgraded to 6.2 today following the release of a new driver and firmware for the H730.

Everything went well until the on-disk upgrade. After it was initiated, a host locked up (the same issue we've seen previously with disk resets). This has caused the disk upgrade to stall at 5%, with some hosts at v2 format and some at v2.5.

Is there a way to trigger this to restart?

This is the output of vsan.disks_stats. The upgrade has been at 5% for more than an hour, with no change since the host froze.

disk format.png

vsan.obj_status_report shows all objects are currently healthy.

status.png

Thanks

Alain

CHogan
VMware Employee

Use /mgmt-vc01/mgmt-datacentre/computers> vsan.upgrade_status 0 -r 60


This step can take a long time.


More details here: VSAN 6.2 Part 12 - VSAN 6.1 to 6.2 Upgrade Steps - CormacHogan.com

http://cormachogan.com
alainrussell
Enthusiast

Thanks Cormac,

I've been watching that, and it is still at 5% after about 3 hours. Could that still fall within an acceptable "long time"?

> vsan.upgrade_status 0

2016-04-29 19:40:33 +1200: Upgrade in progress - 5%
2016-04-29 19:40:34 +1200: Updating objects to alignment
2016-04-29 19:40:34 +1200: 260 objects in which need realignment process
2016-04-29 19:40:34 +1200: 0 objects with new alignment
2016-04-29 19:40:34 +1200: 0 objects ready for v3 features
2016-04-29 19:40:34 +1200: Upgrade invovles resyncing objects at times, showing current resync progress
2016-04-29 19:40:35 +1200: Querying all VMs on VSAN ...
2016-04-29 19:40:35 +1200: Querying all objects in the system from esx04.......
2016-04-29 19:40:35 +1200: Got all the info, computing table ...
+-----------+-----------------+---------------+
| VM/Object | Syncing objects | Bytes to sync |
+-----------+-----------------+---------------+
+-----------+-----------------+---------------+
| Total     | 0               | 0.00 GB       |
+-----------+-----------------+---------------+
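
For anyone who wants to track these counters over time rather than eyeball them, here is a minimal Python sketch that pulls the percentage and the three object counts out of captured vsan.upgrade_status text. It assumes the output is worded exactly as in the lines above. The function names and the "nothing moved" heuristic are my own illustration and are not part of RVC.

```python
import re

# Sample lines copied from the vsan.upgrade_status output in this post.
SAMPLE = """\
2016-04-29 19:40:33 +1200: Upgrade in progress - 5%
2016-04-29 19:40:34 +1200: 260 objects in which need realignment process
2016-04-29 19:40:34 +1200: 0 objects with new alignment
2016-04-29 19:40:34 +1200: 0 objects ready for v3 features
"""

# Patterns mirror the wording seen above; other builds may word this differently.
PATTERNS = {
    "percent": r"Upgrade in progress - (\d+)%",
    "need_realignment": r"(\d+) objects in which need realignment",
    "new_alignment": r"(\d+) objects with new alignment",
    "ready_for_v3": r"(\d+) objects ready for v3 features",
}


def parse_upgrade_status(text):
    """Return each counter as an int, or None if its line is absent."""
    result = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        result[key] = int(match.group(1)) if match else None
    return result


def looks_stalled(previous, current):
    """Hypothetical heuristic: no counter changed between two snapshots."""
    return previous == current


status = parse_upgrade_status(SAMPLE)
print(status)
```

Comparing two snapshots taken some time apart with looks_stalled() only shows that nothing moved. It does not say why, and this step can legitimately sit still for a long time.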

CHogan
VMware Employee

I would have expected at least some objects with new alignment after this time - I can't remember how long we waited during our tests, but it did take a while.

http://cormachogan.com
alainrussell
Enthusiast

I'm guessing the objects won't re-align until all disks are on v2.5 format (which we don't have).

In total we have 5 hosts, esx01 to esx05. Currently esx01 and esx02 still show v2 disk format, while esx03, esx04 and esx05 show v2.5.

Looking at the logs, I can see that esx01 and esx02 are constantly logging errors to /var/log/syslog (unsure if this is related):

2016-04-29T08:27:42Z watchdog-vsanperfsvc: Executing 'python ++group=host/vim/vmvisor/vsanperfsvc /usr/lib/vmware/vsan/perfsvc/vsanperfsvc.pyc'
2016-04-29T08:27:42Z watchdog-vsanperfsvc: Executing 'python ++group=host/vim/vmvisor/vsanperfsvc /usr/lib/vmware/vsan/perfsvc/vsanperfsvc.pyc'
2016-04-29T08:27:44Z watchdog-vsanperfsvc: 'python ++group=host/vim/vmvisor/vsanperfsvc /usr/lib/vmware/vsan/perfsvc/vsanperfsvc.pyc' exited after 2 seconds (quick failure 4)
2016-04-29T08:27:44Z watchdog-vsanperfsvc: 'python ++group=host/vim/vmvisor/vsanperfsvc /usr/lib/vmware/vsan/perfsvc/vsanperfsvc.pyc' exited after 2 seconds (quick failure 1)
2016-04-29T08:27:44Z watchdog-vsanperfsvc: Executing 'python ++group=host/vim/vmvisor/vsanperfsvc /usr/lib/vmware/vsan/perfsvc/vsanperfsvc.pyc'
2016-04-29T08:27:44Z watchdog-vsanperfsvc: Executing 'python ++group=host/vim/vmvisor/vsanperfsvc /usr/lib/vmware/vsan/perfsvc/vsanperfsvc.pyc'
2016-04-29T08:27:45Z watchdog-vsanperfsvc: 'python ++group=host/vim/vmvisor/vsanperfsvc /usr/lib/vmware/vsan/perfsvc/vsanperfsvc.pyc' exited after 1 seconds (quick failure 2)
2016-04-29T08:27:45Z watchdog-vsanperfsvc: Executing 'python ++group=host/vim/vmvisor/vsanperfsvc /usr/lib/vmware/vsan/perfsvc/vsanperfsvc.pyc'
2016-04-29T08:27:45Z watchdog-vsanperfsvc: 'python ++group=host/vim/vmvisor/vsanperfsvc /usr/lib/vmware/vsan/perfsvc/vsanperfsvc.pyc' exited after 1 seconds (quick failure 5)
2016-04-29T08:27:45Z watchdog-vsanperfsvc: Executing 'python ++group=host/vim/vmvisor/vsanperfsvc /usr/lib/vmware/vsan/perfsvc/vsanperfsvc.pyc'
2016-04-29T08:27:46Z watchdog-vsanperfsvc: 'python ++group=host/vim/vmvisor/vsanperfsvc /usr/lib/vmware/vsan/perfsvc/vsanperfsvc.pyc' exited after 1 seconds (quick failure 6)
2016-04-29T08:27:46Z watchdog-vsanperfsvc: 'python ++group=host/vim/vmvisor/vsanperfsvc /usr/lib/vmware/vsan/perfsvc/vsanperfsvc.pyc' exited after 1 seconds (quick failure 3)
2016-04-29T08:27:46Z watchdog-vsanperfsvc: 'python ++group=host/vim/vmvisor/vsanperfsvc /usr/lib/vmware/vsan/perfsvc/vsanperfsvc.pyc' respawning too fast, sleeping for 5 seconds
2016-04-29T08:27:46Z watchdog-vsanperfsvc: Executing 'python ++group=host/vim/vmvisor/vsanperfsvc /usr/lib/vmware/vsan/perfsvc/vsanperfsvc.pyc'
2016-04-29T08:27:47Z watchdog-vsanperfsvc: 'python ++group=host/vim/vmvisor/vsanperfsvc /usr/lib/vmware/vsan/perfsvc/vsanperfsvc.pyc' exited after 1 seconds (quick failure 4)
2016-04-29T08:27:47Z watchdog-vsanperfsvc: Executing 'python ++group=host/vim/vmvisor/vsanperfsvc /usr/lib/vmware/vsan/perfsvc/vsanperfsvc.pyc'

Might be time to talk to support?
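
To put a number on how often the service is crash-looping before handing the logs to support, something like the Python sketch below could tally the watchdog lines from a copy of the syslog. It assumes the messages look exactly like the ones pasted above. The three category labels are my own and do not come from ESXi.

```python
import re
from collections import Counter

# Three representative lines copied from the syslog excerpt in this post.
SAMPLE = """\
2016-04-29T08:27:44Z watchdog-vsanperfsvc: Executing 'python ++group=host/vim/vmvisor/vsanperfsvc /usr/lib/vmware/vsan/perfsvc/vsanperfsvc.pyc'
2016-04-29T08:27:44Z watchdog-vsanperfsvc: 'python ++group=host/vim/vmvisor/vsanperfsvc /usr/lib/vmware/vsan/perfsvc/vsanperfsvc.pyc' exited after 2 seconds (quick failure 4)
2016-04-29T08:27:46Z watchdog-vsanperfsvc: 'python ++group=host/vim/vmvisor/vsanperfsvc /usr/lib/vmware/vsan/perfsvc/vsanperfsvc.pyc' respawning too fast, sleeping for 5 seconds
"""

QUICK_FAILURE = re.compile(r"exited after \d+ seconds \(quick failure \d+\)")


def summarize_watchdog(text, service="watchdog-vsanperfsvc"):
    """Count start attempts, quick failures and respawn throttles for one watchdog."""
    counts = Counter()
    for line in text.splitlines():
        if service + ":" not in line:
            continue  # ignore lines from other services
        if "respawning too fast" in line:
            counts["respawn_throttled"] += 1
        elif QUICK_FAILURE.search(line):
            counts["quick_failure"] += 1
        elif ": Executing " in line:
            counts["executing"] += 1
    return counts


summary = summarize_watchdog(SAMPLE)
print(dict(summary))
```

Running it over the syslog from a host on v2 and a host on v2.5 would show whether the crash loop is limited to the hosts that have not been converted.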

zdickinson
Expert

Good morning, looks like a support call to me.  Good luck!  Thank you, Zach.

alainrussell
Enthusiast

Still at 5% this morning, so I've opened a case with Dell (they provide all our support).

Dell have escalated, and the VMware support case is #16976339504.

NickBowie
Enthusiast

Hi alainrussell, any update on your issue? I'm experiencing the same with another customer now. I had to pause the session with the engineer and will resume tomorrow, but the current suspicion is that we need to realign items that the python script isn't picking up.

We also noticed that every time we initiated the VSAN on-disk upgrade from RVC, another disk group would move to v2.5 before the task failed.

-->

vsan.resync_dashboard .
2016-05-31 05:58:41 +0000: Querying all VMs on VSAN ...
2016-05-31 05:58:41 +0000: Querying all objects in the system from aklesx11.localdom.co.nz ...
2016-05-31 05:58:42 +0000: Got all the info, computing table ...
+-----------+-----------------+---------------+
| VM/Object | Syncing objects | Bytes to sync |
+-----------+-----------------+---------------+
+-----------+-----------------+---------------+
| Total     | 0               | 0.00 GB       |
+-----------+-----------------+---------------+

/aklvvc32.localdom.co.nz/AKL/computers/AKL-CL03> vsan.ondisk_upgrade .
+-------------------------+-----------+-------------+----------------+----------------+------------------+----------------+
| Host                    | State     | ESX version | v1 Disk-Groups | v2 Disk-Groups | v2.5 Disk-Groups | v3 Disk-Groups |
+-------------------------+-----------+-------------+----------------+----------------+------------------+----------------+
| aklesx11.localdom.co.nz | connected | 6.0.0       | 0              | 4              | 0                | 0              |
| aklesx13.localdom.co.nz | connected | 6.0.0       | 0              | 4              | 0                | 0              |
| aklesx14.localdom.co.nz | connected | 6.0.0       | 0              | 4              | 0                | 0              |
| aklesx15.localdom.co.nz | connected | 6.0.0       | 0              | 4              | 0                | 0              |
| aklesx12.localdom.co.nz | connected | 6.0.0       | 0              | 4              | 0                | 0              |
| aklesx16.localdom.co.nz | connected | 6.0.0       | 0              | 1              | 3                | 0              |
| aklesx18.localdom.co.nz | connected | 6.0.0       | 0              | 4              | 0                | 0              |
| aklesx17.localdom.co.nz | connected | 6.0.0       | 0              | 4              | 0                | 0              |
+-------------------------+-----------+-------------+----------------+----------------+------------------+----------------+
2016-05-31 06:20:33 +0000: Running precondition checks ...
2016-05-31 06:20:35 +0000: Passed precondition checks
2016-05-31 06:20:35 +0000:
2016-05-31 06:20:35 +0000: Target file system version: v3
2016-05-31 06:20:35 +0000: Disk mapping decommission mode: evacuateAllData
2016-05-31 06:20:41 +0000: Upgrade tool stopped due to error, please address reported issue and re-run the tool again to finish upgrade.

/aklvvc32.localdom.co.nz/AKL/computers/AKL-CL03> vsan.ondisk_upgrade .
+-------------------------+-----------+-------------+----------------+----------------+------------------+----------------+
| Host                    | State     | ESX version | v1 Disk-Groups | v2 Disk-Groups | v2.5 Disk-Groups | v3 Disk-Groups |
+-------------------------+-----------+-------------+----------------+----------------+------------------+----------------+
| aklesx11.localdom.co.nz | connected | 6.0.0       | 0              | 3              | 1                | 0              | < increment
| aklesx13.localdom.co.nz | connected | 6.0.0       | 0              | 4              | 0                | 0              |
| aklesx14.localdom.co.nz | connected | 6.0.0       | 0              | 4              | 0                | 0              |
| aklesx15.localdom.co.nz | connected | 6.0.0       | 0              | 4              | 0                | 0              |
| aklesx12.localdom.co.nz | connected | 6.0.0       | 0              | 4              | 0                | 0              |
| aklesx16.localdom.co.nz | connected | 6.0.0       | 0              | 0              | 4                | 0              | < and here
| aklesx19.localdom.co.nz | connected | 6.0.0       | 0              | 0              | 0                | 0              |
| aklesx18.localdom.co.nz | connected | 6.0.0       | 0              | 4              | 0                | 0              |
| aklesx17.localdom.co.nz | connected | 6.0.0       | 0              | 4              | 0                | 0              |
+-------------------------+-----------+-------------+----------------+----------------+------------------+----------------+
2016-05-31 06:36:57 +0000: Running precondition checks ...
2016-05-31 06:36:59 +0000: Passed precondition checks
2016-05-31 06:36:59 +0000:
2016-05-31 06:36:59 +0000: Target file system version: v3
2016-05-31 06:36:59 +0000: Disk mapping decommission mode: evacuateAllData
2016-05-31 06:37:05 +0000: Upgrade tool stopped due to error, please address reported issue and re-run the tool again to finish upgrade.

<-----

These are also Dell R730xd servers, with current HCL-compliant drivers and firmware for all components. Our SR# is 16126628305.
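
To make the per-run change easier to spot than by marking rows by hand, here is a rough Python sketch that diffs two captured vsan.ondisk_upgrade tables. It assumes the seven-column layout shown above. The helper names are hypothetical and the sample rows are a subset copied from the two runs.

```python
# A few rows copied from the first and second vsan.ondisk_upgrade runs above.
BEFORE = """\
| aklesx11.localdom.co.nz | connected | 6.0.0       | 0              | 4              | 0                | 0              |
| aklesx13.localdom.co.nz | connected | 6.0.0       | 0              | 4              | 0                | 0              |
| aklesx16.localdom.co.nz | connected | 6.0.0       | 0              | 1              | 3                | 0              |
"""
AFTER = """\
| aklesx11.localdom.co.nz | connected | 6.0.0       | 0              | 3              | 1                | 0              | < increment
| aklesx13.localdom.co.nz | connected | 6.0.0       | 0              | 4              | 0                | 0              |
| aklesx16.localdom.co.nz | connected | 6.0.0       | 0              | 0              | 4                | 0              | < and here
"""


def parse_disk_groups(table_text):
    """Map host -> (v1, v2, v2.5, v3) disk-group counts from the table rows."""
    hosts = {}
    for line in table_text.splitlines():
        if not line.startswith("|"):
            continue  # skip borders and log lines
        # Cut at the last '|' so hand-written notes after the row are ignored.
        cells = [c.strip() for c in line[: line.rfind("|")].strip("|").split("|")]
        if len(cells) != 7 or not cells[3].isdigit():
            continue  # skip the header row or anything unexpected
        hosts[cells[0]] = tuple(int(c) for c in cells[3:7])
    return hosts


def changed_hosts(before, after):
    """Return hosts whose counts differ between runs (new hosts are included)."""
    return {h: (before.get(h), after[h]) for h in after if before.get(h) != after[h]}


diff = changed_hosts(parse_disk_groups(BEFORE), parse_disk_groups(AFTER))
for host, (old, new) in sorted(diff.items()):
    print(host, old, "->", new)
```

A host that only appears in the second table, such as aklesx19 in the full output, is reported with None as its previous value.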

alainrussell
Enthusiast

Hi Nick, we ended up working with support and stopping our upgrade; it had been "stuck" at 5% for a week or so without progressing. Once it was stopped, I needed to fix CBT issues with the python script. Luckily, we didn't have any other alignment issues.

As soon as these were sorted, the upgrade progressed past 5%, albeit slowly. Overall it was a 48+ hour process for us to move to v3, but all good now.

NickBowie
Enthusiast

Hi Alain,

Interesting. Did you run the "python vsanrealign.py fixcbt" command? That hasn't resulted in any success in my case.

alainrussell
Enthusiast

Yes, that fixed all of our issues. From memory, I think it was every single VM; it was a long list.

NickBowie
Enthusiast

OK, we must have a different issue then. Mine just faults out at 5% (it doesn't get "stuck") with a non-descriptive "general VSAN error".

Cheers.

NickBowie
Enthusiast

My issue is resolved - see this thread for info: Re: SAN 6.2 on disk upgrade fails at 5%
