VMware Cloud Community
KurtDePauw1
Enthusiast

VSAN Disk Format to 3.0 failed // Failed to realign following Virtual SAN objects // due to being locked or lack of vmdk descriptor file, which requires manual fix.

Hello all,

After upgrading to "6.0 Update 2" and trying to update the disk format to version 3.0, I received the error message below.

Failed to realign following Virtual SAN objects:

be4db256-0ab4-9801-c523-0cc47a3a34ca, c2bcb056-5d5f-0f02-f096-0cc47a3a3320, c0bcb056-70a6-180c-c583-0cc47a3a3320, e413b256-90b5-dd1a-cce6-0cc47a3a34ca, c2bcb056-64aa-ee1e-0bb3-0cc47a3a3320, c0bcb056-4b47-5529-b59d-0cc47a3a3320, 6644b256-e0f8-3338-737d-0cc47a3a34ca, d411b256-08ea-f83b-9774-0cc47a3a34ca, c0bcb056-a489-9541-efbe-0cc47a3a3320, 3f6abc56-28fd-4a48-4f8a-0cc47a3a34ce, d411b256-784e-a45a-714d-0cc47a3a34ca, c1bcb056-6671-c45c-2254-0cc47a3a3320, 3f6abc56-1061-ab60-e4c8-0cc47a3a34ce, f17bb356-fcdf-3469-502a-0cc47a3a3320, cfe8b156-606f-ee6a-4356-0cc47a3a34ce, d411b256-4450-3777-e3bf-0cc47a3a34ca, c1bcb056-b24e-cb7a-7b1a-0cc47a3a3320, f17bb356-f008-9581-2468-0cc47a3a3320, 31bdb056-0a02-4882-9094-0cc47a3a34ca, c1bcb056-2a83-a596-ed2c-0cc47a3a3320, f17bb356-08a1-529f-c375-0cc47a3a3320, c1bcb056-4edc-ffaf-bf04-0cc47a3a3320, c1bcb056-c9e2-fec8-1930-0cc47a3a3320, c1bcb056-4e92-e4e4-9b92-0cc47a3a3320, e413b256-dcd0-f6fd-8ae2-0cc47a3a34ca,

due to being locked or lack of vmdk descriptor file, which requires manual fix.

Does someone have an idea how to fix this?

32 Replies
AlexanderLiucka
Enthusiast

Hi Bill,

Don't worry. As CHogan said, we can expect a fix for this problem in a very short time frame. But maybe CHogan forgot to mention that for him a short time frame means around this New Year, or maybe the next one. :)

But as I said, don't worry, Bill. They are just finalizing the testing of the scripts.

I can't understand how they could announce around 10 February 2016 that they were ready with the new VSAN 6.2 and that it would be available in March 2016.

Can you imagine having a "ready" product, delaying it by a month because of your release cycle, and voilà, the product is still full of bugs? I just can't imagine such a thing, but it happens all the time.

douglasarcidino
Hot Shot

With all due respect, this is why I don't deploy any new product from any vendor in production for a few months. I have only deployed 6.2 in my home lab and office lab environments. I do it that way so I can see what bugs are there and learn how to fix them. The upgrade wasn't smooth in my home lab, but it was bulletproof in my office lab. Early adoption comes with risks, and that's why you have a support contract. The fact that the issue has been identified and they are almost ready with a patch is good news. This is still nowhere near as bad a bug as the EMC VNX2 rebooting both storage processors simultaneously after 60 days of uptime, nowhere near as bad as the early VMFS5 updates that wiped out your datastore, and nowhere near as bad as the vCenter 5.1 update that destroyed your vCenter Server.

If you found this reply helpful, please mark it as the answer. VCP-DCV 4/5/6, VCP-DTM 5/6
Bill_Oyler
Hot Shot

Yes, we have not yet seen VSAN deployed in a "production" environment (thank goodness!). The errors we are seeing are in our VSAN lab, which consists of three servers, all of whose components are on the VMware HCL. I've been running VSAN in our lab since the GA of VSAN 5.5, and it has required a lot of "hand holding" and remediation when things go wrong. I was hoping that Gen4 (VSAN 6.2) would be "production-ready", but I'm not sure yet. Hope this issue gets resolved soon.

Bill Oyler Systems Engineer
CHogan
VMware Employee

The KB articles and the script are now available to resolve this issue. A more permanent fix is in the works.

Details of the issue, including links to the KBs and scripts, can be found here - http://cormachogan.com/2016/03/31/vsan-6-2-upgrade-failed-realign-objects/

Thanks for your patience.

http://cormachogan.com
Bill_Oyler
Hot Shot

Thanks for the script, Cormac. It turned out that in my case, all of the objects that failed "due to being locked or lack of vmdk descriptor file" were either ".vswp" files or "-internal.vmdk" files from floating VMware Horizon View desktop VMs. These files were not orphaned according to VSAN -- running the VSAN health check in RVC reported no orphaned objects. They seemed to be "locked", however, despite my having rebooted all of the ESXi hosts and restarted the management agents.

I think the root cause is that VMware Horizon View ships with out-of-the-box VSAN storage policies specifying FTT=0 for "floating" desktops. This is dangerous IMHO, because host reboots and ESXi patching cause these "floating" VMs to end up in a "zombie" state. I am guessing this is why I ended up with all of these locked files. I have since changed the VSAN storage policies to FTT=1 in hopes of preventing this in the future.

I ended up deleting and re-provisioning the floating desktops via VMware Horizon View, and using the "/usr/lib/vmware/osfs/bin/objtool delete -u" option to delete the old object references that did not go away cleanly. Then I was able to proceed with the VSAN v3 file system upgrade.

This was all quite labor intensive -- way more "babysitting" than I've ever had to do with a traditional external storage array -- so I sure hope VMware can focus on making VSAN as user-friendly as traditional external storage, with less manual administrative effort required. It's frustrating how much "poking around" needs to be done in RVC and the ESXi CLI to resolve some of these obscure issues.
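For anyone hitting the same thing, the objtool cleanup step looks roughly like this (a sketch only -- the UUID is a placeholder for the object UUID reported by the realign script, and you should confirm what the object actually is before deleting anything):

# On an ESXi host in the VSAN cluster, via SSH
# Check the object's attributes/path first to confirm it is the stale .vswp or -internal.vmdk you expect
/usr/lib/vmware/osfs/bin/objtool getAttr -u <object-uuid>
# Then remove the stale object reference
/usr/lib/vmware/osfs/bin/objtool delete -u <object-uuid>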

Bill Oyler Systems Engineer
CHogan
VMware Employee

Thanks for this data point, Bill. I've shared it with our engineering team.

http://cormachogan.com
KurtDePauw1
Enthusiast

Hello all ....

Today I ran the script and it reported "upgraded" ... all VMs had CBT enabled, so it looked like the issue was fixed by running the script.

Unfortunately, after trying to upgrade the disks I got the two errors below (this is a 3-node VSAN):

Remove disks from use by Virtual SAN
Target: 10.0.0.113
Error: A general system error occurred: Failed to evacuate data for disk uuid 528cebe8-a2c7-45ed-2065-89ba46acb1c9 with error: Out of resources to complete the operation
Initiator: com.vmware.vsan.health
Server: vcenter.vsphere.local
Times: 01-Apr-16 11:05:30 AM / 01-Apr-16 11:05:30 AM / 01-Apr-16 11:05:34 AM

Convert disk format for Virtual SAN
Target: VSAN Cluster
Error: A general system error occurred: Failed to evacuate data for disk uuid 528cebe8-a2c7-45ed-2065-89ba46acb1c9 with error: Out of resources to complete the operation
Initiator: xxxxxxxx\Administrator
Server: vcenter.vsphere.local
Times: 01-Apr-16 11:05:27 AM / 01-Apr-16 11:05:27 AM / 01-Apr-16 11:05:40 AM

paudieo
VMware Employee

Hi

The release notes call out that you have to use an RVC command when a VSAN cluster has limited resources, e.g. a 3-node cluster; a rough example of the invocation follows the quoted notes below.

VMware Virtual SAN 6.2 Release Notes

Upgrading the On-disk Format for Hosts with Limited Capacity

During an upgrade of the Virtual SAN on-disk format, a disk group evacuation is performed. Then the disk group is removed and upgraded to on-disk format version 3.0, and the disk group is added back to the cluster. For two-node or three-node clusters, or clusters that do not have enough capacity to perform an evacuation of each disk group, you must use the following RVC command to upgrade the on-disk format: vsan.ondisk_upgrade --allow-reduced-redundancy

When you allow reduced redundancy your VMs will be unprotected for the duration of the upgrade, because this method does not evacuate data to the other hosts in the cluster. It simply removes each disk group, upgrades the on-disk format, and adds the disk group back to the cluster. All objects remain available, but with reduced redundancy.

If you enable deduplication and compression during the upgrade to Virtual SAN 6.2, you can select Allow Reduced Redundancy from the vSphere Web Client.
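
For a two- or three-node cluster like the one above, the RVC session would look roughly like this (the cluster path is just an example inventory path -- adjust it to your environment):

# From an RVC session connected to vCenter
vsan.ondisk_upgrade ~/computers/VSAN-Cluster --allow-reduced-redundancy
# Watch the rebuild/resync activity afterwards
vsan.resync_dashboard ~/computers/VSAN-Cluster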

KurtDePauw1
Enthusiast

All is working fine now; the resync took more than a day!

Thanks to CHogan and the VMware team for helping the community with the script (the problem on my side was CBT),

and thanks to paudieo for pointing out the command line for the 3-host setup!!

I worked in a huge enterprise before, with enough cash that a test lab was "normal to have".

After switching jobs to a much smaller company after 14 years, I never thought a lab setup would come in handy this much.

This past week we ordered a test environment so we can test updates before implementing them in the live environment!

I recommend this to all sysadmins, no matter how small your environment is. :)

nohome
Contributor

Please refer to this document; it can solve this problem.

VMware KB: VMware Virtual SAN 6.2 on disk upgrade fails at 10%

Perttu
Enthusiast

Hi all

We're having this issue with a total of 62 objects, and they fall into exactly two kinds. Most are due to a broken snapshot chain caused by a missing replica at the beginning of the chain (all of these are VDI linked clones), and the rest are App Volumes vmdks with a different error. What is interesting is that all of the VDI linked-clone objects having issues are actually orphaned, i.e. there is no longer an actual VM referring to these files. I have no idea which mechanism produces such residue. Here are some samples from the vsanrealign.py script output:

Scanned 340 of 349 namespaces so far

Finished scanning, compiling results

-------------------------------------------------------------------

These objects have descriptors, but are part of a snapshot chain where the chain couldn't be opened.

-------------------------------------------------------------------

Object UUID: b312e256-f9d6-dfb5-5112-5cb90188c50c

    Recorded Path: /vmfs/volumes/vsan:5228d697170dd8f1-404d06037b4f7f03/7d52d956-b88a-bf6d-fa69-5cb90188c5cc/vdi2-nameofthevm.vmdk

    Recorded VM: vdi2-nameofthevm

  Output of chain consistency check:

    'vmkfstools -e /vmfs/volumes/vsan:5228d697170dd8f1-404d06037b4f7f03/7d52d956-b88a-bf6d-fa69-5cb90188c5cc/vdi2-nameofthevm.vmdk'.

Disk link /vmfs/volumes/vsan:5228d697170dd8f1-404d06037b4f7f03/7d52d956-b88a-bf6d-fa69-5cb90188c5cc/vdi2-nameofthevm.vmdk successfully opened.

Failed to open disk link /vmfs/volumes/vsan:5228d697170dd8f1-404d06037b4f7f03/cb03e256-f841-1316-9396-5cb90188c500/replica-f1e82627-4427-4d0c-8139-7ef968c7d79d_3-000005.vmdk :The system cannot find the file specified (25)Disk chain is not consistent : The parent of this virtual disk could not be opened (23)

Fix issue with parent if possible. See KB 1004232

  Recommended Remove Action: vmkfstools -U '/vmfs/volumes/vsan:5228d697170dd8f1-404d06037b4f7f03/7d52d956-b88a-bf6d-fa69-5cb90188c5cc/vdi2-nameofthevm.vmdk'

and

-------------------------------------------------------------------

These objects encountered an unknown error during the scanning process

-------------------------------------------------------------------

Object UUID: f607e156-8897-76f4-ec57-5cb90188c620

    Recorded Path: /vmfs/volumes/vsan:5228d697170dd8f1-404d06037b4f7f03/cloudvolumes/apps/BasicSoftwareSuite!20!10.3.2016.vmdk

    Recorded VM: cloudvolumes

Errors:

2016-04-15T22:48:38.645Z VsanSparseRealign: GetExtents: DiskLib_Open() failed Disk encoding error (61)

2016-04-15T22:48:38.645Z VsanSparseRealign: Error Closing handle: Disk encoding error (61)

Let's investigate what is actually on the vsanDatastore:

[user@host] ls -l /vmfs/volumes/vsanDatastore/vdi2-nameofthevm*

lrwxr-xr-x    1 root     root            36 Apr 19 11:25 /vmfs/volumes/vsanDatastore/vdi2-nameofthevm -> 7d52d956-b88a-bf6d-fa69-5cb90188c5cc

lrwxr-xr-x    1 root     root            36 Apr 19 11:25 /vmfs/volumes/vsanDatastore/vdi2-nameofthevm_1 -> 4862fe56-842f-5c43-efdd-5cb90188c500

[user@host] ls -ld /vmfs/volumes/vsanDatastore/7d52d956-b88a-bf6d-fa69-5cb90188c5cc

drwxr-xr-t    1 root     root          4480 Mar 30 05:50 /vmfs/volumes/vsanDatastore/7d52d956-b88a-bf6d-fa69-5cb90188c5cc

[user@host] ls -ld /vmfs/volumes/vsanDatastore_Desktop_1/4862fe56-842f-5c43-efdd-5cb90188c500

drwxr-xr-t    1 root     root          4340 Apr 18 12:50 /vmfs/volumes/vsanDatastore/4862fe56-842f-5c43-efdd-5cb90188c500

As we can see from the timestamps, the first directory is orphaned but is still consuming space on the vsan datastore with all of its vmdk descriptors and VM home files.

Therefore we would ultimately like a VMware-provided process/script that goes through all VSAN objects and lets the user delete the ones that are no longer referenced by any VM. It is quite tedious to destroy the vmdks one by one with vmkfstools -U; I suppose that non-vmdk files could simply be rm'd. Also, I have no idea what to do with these App Volumes errors.
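In the meantime, a small shell loop along these lines saves some typing (just a sketch -- /tmp/orphaned-vmdks.txt is a hand-built list of the "Recorded Path" values from the script output, and every path should be verified as truly orphaned before running it):

# Run on an ESXi host; removes each listed VMDK with vmkfstools -U
while read -r vmdk; do
  echo "Removing ${vmdk}"
  vmkfstools -U "${vmdk}"
done < /tmp/orphaned-vmdks.txt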

elerium
Hot Shot

I upgraded to 6.2 over the weekend on a patched ESXi 6.0 U2 cluster with ESXi600-201605001 included, which should fix this issue, but I still got this error on about 130 objects. I tried the VsanRealign.py script from VMware KB: VMware Virtual SAN 6.2 on disk upgrade fails due to CBT enabled virtual disks and the script ran without any errors. I don't use any backup software on this cluster, so I don't think it's CBT related.

Ultimately I was able to fix it manually (very tedious) by running "vsan.object_info <cluster> <uuid>" in RVC to find the associated VMs, then using Storage vMotion to move them to another cluster (VSAN 6.1) and back again.
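
In case it helps anyone, that lookup step in RVC looks roughly like this (the cluster path and UUID are placeholders):

# From an RVC session connected to vCenter
vsan.object_info ~/computers/VSAN-Cluster <object-uuid>
# The output includes the object's path, which identifies the owning VM;
# that VM can then be Storage vMotioned off the VSAN datastore and back.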

Bleeder
Hot Shot

Also still getting the error even with ESXi600-201605001 on all hosts. The sad thing is, prior to this I had opened a case with VMware to make sure everything was in order before upgrading. They said everything looked fine, but to wait for that patch because we use App Volumes. I guess there is yet another bug that hasn't been made public...

No CBT/backups in use here either.
