VMware Horizon Community
LukaszDziwisz
Hot Shot

AppVolumes 2.18.6 weird file lock issue

Hello Everyone,

We upgraded our App Volumes Managers and Agents from 2.16 to 2.18.6, and since then we have been seeing some weird issues with user writables. We have two separate sites with separate AV Managers and separate AV databases, running on separate storage and vCenters. The issue has appeared on both sites since the upgrade.

What is happening is that a user tries to log in and it just spins on "Preparing Windows". In View I see the VM in "Already Used" status; in vCenter the VM looks fine and has nothing attached to it, no AppStacks and no writable. When I look at the AV Manager System Messages tab, I see the following:

Failed to reconfigure VM "VMNAME" (50134816-444f-a10d-9fd5-aa2d4fec10ff):
Fault: Failed to add disk 'scsi0:1'. (GenericVmConfigFault)
* Failed to add disk 'scsi0:1'. (msg.disk.hotadd.Failed)
* Failed to power on 'scsi0:1'. (msg.disk.hotadd.poweron.failed)
* Cannot open the disk '/vmfs/volumes/5d88ee62-45e75ba2-6025-0025b5aa0afd/cloudvolumes/writable/DOMAIN!5C!username!20!on!20!W10x64.vmdk' or one of the snapshot disks it depends on. (msg.disk.noBackEnd)
* Failed to lock the file (msg.fileio.lock)
* File system specific implementation of OpenFile[file] failed (vob.fssvec.OpenFile.file.failed)
* File system specific implementation of OpenFile[file] failed (vob.fssvec.OpenFile.file.failed)
* File system specific implementation of OpenFile[file] failed (vob.fssvec.OpenFile.file.failed)
* File system specific implementation of OpenFile[file] failed (vob.fssvec.OpenFile.file.failed)
* File system specific implementation of OpenFile[file] failed (vob.fssvec.OpenFile.file.failed)
* File system specific implementation of OpenFile[file] failed (vob.fssvec.OpenFile.file.failed)

 

When I look in the Writables section and search for it, I see it in Detached status and the user showing as Offline. In vCenter I can see the VMDK and metadata file for that user, but when I try to copy it, it errors out with the following message:

Unable to access file [DATASTORE] cloudvolumes/writable/DOMAIN!5C!username!20!on!20!W10x64.vmdk since it is locked

I checked from ESXi when that happens, and it reports that the file is not locked by any ESXi host in the cluster. App Volumes is the only service that has access to and performs operations on those disks.
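For anyone else chasing this, the way I double-check the lock owner is roughly the following, run from an ESXi shell (a sketch only; the path is the writable from the error above, vcenter.example.com and the account are placeholders, and vmfsfilelockinfo is only present on newer ESXi builds):

vmkfstools -D /vmfs/volumes/<datastore>/cloudvolumes/writable/DOMAIN!5C!username!20!on!20!W10x64.vmdk
# the "owner" field in the output shows the MAC address of the host holding the lock;
# all zeros generally means no remote host in the cluster holds it

vmfsfilelockinfo -p /vmfs/volumes/<datastore>/cloudvolumes/writable/DOMAIN!5C!username!20!on!20!W10x64.vmdk -v vcenter.example.com -u administrator@vsphere.local
# resolves the owning MAC address to an actual host name via vCenter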

It is very random and I cannot reproduce it; we usually see 1 or 2 instances like that a day. As a workaround, I switch the user's home site to the other site and they can log in there just fine. Then a couple of hours later, sometimes the next day, I switch it back and their writable works fine again without any further action.

 

Did any of you encounter a similar problem? I have a ticket open with VMware but so far haven't heard anything back.

 

Any help would be greatly appreciated.

8 Replies
richiefez
Enthusiast

Hi, 

Were you able to fix it? If so, could you please share the fix.

dbrutus
Enthusiast

Hello, wondering if you updated the writables with the latest snapvol.cfg? If not, I recommend you copy the latest snapvol.cfg from the writable template and update one of the affected writables.

LukaszDziwisz
Hot Shot

Hello,

I'm not sure if I fixed it 100%, but those cases have definitely gone down quite a bit. What we noticed, based on the Events in Horizon, is that the writable was attached to another machine with someone else logged in to it, so technically that VM had two writables attached. I went in and disconnected the locked writable from it, and all of a sudden the user was able to log in with no problems. Reviewing the events in Horizon, it appears that the pool allocated a VM for that user and App Volumes started the attachment, but the process never fully completed, and someone else then logged in to that VM because it was still showing as available.

What VMware told me is that the reason for these issues is that we are not forcing a log off at disconnect; instead we allow users 120 minutes before the machine gets logged off and refreshed. We have had that setting since the beginning, so I disagreed with them, but I did find the following article:

https://www.ituda.com/vmware-horizon-view-already-used-status-and-pae-dirtyvmpolicy/

 

So I set all my pools with this attribute, so that if a machine is in that weird state it will delete itself and therefore release the user's writable. So far I can honestly say I might have had one case like this in the past 2 weeks, so it is definitely an improvement.
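For reference, this is roughly what the change looks like (a sketch based on the article above rather than official guidance; the DN below is the standard View ADAM layout, so verify yours in ADSI Edit before changing anything):

# ADSI Edit on a Connection Server -> connect to localhost:389, naming context DC=vdi,DC=vmware,DC=int
# the pool object sits under OU=Server Groups, e.g. CN=<your pool id>,OU=Server Groups,DC=vdi,DC=vmware,DC=int
pae-DirtyVmPolicy: 2
# 0 = mark the desktop "Already Used" (default), 1 = put it back in the pool without a refresh,
# 2 = refresh/delete the dirty machine, which is what ends up releasing the stuck writable for us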

 

As far as the template goes, after upgrading App Volumes I usually only update the templates, but I never really grab the snapvol.cfg and update existing writables with it. Any specific reason to do that?

Ray_handels
Virtuoso

Just here for the snapvol.cfg part. In this file (which, in 4.x, is located in the CloudVolumes application folder in the golden image) you select which files should be excluded from the writable. If you look into the file you will see that, for example, antivirus files are excluded from the writable. It is worth a look to fix that.
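To give an idea of what I mean, the entries in snapvol.cfg are just plain exclusion lines like the ones below (illustrative only; the exact directives and paths come from the snapvol.cfg that ships with your template version, so don't copy these blindly):

# keep antivirus data and keys out of the volume
exclude_path=\ProgramData\McAfee
exclude_registry=\REGISTRY\MACHINE\SOFTWARE\McAfee
# newer 2.x writable templates also have writable-specific variants
exclude_uwv_file=\Program Files (x86)\SomeVendor
exclude_uwv_reg=\REGISTRY\MACHINE\SOFTWARE\SomeVendor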

If, however, you already update the template to the newest version, the snapvol.cfg is already updated (unless you are using version 4.0, that is :)).

Just wanted to note that I don't really get the issue with keeping the user logged in after disconnect. As far as I'm aware, it should not be a problem to just leave the user in a disconnected state. You should, however, completely remove or refresh the machine after use!! Otherwise it could indeed still keep the "old" writable attached.

LukaszDziwisz
Hot Shot

Hello Ray,

We are still on 2.18.6, so the only thing I was doing after the upgrade is updating the writable and application templates. I'm really hoping to make the move to App Volumes 4 some time this year and consolidate our images, but that will depend on the time available. As for refreshing desktops, we do not allow reusing VMs; we simply allow the user to be disconnected for 120 minutes before the clone is logged off and destroyed.

I know the VMware recommendation is to log off at disconnect, but that is not very user friendly, especially if someone wants to pick up work they started on one device and needs to reconnect from another one after a break or something else. In our case, adding the pae-DirtyVMPolicy=2 attribute to each pool seems to have helped quite a bit. Again, I am not sure whether it was the App Volumes upgrade or Horizon 7.13 that made this worse, but it seems like some users were able to log in to a machine that someone else was using; another explanation is that a user's logon didn't fully complete but the writable was attached and never got released. From Horizon's point of view the VM was available, and another user logged in to it. What is very weird is that when I looked at the App Volumes Manager, the writable was not reported as being attached to anything. The only way I could trace it was to look at the events for a problem user, find the original VM that was assigned, and then manually detach that writable in vSphere.

 

Ray_handels
Virtuoso

I have seen this issue in a very, very old version. What happened there is that, due to the large number of logons and logoffs we had, the App Volumes Manager did a refresh every few hours (I thought it was 4) and somehow ended up telling the database that there was no writable attached to a specific machine although the user was still logged in. When the user then logged out, the writable would still be attached because the App Volumes database had no notion of it still being attached. Normally that would be fixed with a simple restart of the machine, as it would reconnect to the database again, but this might just be something in that direction??

richiefez
Enthusiast

We realized that the VMDK files for the writable volumes were in a locked state. We had an ESXi upgrade scheduled at the same time, so rebooting all the ESXi hosts fixed the issue. Still not sure why they got locked, though. I was told that an ESXi host in the cluster had gone down a day earlier; I guess that host was holding the locks and caused the issue.

LukaszDziwisz
Hot Shot

Hmm, interesting. We did not have upgrades scheduled or anything like that. For us it just seems more like a Horizon issue allowing users to log in to a machine that has already been taken by someone else, almost like the refresh of logged-off VMs wasn't happening. With the attribute it seems much better though.
