Solved: Re: UEM hanging/delaying randomly. - Page 6

Sabian0309 · ‎12-13-2018

Running Horizon 7.6 with UEM 9.5 and App Volumes 2.14. Parent is Windows 10 1607 LTSB, floating instant clones. AppData is not redirected; Documents, Favorites, Desktop, etc. are.

Occasionally I am seeing login delays while UEM is processing. On average, UEM processing takes 5-6 seconds. However, for some users it occasionally takes 25+ seconds, with UEM not reporting why.

Example from logs:

2018-12-12 07:57:37.331 [INFO ] Config file '\\jfv-vm-fs2\UEM Configuration\general\Applications\Acrobat Reader.INI' added to DirectFlex cache

2018-12-12 07:57:56.269 [INFO ] Config file '\\jfv-vm-fs2\UEM Configuration\general\Applications\Adobe Acrobat.INI' added to DirectFlex cache

2018-12-12 08:39:13.444 [DEBUG] ImportRegistry::Import: Calling '"C:\Windows\REGEDIT.EXE" /S "C:\Users\test\AppData\Local\Temp\FLX6AB5.tmp"' (RPAL: l=0 (D/E), r=0)

2018-12-12 08:39:13.490 [DEBUG] Read 3 entries from profile archive (size: 151258; compressed: 31466; took 64 ms; largest file: 83810 bytes; slowest import took 0 ms)

2018-12-12 08:39:32.428 [DEBUG] Conditions: Condition set 'Microsoft Office 2013.xml' was previously evaluated to true

2018-12-12 14:02:29.110 [DEBUG] Read 25 entries from profile archive (size: 63987712; compressed: 7889959; took 427 ms; largest file: 26738688 bytes; slowest import took 154 ms)

2018-12-12 14:02:48.063 [DEBUG] Conditions: Check for endpoint name = false ('C81MK6V1' is not equal to 'LJ2359H2')
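Pauses like the ones in the excerpts above can be picked out of a larger log with a few lines of Python. This is only a rough sketch: it assumes every relevant line starts with the timestamp format shown above, and the thresholds `min_gap` and `max_gap` are hypothetical values chosen for illustration (anything longer than `max_gap` is assumed to be the boundary between two sessions rather than a stalled logon). The file encoding in `print_gaps` is also an assumption.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the log excerpts above,
# e.g. "2018-12-12 07:57:37.331 [INFO ] ..."
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) ")


def find_gaps(lines, min_gap=10.0, max_gap=120.0):
    """Return (seconds, previous_line, next_line) for suspicious pauses.

    min_gap and max_gap are hypothetical thresholds, not UEM settings.
    """
    gaps = []
    prev_ts = prev_line = None
    for line in lines:
        match = TS_RE.match(line)
        if not match:
            continue  # skip blank lines and lines without a timestamp
        ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        if prev_ts is not None:
            delta = (ts - prev_ts).total_seconds()
            if min_gap <= delta <= max_gap:
                gaps.append((delta, prev_line, line))
        prev_ts, prev_line = ts, line
    return gaps


def print_gaps(path):
    """Print every suspicious pause found in the log file at `path`."""
    # Encoding is an assumption; adjust it if the log is stored differently.
    with open(path, encoding="utf-8", errors="replace") as log:
        for seconds, before, after in find_gaps(log):
            print(f"{seconds:6.1f}s  {before.strip()}")
            print(f"         {after.strip()}")
```

Calling `print_gaps` with the path to a log file lists each pair of consecutive entries that are between 10 and 120 seconds apart, which makes it easier to see which step the delay precedes.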

I can reproduce this by logging in over and over. Sometimes it happens and sometimes it does not, but it occurs frequently enough to be easily reproducible. It does not appear to be related to any specific pool or UEM setting. It also isn't tied to any specific hosts, users, type of client (thin or laptop), or version of Horizon.

The UEM data is stored on a Windows 2008 R2 server file share, which is a VM running off an all-flash datastore on a Dell Compellent. There are no indications that either the Compellent or the file server is struggling to serve the data. This is as likely to happen during the morning login as it is when people leave for the day.

This appears to have been going on for a long while; after the previous admin, this environment is only now getting healthy enough that smaller issues like this can be tracked down. But my Google-fu is failing to find any possible solutions for this.

Any help would be appreciated.

Thank you,

Billy

DEMdev · ‎06-21-2019

The conversation with Microsoft is still ongoing, but it looks like we've found the root cause. That has led to us finally being able to reproduce the delays, also without any UEM components in play.

If you're encountering this "delay when reading files from the configuration share" issue and are using non-persistent VDI, it would be great if you could test the following workaround: stop the Workstation service before the VM is deleted, for instance by configuring a shutdown script through Group Policy to run C:\Windows\System32\net.exe stop /y workstation

A few of you who have cases open with VMware support already received this request through that channel, but I'd love to get a bit more feedback. Thanks!

JohnTwilley · ‎06-21-2019

Just to ensure I'm following you... Are you saying that the VMs shutting down are causing the Read Delay on the Config Share for the VMs logging in?

And that stopping the Workstation service during shutdown helps?

Seems a bit strange. Care to elaborate?

DEMdev · ‎06-21-2019

Hi John,

You are following me perfectly

Even though the UEM agent opens its config files for read, from a read-only share, the file server grants a write caching lease. Only a single client can have such a lease, so if another client subsequently wants to read that same file, the file server tells the first client that it needs to let go of its write caching lease. Once the first client has acknowledged that request, the file server responds to the second client. (Which then gets the write caching lease, so if a third client wants to read that file, all of this happens all over again, ad infinitum.)

Write caching leases for read-only files are a bit weird, but not necessarily a problem. Having that additional traffic for "let go of your write caching lease" is not ideal, but also not a problem in and of itself.

However... If the file server can't reach the client that it granted a write caching lease to, it will try a few times, and eventually time out. Only once it times out will it respond to the second client that wants to read this file – that's the delay we're seeing.

With non-persistent VDI, VMs release their IP address at some point during the shutdown (before being deleted), which means that the SMB client might not be able to inform the SMB server that it's going away. Which subsequently means that the server will try to contact a client that's no longer around, resulting in the delay as described above.

For instance:

User logs off from VM1
- UEM path-based export is performed
- UEM agent reads config file
- SMB client gets write cache lease
- Export is completed
- VM gets deleted
User logs on to VM2
- UEM path-based import is performed
- UEM agent tries to read a config file
- File server tries to contact SMB client on VM1 to have it give up its write cache lease, but can't, as it's no longer around.
  Delay occurs.
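
The sequence above can be sketched as a toy model in Python. It only illustrates the reasoning in this post and says nothing about how the SMB server is actually implemented: the class and function names are made up for the example, and `BREAK_TIMEOUT_S` is a hypothetical value picked to resemble the roughly 19-second pauses in the logs earlier in this thread, not a documented SMB constant.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical timeout, chosen to resemble the ~19 s log gaps in this thread.
BREAK_TIMEOUT_S = 19.0


@dataclass
class Client:
    name: str
    reachable: bool = True  # False once the VM has released its IP or is gone


@dataclass
class FileServer:
    lease_holder: Optional[Client] = None

    def read(self, client: Client) -> float:
        """Return the seconds this read is delayed by lease handling."""
        delay = 0.0
        holder = self.lease_holder
        if holder is not None and holder is not client:
            # The server asks the current holder to give up its lease.
            # A reachable holder acknowledges right away; for a vanished
            # holder the server waits until the request times out.
            if not holder.reachable:
                delay = BREAK_TIMEOUT_S
        self.lease_holder = client  # the new reader now holds the lease
        return delay

    def client_disconnects(self, client: Client) -> None:
        """Models stopping the Workstation service: the lease is handed back."""
        if self.lease_holder is client:
            self.lease_holder = None


def logoff_then_logon(stop_workstation_first: bool) -> float:
    """Delay seen by VM2 after VM1 logged off and was deleted."""
    server = FileServer()
    vm1, vm2 = Client("VM1"), Client("VM2")
    server.read(vm1)                     # logoff export reads the config file
    if stop_workstation_first:
        server.client_disconnects(vm1)   # the shutdown-script workaround
    vm1.reachable = False                # VM releases its IP and is deleted
    return server.read(vm2)              # logon import on another VM
```

In this model `logoff_then_logon(False)` returns the full timeout, while `logoff_then_logon(True)` returns no delay, which is the effect the workaround is meant to have.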

Note that the delay can occur just as well if there are two logoffs occurring shortly after each other. I guess the focus is on delays during logon as those are much more noticeable for users.

Explicitly stopping the Workstation service effectively informs the SMB server that this client no longer needs its leases (and in our tests, this runs before the VM releases its IP address, so this information actually makes it over to the file server), preventing the server from subsequently trying to reach a client that no longer exists.

EDITED: Clarified that the problem most probably relates to the fact that an Instant Clone VM releases its IP address during shutdown, which might affect the "normal" shutdown of the Workstation service, in the sense that by then it might have lost its network connectivity. (Where "Clarified" means: I did not know this when I wrote my initial description 🙂)

JohnTwilley · ‎06-21-2019

Thank you for the update. That is a very interesting idea indeed. Nice find!

Wow. It really makes sense. Now WHO actually fixes the issue is another thread in itself! Microsoft? VMware?

What we need is a good 'ol fashioned band-aid. Your idea of stopping the workstation service seems reasonable. But will it interfere with any log off syncing or other tasks?

Maybe we should focus on the various options of stopping the workstation service (thus freeing the leases) as part of the shutdown process.

What have you tested so far? Just a Group Policy Logoff script?

DEMdev · ‎06-21-2019

Hi John,

This must be done from a Group Policy shutdown script. Logoff would be too soon (as you indicate), and won't have the correct permissions.

How this needs to be addressed (and by whom) is still a topic for discussion

BTW, what Horizon version are you on?

JohnTwilley · ‎06-21-2019

Horizon 7.6
Isilon SMB shares - "with the OPLock disabled on the share" (as per EMC). This does not seem to help much.

I'm running the Shutdown Group Policy now on a few test pools to see how it handles.

It will be difficult to determine success unless you can actively monitor the leases...that's the hard part.

DEMdev · ‎06-21-2019

Hi John,

Thanks! Trying to find out whether this issue is particular to specific Horizon versions, as there might be a completely unrelated change in Horizon 7.8 that might sufficiently affect certain things to prevent the issue from occurring...

As for determining success: as I described, the delay is unrelated to UEM components. So, if you know which Isilon share you'll hit from a particular VM that you're just about to log off from, it should be pretty easy to repro from another VM (accessing the same Isilon share):

1. Log on to two VMs.
2. In one, open a command prompt and "prepare" a statement like "type \\server\UEMConfigShare\General\some-config-file-that-you-have.ini". Don't press enter yet.
3. Log off from the other, causing a UEM export that will also read that particular INI file (receiving the write lease).
4. Wait until that particular VM is being deleted.
5. In the first VM, press enter. If the timing worked out, you should now see a noticeable delay.
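
If you would rather have a number than eyeball the delay, the read in the last step could be timed with a small script instead of "type". This is just a sketch, assuming Python is available on the test VM; `timed_read` is a made-up helper name, and the path in the comment is the same placeholder as in the steps above.

```python
import time
from pathlib import Path


def timed_read(path):
    """Read the whole file once and return (elapsed_seconds, byte_count)."""
    start = time.perf_counter()
    data = Path(path).read_bytes()
    return time.perf_counter() - start, len(data)


# Example call (placeholder path, replace with a config file on your share):
# elapsed, size = timed_read(
#     r"\\server\UEMConfigShare\General\some-config-file-that-you-have.ini")
# print(f"read {size} bytes in {elapsed:.3f} s")
```

A read that normally completes in milliseconds but takes many seconds right after the other VM was deleted would point at the lease timeout described earlier.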

DEMdev · ‎06-21-2019

Re Horizon 7.8: never mind, seems that that completely unrelated change is not sufficient to prevent this issue...

ijdemes · ‎06-23-2019

I'm having the same issue in both Horizon 7.7 and 7.8. I also have the issue without using Horizon, that is, with two simple VMs with just the UEM agent installed.

\\ Ivan
---
Twitter: @ivandemes
Blog: https://www.ivandemes.com

DEMdev · ‎06-24-2019

Hi Ivan,

Yeah, I know, that one is still bugging me a bit...

Ray_handels · ‎06-24-2019

Wow, that's very interesting.

One thing I would like to ask, though. We are still using linked clones and not instant clones. Would we be facing the exact same issue, given that the linked clone machines are not deleted after use? We just do a refresh after use.

If so, I will try and get together with my colleague to see if we can test this with a few pools.

One thing to keep in mind for us is that we see this behavior the entire day and during the morning we do have a lot of people logging on but not really logging off.

Also, we saw a logon that took about 3 minutes, with the gap in it, but during that period a few other users were able to log in correctly without any delay. If the SMB lock were indeed activated during the shutdown of a machine, wouldn't all logons during that period be slow?

ijdemes · ‎06-24-2019

Hi @Ray_handels,

Up to now, I have seen this issue taking place with both instant clones (Horizon) and non-Horizon VMs. With UEM, but also without UEM.

With regard to your last question: it may depend on which files are locked at that moment. Those may be files that the user who isn't having the issue at that moment doesn't have permissions for. Maybe that's the case?

Something I also tested (at the request of UEMdev) is using a Windows Server 2019 or Windows 10 (v1809) file share and disabling leasing for that share (Set-SmbShare -LeasingMode None). With that in place, the issue no longer occurs. Not the solution for everyone, but just to provide another option for some.

\\ Ivan
---
Twitter: @ivandemes
Blog: https://www.ivandemes.com

DEMdev · ‎06-24-2019

Hi Ray,

We are still using Linked clones and not instant clones. Would we be facing the exact same issue as the linked clone machines are not being deleted after use, we just do a refresh after using it.

I pretty much have no clue whatsoever about Horizon, so I just checked with, umm, sort-of-HorizonDev. They indicated that the shutdown process for Linked Clones would be similar, so I'd love it if you could try the workaround.

As for the different delays you're seeing: I'm not quite sure, to be honest. A single machine shutting down can cause the delay for multiple clients, and leases also expire on the server side after a little while, so the exact symptoms are rather sensitive to timing.

JohnTwilley · ‎06-24-2019

UEMdev

I was wondering... What if we had other options for accessing the UEM configuration data that were not dependent on CIFS/SMB? Like HTTPS, FTP, WebDAV, whatever.

That would resolve the File Leasing issue. It would complicate the environment a little, but for these large enterprise installations it's no big deal.

We just need fast, dependable, performance.

My nurses don't care about my File Leasing performance issues. I care about these issues for them.

DEMdev · ‎06-24-2019

Hi John,

Although that would be a way to not encounter this particular SMB-related issue, it feels a bit overkill to completely rearchitect the product (and get rid of one of the aspects that makes it so easy to implement) to deal with a Microsoft or Horizon issue

sjesse · ‎06-24-2019

Not to get off topic, but I have been secretly hoping for a move from SMB to an HTML or even a REST API for managing the configurations. That would open up options for automation and management that aren't possible right now.

DEMdev · ‎06-24-2019

Hi sjesse,

We'd love to hear your (detailed) thoughts around that, but preferably in a separate thread. Thanks!

Sabian0309 · ‎06-24-2019

This is awesome. I have tested this in my testing pool and will be rolling this out over the next week. So far it seems to be doing the trick. You rock, sir.

DEMdev · ‎06-25-2019

Hi Sabian0309,

Very happy to hear that initial results are looking good for you! Thank you for your patience, and for all the tests you've performed over the last half year.

Ray_handels · ‎06-27-2019

Just posting here to say that we tested the solution provided to us by UEMdev and indeed the issue is fixed. At least we don't see oplocks anymore on the configuration files. Now we still need to fix some slow imports but hey, one thing at a time...

Also, not only does it fix the issue, we also don't see any other issues with the fix. We are using View 7.5.1, App Volumes 2.16, and UEM 9.4 with W10 1703 and vGPU. So it seems that this setting can be used safely.
