Solved: Re: UEM hanging/delaying randomly.

Sabian0309 · ‎12-13-2018

Running horizon 7.6 with UEM 9.5 and appvol 2.14. Parent is windows 10 1607 LTSB, floating instant clones. App data is not redirected, documents/favorites/desktop/etc.. are.

Occasionally i am seeing login delays when processing UEM. On average UEM processes between 5-6 seconds. However we have some users that it takes 25+ seconds occasionally with UEM not reporting why.

Example from logs:

2018-12-12 07:57:37.331 [INFO ] Config file '\\jfv-vm-fs2\UEM Configuration\general\Applications\Acrobat Reader.INI' added to DirectFlex cache

2018-12-12 07:57:56.269 [INFO ] Config file '\\jfv-vm-fs2\UEM Configuration\general\Applications\Adobe Acrobat.INI' added to DirectFlex cache

2018-12-12 08:39:13.444 [DEBUG] ImportRegistry::Import: Calling '"C:\Windows\REGEDIT.EXE" /S "C:\Users\test\AppData\Local\Temp\FLX6AB5.tmp"' (RPAL: l=0 (D/E), r=0)

2018-12-12 08:39:13.490 [DEBUG] Read 3 entries from profile archive (size: 151258; compressed: 31466; took 64 ms; largest file: 83810 bytes; slowest import took 0 ms)

2018-12-12 08:39:32.428 [DEBUG] Conditions: Condition set 'Microsoft Office 2013.xml' was previously evaluated to true

2018-12-12 14:02:29.110 [DEBUG] Read 25 entries from profile archive (size: 63987712; compressed: 7889959; took 427 ms; largest file: 26738688 bytes; slowest import took 154 ms)

2018-12-12 14:02:48.063 [DEBUG] Conditions: Check for endpoint name = false ('C81MK6V1' is not equal to 'LJ2359H2')

I can reproduce this by logging in over and over. At times it happens, at times it does not although frequently enough it is easily reproducible. It does not appear to be related to any specific pool or UEM setting. It also isn't tied to any specific hosts, users, or type of client (thin or laptop), or version of horizon.

The UEM data is stored on a windows 2008 R2 server file share, which is a VM running off an all flash datastore on a Dell Compellent. There are no indications the Compellent it is struggling to server the data, or on the file server. This is as likely to happen during the morning login as it is when people leave for the day.

This appears to of been going on for a long while, this environment is just now getting healthy enough from the previous admin to where smaller issues like this can be tracked down. But my googlefu is failing to find any possible solutions for this.

Any help would be appreciated.

Thank you,

Billy

DEMdev · ‎06-21-2019

The conversation with Microsoft is still ongoing, but it looks like we've found the root cause. That has led to us finally being able to reproduce the delays, also without any UEM components in play.

If you're encountering this "delay when reading files from the configuration share" issue and are using non-persistent VDI, it would be great if you could test the following workaround: stop the Workstation service before the VM is deleted, for instance by configuring a shutdown script through Group Policy to run C:\Windows\System32\net.exe stop /y workstation

A few of you who have cases open with VMware support already received this request through that channel, but I'd love to get a bit more feedback. Thanks!

View solution in original post

DEMdev · ‎12-13-2018

Hi Sabian0309,

Thank you for the very relevant log snippets. UEM is hardly doing anything between the "first" and "second" log line in these highlighted pairs; it's just reading a .INI file from the configuration share.

We occasionally see delays like this, and typically it seems to point to some storage (or connectivity to storage) issue, or anti-malware software getting in the way. For the latter, there's this post on the Tech Zone. For the former, I do not have anything more concrete...

Is this issue also reproducible within the session, for instance when you perform a DirectFlex refresh? If so, maybe a ProcMon or Windows Performance Recorder trace can shed some light on this.

Sabian0309 · ‎12-13-2018

Oddly enough, i have yet to see it happen with direct flex. I've spent a bit of time today launching apps that have a directflex configuration, however it never ceases to launch almost immediately.

This isn't too big of an environment, our VDI may hit 125 concurrent users. The latency and network utilization for the san is low for all datastores. The compellent is connected via 4X10GB fiber ports to a 2 Force10 switches in the back of our m1000e chassis.

We use McAfee move for our VDI and have setup all of the exceptions, I've actually read the article you've referenced. While i was fixing this environment i was crawling this forum and googlefuing a bit .

I'm going to get some time stamps and start looking through the logs on the hosts themselves and see if i can glean anything from it. The behavior is odd to say the least, and there has to be something somewhere that can point me in the right direction.

Appreciate your feedback though.

Sabian0309 · ‎12-13-2018

So a bit of an odd thing...

So when i connect to a VM that is already up and running i don't seem to see an issue.

However if it is freshly spun up i see the following:

Flex log-

2018-12-13 16:09:38.048 [DEBUG] ImportRegistry::Import: Calling '"C:\Windows\REGEDIT.EXE" /S "C:\Users\xxxxxx~1\AppData\Local\Temp\FLX16A1.tmp"' (RPAL: l=0 (D/E), r=0)

2018-12-13 16:09:47.594 [DEBUG] Read 20 entries from profile archive (size: 49135; compressed: 8277; took 9557 ms; largest file: 46826 bytes; slowest import took 0 ms)

2018-12-13 16:09:47.594 [DEBUG] Conditions: Check for endpoint name = true ('LJ2359H2' is equal to 'LJ2359H2')

So there is a delay from 16:09:38 -16:09-47. Now when i checked the host:

vmkernel

2018-12-13T21:09:33.960Z cpu17:1943395)Net: 2324: connected Test05 eth0 to Virtual Machine Network, portID 0x40002f6

2018-12-13T21:09:33.963Z cpu17:1943395)Vmxnet3: 17293: Disable Rx queuing; queue size 512 is larger than Vmxnet3RxQueueLimit limit of 64.

2018-12-13T21:09:33.964Z cpu17:1943395)Vmxnet3: 17651: Using default queue delivery for vmxnet3 for port 0x40002f6

2018-12-13T21:09:33.964Z cpu17:1943395)NetPort: 1662: enabled port 0x40002f6 with mac 00:50:56:8f:b2:43

2018-12-13T21:09:34.225Z cpu28:1943395)Hbr: 3489: Migration end received (worldID=2139992) (migrateType=7) (event=1) (isSource=0) (sharedConfig=0)

2018-12-13T21:09:34.226Z cpu1:74341)Config: 706: "SIOControlFlag2" = 0, Old Value: 1, (Status: 0x0)

2018-12-13T21:09:38.307Z cpu22:1943398)NetPort: 1881: disabled port 0x40002f6

2018-12-13T21:09:40.496Z cpu19:1943399)Vmxnet3: 17293: Disable Rx queuing; queue size 512 is larger than Vmxnet3RxQueueLimit limit of 64.

2018-12-13T21:09:40.496Z cpu19:1943399)Vmxnet3: 17651: Using default queue delivery for vmxnet3 for port 0x40002f6

2018-12-13T21:09:40.496Z cpu19:1943399)NetPort: 1662: enabled port 0x40002f6 with mac 00:50:56:8f:b2:43

Please note that this is 5 hours ahead, so 21:09:38 is actually 16:09:38. The port the VM is connected to is disabled, then enabled a short time later. I don't think this is a coincidence as the timing is dead on. Not to mention the fact that the majority of these delays are about the same time... but i digress.

I've opened a case with vmware regarding this, and I'll see where that goes. When i have more time tomorrow I'll test through and see if this is what is causing the issue as i don't like a single example .

LukaszDziwisz · ‎12-13-2018

I'm seeing exactly the same issue in our environment. We are Floating Instant Clones with Horizon 7.6, vsphere 6.7, UEM 9.5 and AppVolumes 2.14.2, all flash storage, Windows 10 LTSB 1607. Let me know if you don't mind what is the result of the VMware call

schepp · ‎12-14-2018

Hey,

got the same issue.

Horizon 7.6, vsphere 6.7 U1, App Volumes 2.14.2 UEM 9.5

Windows 10 1709

All Flash Dell Compellent Storages

DEMdev · ‎12-14-2018

Hi Sabian0309,

2018-12-13T21:09:38.307Z cpu22:1943398)NetPort: 1881: disabled port 0x40002f6

2018-12-13T21:09:40.496Z cpu19:1943399)Vmxnet3: 17293: Disable Rx queuing; queue size 512 is larger than Vmxnet3RxQueueLimit limit of 64.

2018-12-13T21:09:40.496Z cpu19:1943399)Vmxnet3: 17651: Using default queue delivery for vmxnet3 for port 0x40002f6

2018-12-13T21:09:40.496Z cpu19:1943399)NetPort: 1662: enabled port 0x40002f6 with mac 00:50:56:8f:b2:43

That is very interesting, indeed! Would you mind (privately) sharing the SR number so I can keep an eye on this?

Sabian0309 · ‎01-18-2019

Its been a while, but i'll give an update.

I've spend a bit of time on the phone with the ESX support team going through trying to capture what is causing this. The host isn't logging any issue, and its happening before tools is running so there isn't anything there.

I really don't think its an ESX problem. But now i'm curious on the side of UEM and where the user archives are hosted. For those who have this issue, where do your archives reside?

LukaszDziwisz · ‎01-18-2019

My archives reside on the UEM Manager server. I created UEM share out there that consists of UEMConfig and UEMShare. So pretty much the location is as follows \\domain.com\uem\uemshare\%username%\Archives

Sabian0309 · ‎01-18-2019

What OS is your management server running on?

robsisk1972 · ‎01-22-2019

I too experience the same 30 second delay while the UEM does nothing. I also use Mcafee Move as our agent-less scanner. I have yet to put our exclusions in but will soon. Of the several people whom are having this issue, how many are using Mcafee Move? Hoping to find a pattern that may lead to a resolution.

Sabian0309 · ‎01-22-2019

We run McAfee Move as well. We've set up all the exclusions and even involved McAfee. While i don't think that is the cause, i won't say don't look into it. From my testing the delay only occurs when UEM is running on the VM. I'm leaning towards an issue with the VM communicating with the windows fileshare that has the data.

After a long while talking with ESX support, i don't think its a hypervisor issue.

LukaszDziwisz · ‎01-22-2019

Windows Server 2016 DC

DEMdev · ‎01-23-2019

Sabian0309, robsisk1972,

Do you only experience this delay at logon, or does it also happen during the session (while performing a DirectFlex refresh or UEM refresh)? The reason I'm asking is that in the latter scenario, it will probably be somewhat easier to get insight into what's going on through a ProcMon or WPR trace.

Sabian0309 · ‎01-23-2019

I've read through a lot of flex logs and never seen a direct flex process take more than a hundred or so MS. I've launched a bunch of our applications that also run directflex and never had that sort of delay. It is just odd how the delay is almost always 19 seconds.

DEMdev · ‎01-23-2019

Hi Sabian0309,

I meant a DirectFlex refresh (i.e. running "FlexEngine.exe -DirectFlexRefresh") rather than a DirectFlex import or export. That performs some of the config reading and evaluation that happens when UEM runs during logon.

Those 19 seconds are indeed weird, and (based on a screenshot I saw in an internal email thread about your ESX case) if it occurs between logging "Read 12 entries from profile archive..." and "Skipping retired config file ..." that means that just reading a tiny .INI file took all that time. That strongly points to something in the infrastructure, and if we can reproduce the delay while ProcMon or WPR is running, that would provide more info on what exactly is going on during that time.

robsisk1972 · ‎01-23-2019

I'll try and find some time later this week to test if the 30 second delay occurs on a directflex refresh. Busy with Office 365 pro plus now. If need be, I can take down the Mcafee Move appliances in my environment and test as I am not yet in production. Getting really close though.

Sabian0309 · ‎01-23-2019

Running a direct flex refresh did not show any issue after 500+ runs (i make a simple bat file to run it 50 times then checked the log).

If i do a full refresh (flexengine.exe -r) I see times varying between 19 seconds and 30+, and can do a procmon while this happens. Anything specific to look for?

DEMdev · ‎01-24-2019

Hi Sabian0309,

If you can capture a ProcMon trace of a FlexEngine -r run where the FlexEngine log shows the delay, that would be great. Please upload the full .PML file and the corresponding FlexEngine log at log level DEBUG to your SR ending with 6512.

Ray_handels · ‎01-24-2019

Okay, maybe I'm waaaay of here but I'll try to chime in my idea. To be honest we are not seeing this right now, we did have a some users that had slow logon times and a black screen after login. Looking at the firewall thing and seeing that everyone is using Appvolumes it might be an issue with that.

What appvolumes does during logon is that it restarts the firewall service and makes an export of the firewall in a config file, adds specific rules from appstacks and injects that new policy within the virtual machine. In a healthy enviornment this takes up to 20 seconds to execute (the 20 second gap you guys see), in a very unhealthy enviornment it can take up to 3 minutes leaving the user with a black screen at logon.

Could you please try and check the following file C:\Program Files (x86)\CloudVolumes\Agent\Logs\svservice.log? (You will also see the cvfirewall.cfg file that hgas been exported during login)

And then search for the following lines?

As you can see there is a 28 second gap here in my environment. Could you check and see if timestamps are the same??

[2019-01-24 07:08:42.063 UTC] [svservice:P1700:T4400] RunCommandLineFromService: Launching CommandLine C:\Windows\system32\cmd.exe /c del /q /f "C:\Program Files (x86)\CloudVolumes\Agent\Logs\cvfirewall.cfg" 2>NUL && netsh advfirewall export "C:\Program Files (x86)\CloudVolumes\Agent\Logs\cvfirewall.cfg" && netsh advfirewall import "C:\Program Files (x86)\CloudVolumes\Agent\Logs\cvfirewall.cfg" (CreationFlag 134217760, wait -1 ms)

[2019-01-24 07:08:42.064 UTC] [svservice:P1700:T3768] User has not configured for checking of Windows Update Service status

[2019-01-24 07:08:42.068 UTC] [svservice:P1700:T4400] Successfully launched: C:\Windows\system32\cmd.exe /c del /q /f "C:\Program Files (x86)\CloudVolumes\Agent\Logs\cvfirewall.cfg" 2>NUL && netsh advfirewall export "C:\Program Files (x86)\CloudVolumes\Agent\Logs\cvfirewall.cfg" && netsh advfirewall import "C:\Program Files (x86)\CloudVolumes\Agent\Logs\cvfirewall.cfg" (wait -1 ms)

[2019-01-24 07:09:10.125 UTC] [svservice:P1700:T5232] OnStartShell called (Session ID 1, Handle 000001D7887C4F90, Params 0000005BD92FECC8, Context 0000000000000000)