Solved: Re: UEM hanging/delaying randomly. - Page 2

Sabian0309 · ‎12-13-2018

Running horizon 7.6 with UEM 9.5 and appvol 2.14. Parent is windows 10 1607 LTSB, floating instant clones. App data is not redirected, documents/favorites/desktop/etc.. are.

Occasionally i am seeing login delays when processing UEM. On average UEM processes between 5-6 seconds. However we have some users that it takes 25+ seconds occasionally with UEM not reporting why.

Example from logs:

2018-12-12 07:57:37.331 [INFO ] Config file '\\jfv-vm-fs2\UEM Configuration\general\Applications\Acrobat Reader.INI' added to DirectFlex cache

2018-12-12 07:57:56.269 [INFO ] Config file '\\jfv-vm-fs2\UEM Configuration\general\Applications\Adobe Acrobat.INI' added to DirectFlex cache

2018-12-12 08:39:13.444 [DEBUG] ImportRegistry::Import: Calling '"C:\Windows\REGEDIT.EXE" /S "C:\Users\test\AppData\Local\Temp\FLX6AB5.tmp"' (RPAL: l=0 (D/E), r=0)

2018-12-12 08:39:13.490 [DEBUG] Read 3 entries from profile archive (size: 151258; compressed: 31466; took 64 ms; largest file: 83810 bytes; slowest import took 0 ms)

2018-12-12 08:39:32.428 [DEBUG] Conditions: Condition set 'Microsoft Office 2013.xml' was previously evaluated to true

2018-12-12 14:02:29.110 [DEBUG] Read 25 entries from profile archive (size: 63987712; compressed: 7889959; took 427 ms; largest file: 26738688 bytes; slowest import took 154 ms)

2018-12-12 14:02:48.063 [DEBUG] Conditions: Check for endpoint name = false ('C81MK6V1' is not equal to 'LJ2359H2')

I can reproduce this by logging in over and over. At times it happens, at times it does not although frequently enough it is easily reproducible. It does not appear to be related to any specific pool or UEM setting. It also isn't tied to any specific hosts, users, or type of client (thin or laptop), or version of horizon.

The UEM data is stored on a windows 2008 R2 server file share, which is a VM running off an all flash datastore on a Dell Compellent. There are no indications the Compellent it is struggling to server the data, or on the file server. This is as likely to happen during the morning login as it is when people leave for the day.

This appears to of been going on for a long while, this environment is just now getting healthy enough from the previous admin to where smaller issues like this can be tracked down. But my googlefu is failing to find any possible solutions for this.

Any help would be appreciated.

Thank you,

Billy

Sabian0309 · ‎01-24-2019

Ray,

I do not see any firewall events within our svservice.log, we also have the firewall disabled on our master and appvolumes parent images and do not have a cvfirewall.cfg I do see the same empty space, however it happens here:

(flexengine)

2019-01-24 08:36:06.106 [INFO ] Importing profile archive 'Mozilla Firefox.zip' (\\jfv-vm-fs2\UEM Profiles\bbombatch\archives\Applications\Mozilla Firefox.zip)
2019-01-24 08:36:06.700 [DEBUG] Read 252 entries from profile archive (size: 24769993; compressed: 7937162; took 586 ms; largest file: 6180184 bytes; slowest import took 51 ms)
2019-01-24 08:36:25.622 [INFO ] No profile archive to import for config file '\\jfv-vm-fs2\UEM Configuration\general\Applications\MP2.INI'
2019-01-24 08:36:25.638 [INFO ] Config file '\\jfv-vm-fs2\UEM Configuration\general\Applications\Notepad.INI' added to DirectFlex cache

(svservice)

[2019-01-24 13:36:02.731 UTC] [svservice:P1252:T2420] All volumes finished processing
[2019-01-24 13:36:02.731 UTC] [svservice:P1252:T2420] User has not configured for checking of VM persistence
[2019-01-24 13:36:02.731 UTC] [svservice:P1252:T4548] User has not configured for checking of Windows Update Service status
[2019-01-24 13:36:29.017 UTC] [svservice:P1252:T3176] OnStartShell called (Session ID 1, Handle 0000017143319C00, Params 0000003FDD0FE518, Context 0000000000000000)
[2019-01-24 13:36:29.017 UTC] [svservice:P1252:T3176] OnStartShell: APOLLO\XXXXXXXX (NameSamCompatible)

I've actually ran through and did some benchmark testing of logins to try and support this is a UEM related issue and not appvol or esxi. I did 29 logins with just horizon, horizon+uem, horizon+uem+appvol and horizon+appvol. The only place i see large inconsistencies are with UEM involved:

Agent	UEM	UEM+APP Vol	App vol
12	38	49	26
13	17	27	25
14	18	30	26
14	19	36	33
14	19	49	23
14	23	51	22
14	31	50	26
14	33	53	25
14	35	37	24
14	36	33	25
14	37	44	22
14	37	47	26
14	38	31	22
14	38	52	24
14	41	30	30
14	41	34	25
15	19	31	24
15	23	30	24
15	38	29	25
15	38	47	24
15	49	29	25
16	35	41	26
16	40	30	23
16	41	29	26
17	17	39	23
17	20	50	25
17	36	53	25
14	36	36	24
15	38	46	28

But regardless i appreciate the information

Sabian0309 · ‎01-24-2019

I've uploaded a zip to the SR with an example of a delay and what looks to be regular. Thank you sir.

DEMdev · ‎01-24-2019

Thanks, Sabian0309, I've received the files from support and am looking at them now.

Edit 1: Intermediate analysis result:

Just opening that INI file (which is a single call from the UEM agent to the Win32 CreateFile() API) takes 18.9s... We might need to get a Windows Performance Recorder trace to dive deeper into what's consuming this time, but I'll continue my ProcMon-based research for a little while longer first.

Edit 2: No luck in finding more details in the ProcMon log on where those 18.9s go... Is there anything you can see on the file server side that might explain this?

Sabian0309 · ‎01-28-2019

Sadly no. I've watched our file server without any issue. Procmon doesn't have any spikes in anything relevant. I actually made a script that ran on a VM copying 100 little ini files to the uem share, read a few, then copy them back to the VM locally and delete. I let that run for almost 4 hours without any delay being produced.

I'm open to suggestions on trying to replicate the issue outside of UEM.

DEMdev · ‎01-29-2019

Thanks, Sabian0309, support informed me about those tests – thank you for running them.

I'm currently looking into the ETW trace you provided, to see if I can find out more about what's going on in Windows during the delay. Let's run this through support channels, and then we can update this thread once we know more.

Serg01 · ‎01-29-2019

I have the same issue.

(disabled port 0x2000031

Disable Rx queuing; queue size 512 is larger than Vmxnet3RxQueueLimit limit of 64.

enabled port 0x2000031)

I found discussion on reddit:

"So it seems that ESXi 6.5 contains undocumented networking feature "Burst Queue". If burst queuing is enabled it allows vNic RX Ring to overflow to this burst queue to avoid packet drops.

ethernetX.rxBurstQueueLength sets burst queue size in actual bytes. For some reason "Disable Rx queuing" message goes away related to this setting. To test this I set ethernet1 and ethernet2.rxBurstQueueLength value to 8192 for my home lab firewall and I saturated FW interfaces by flooding it with ICMP, UDP and TCP using nping. By adding some CPU Limits to FW VM I was able to generate some RX Ring overflowing. Instead of packet drops I saw packets going into rx burst queue and from there back to normal queue. Not a single packet drop was shown by vsish so burst queue seems to work very well."

Reddit - vmware - Received packet drops (Guest)

hschimpf · ‎01-31-2019

We are facing the same issue. We have 10-20 second gaps where it seems that Nothing at all is Happening. The tool you screenshotted Looks helpful for analyzing this issue. Can you tell me which tool this is?

DEMdev · ‎01-31-2019

Hi hschimpf,

That's a screenshot of SysInternals Process Monitor. It can be used to collect a trace at logon, but that's tricky to do in non-persistent environments. However, if you can reproduce the delay during a session, you can just run ProcMon interactively.

robsisk1972 · ‎01-31-2019

Sabian0309 Thanks for sticking with this. I am eagerly awaiting a solution.

Thanks

Rob

Rdiaz29 · ‎02-21-2019

I am currently on Horizon View 7.7, UEM 9.6 and Instant Clones. We are also experiencing slow logins.The problem seems to be UEM related sometimes as indicated by the UEM "FlexEngine.log" and by the Group Policy log. According to the UEM log, the "VMware UEM FlexEngine Extension Processing" sometimes takes over 90 seconds for us. That same time is reported in the group policy log in Event Viewer (EventID #5016). With DEBUG enabled for UEM logging, the logs do have the time gaps that explain the delay, but nothing that explains why. Sometimes the GPO logs say that Folder Redirection takes a long time to process instead of UEM. I am unable to upload log files since I am on an isolated system that does not allow export of logs.

I have played around with Windows 10 1803 and 1809, View 7.5.1 and 7.7, stripping out all GPOs except for the one that applies UEM policy and Folder Redirection. I have also tried a Windows File Server as opposed to our production Netapp storage array (which is all flash). I have another VDI using the exact image, GPOs, Horizon View version and UEM version, but I am using Linked Clones. So I decided to try a Linked Clone pool. It appears, the Linked Clone pool does not have this slow login issue. Only the Instant Clone pool does which can be replicated by logging out of a desktop, and logging into within a few seconds after.

Sabian0309 · ‎03-21-2019

Well, i think that this point that the conversation is dead. I've gone as far as catching the slow down and suspending the VM to send over suspended memory files and have been told that I would need to contact Microsoft. VMware is stating that it is an infrastructure issue, but cannot provide any method to prove it aside from contacting an outside vendor. My manager doesn't want to waste the money opening a support case with Microsoft for something that happens part of the time.

Sorry guys, i was hoping i could find the smoking gun. I'll still poke at this issue in my spare time and if i ever come up with something i'll post it on here. I guess this would be the reason why all other cases with this issue are closed as "Client stopped responding". Guess we all have the same infrastructure issue.

Ray_handels · ‎04-16-2019

Okay so we are seeing the exact same issue now as well with a 30 second gap.

My remark regarding the firewall restart can be safely ignored as I was able to place both log files next to eachother and the moment the firts 30 second gap was there Appvolumes was still processing one of it's appstacks. At the end of the Appvolumes log (which is done within about 30 seconds) i can see the firewall restart service and a gap of more than 2 minutes in which Appvolumes waits for "Windows" for it to release the desktop.

When looking at the UEM log I can see that it is waiting for over 2 minutes for a few very small .ini files

We too are using McAfee Move but are on 9.4 UEM, 2.16 Appvolumes and 6.0 U3 ESXi.

Is anyone still looking into this issue?? Is there an SR that we can refer to??

Here is a small snippit of the log file (changes server names and share)

2019-04-16 12:35:30.084 [DEBUG] Conditions: Check for user membership of group 'DOMAIN\GROUPNAME' = false

2019-04-16 12:35:30.084 [INFO ] Skipping import for config file '\\servername\CONF_SHARE$\General\Applications\Base Applications\Mitel\Mitel.INI' due to conditions

2019-04-16 12:36:02.385 [INFO ] Config file '\\servername\CONF_SHARE$\General\Applications\Base Applications\NetDrive\NetDrive.ini' added to DirectFlex cache

2019-04-16 12:36:37.392 [INFO ] Config file '\\servername\CONF_SHARE$\General\Applications\Base Applications\NetOpVision\NetOpVision.ini' added to DirectFlex cache

2019-04-16 12:36:38.997 [INFO ] Config file '\\servername\CONF_SHARE$\General\Applications\Base Applications\Noldus\FaceReader\FaceReader.ini' added to DirectFlex cache

2019-04-16 12:37:14.005 [INFO ] Config file '\\servername\CONF_SHARE$\General\Applications\Base Applications\Noldus\Observer\Observer.ini' added to DirectFlex cache

As I was looking at the log file something popped my mind, what is the DirectFlex cache? Is this stored on the local machine? If so, could it be that we need to add an exclusion for this in McAfee? I do need to know what location it is written to.

DEMdev · ‎04-16-2019

Hi Ray_handels,

The DirectFlex cache is just a registry value at HCKU\Software\Immidio\Flex Profiles\DirectFlex.

Such long delays are typically caused by timeouts on storage or elsewhere in infrastructure. What type of storage do you use? Also, assuming that this wasn't always the case: any idea what changed that might have caused this?

Ray_handels · ‎04-16-2019

Yes, that's where things get itchy .

We are using HNAS for data storage but we do not manage it ourselves so I have to ask my collegues if they are able to see something there. As always, officially nothing has changed ..

I will try your suggestion to recreate this while logged in and run ProcMon to see what happens.

Any else ever found a solution? Being it UEM or infrastructure?

DEMdev · ‎04-17-2019

Hi Ray_handels,

Yes, if you can reproduce these delays during a session, that might help shed some more light on it, with ProcMon or WPR traces.

Are these delays you're (now) seeing occurring consistently, or is it random?

In the cases we've seen so far, we have not been able to pinpoint a culprit. All we've seen is that very straightforward API calls from UEM to open a file ("plain" I/O calls, leaving it completely up to Windows to talk SMB etc.) can occasionally take a long time. Diving into the stack traces for those calls just leads to SMB and network components seemingly doing their job normally, and then having to wait for the request to be completed.

I'd love to get to the bottom of this, but at this time I don't see anything pointing to UEM...

ijdemes · ‎04-18-2019

Hi UEMdev and @Ray_handels,

I think I am experiencing the same issue. Both at the customer and at home in my lab environment. In my home lab I ruled out anti-virus by disabling it all together on both client and file server. But still receiving strange time gaps of about 20 seconds.

I also tried to go with the "do not use group policy extension" option and run FlexEngine.exe -r from the GPO/Logon part and use the splash screen to see progress. But also there, it is sometimes stuck for a while and then continues.

When I run FlexEngine.exe -r multiple times after that, there is not a single delay.

In my home lab I am running Win10 v1809 LTSC, UEM 9.7 (Quick Start settings implemented only), Horizon 7.8, No App Volumes, Local Profile.

Now, I am not saying this all is caused by UEM, but at least I think something is impacting UEM or SMB actions during, or shortly after logon. It looks as if something is temporarily suspending the FlexEngine process.

I am still investigating and monitoring this thread in the mean time.

\\ Ivan
---
Twitter: @ivandemes
Blog: https://www.ivandemes.com

Ray_handels · ‎04-18-2019

We are also looking into this and, for as far as I can see, UEMDev is really committed in this. My guess is that we will find a solution somehow.

This afternoon we will be looking into this with our storage administrator. What I hope to find out is if he can see when the request for the file action is being send. Hopefully we are able to pinpoint if it is the storage (or SMB or whatever it may be) that is causing the delay or maybe it is UEM in some way.

Looking at your version we can also rule out Appvolumes as the culprit and the Windows 10 version (although W10 is not the best OS Microsoft has ever written but let's not get started on that one ).

The strange thing is that we have heard of a 20 second and a 30 second gap. for us it is 30 seconds and always seems to be 30 seconds. My guess is you are also using VMXNet3 adapter? What version of VMWare Tools are you using and what driver version does the VMXNET nic have? Ours is 1.7.3.8

ijdemes · ‎04-18-2019

I am runnning vSphere 6.7 U1, VMware Tools 10338, vmxnet3 driver version 1.8.3.1.

When it happens, it is around 20 seconds every time.

\\ Ivan
---
Twitter: @ivandemes
Blog: https://www.ivandemes.com

DEMdev · ‎04-18-2019

Hi ijdemes,

I think something is impacting UEM or SMB actions during, or shortly after logon

As much as I don't understand what exactly is going here, that's my hypothesis as well. It seems that somehow the pattern of UEM's Win32 file I/O calls triggers some kind of timeout somewhere. Still completely unsure on whether that is on the agent side, or on the file server side.

It's odd/interesting that running flexengine.exe -r in the session does not result in these delays, as in Sabian0309's case he's able to reproduce it that way as well. Ray_handels: can you repro in the session, or just at logon?

ijdemes: would you be able to collect a ProcMon trace to see if the symptoms of your delays look similar to what we've analyzed so far?

DEMdev · ‎04-18-2019

Ray_handels,

The strange thing is that we have heard of a 20 second and a 30 second gap. for us it is 30 seconds and always seems to be 30 seconds.

In previous reports (some of which might well have been caused by antivirus products), we've mostly seen "nice, round numbers" like 15s and 30s, but also something very specific like 18.9s. That lack of consistency can maybe be explained by different protocol implementations from different vendors, but that's just a guess, as we still have no clue about the root cause of these delays 😞