beachITguy
Contributor

Users randomly getting disconnected

Hello,
Users in my environment are being disconnected from their VDI desktops at random.
I have experienced this myself, so I can describe what happens to my session when I get disconnected.

I am currently using Horizon View Client
Version: 2111
Build: 8.4.0 (189968194)

A disconnect happened to me this morning. The only applications I had open were Edge, Word, MS Teams, and Notepad++.

I was typing in Notepad++ and the screen just froze: it no longer accepted input and I could not change screens.
It stays this way for about 30 seconds to a minute, and then I get booted to the View client screen to log back in to the desktop pool.
I am able to get back in to the desktop after about a minute or two, and it logs me back in to the exact same desktop with all the windows I had open still open.

I am the network engineer, and I can confirm that we have the necessary ports open on the FW to allow traffic through. This started happening about two weeks ago; before that, everything worked normally.

I am not in charge of the configuration of the connection servers or the desktops so I would not be able to answer any questions related to those.

This happens randomly to random people throughout the day. We cannot replicate it at all, which is probably one of the most frustrating aspects of troubleshooting this.

I have gotten the pcoip_server logs from the VDI desktop I am logged in to and have attached them here. I had to remove/replace some info that was specific to my org (FQDN and IPs), but nothing else was redacted.

Any help anyone can offer would be appreciated.

12 Replies
nimzobob
Contributor

I am seeing the same thing with the same version.

Found this in one of the logs - none of the other logs show anything as far as I can tell.

vmware-crtbora-9076.log

2022-03-10T10:11:59.464Z In(05) crtbora crt::common::MKS::GetConnectionStateReason(): remote mks disconnect reason code is 29.
2022-03-10T10:11:59.464Z In(05) crtbora crt::common::MKS::SetConnectionState: MKS connection state changes from 2 to 1.
2022-03-10T10:11:59.464Z In(05) crtbora crt::common::MKS::GetConnectionStateReason(): remote mks disconnect reason code is 29.
2022-03-10T10:11:59.469Z In(05) crtbora crt::common::MKS::OnConnectionStateChanged: remote mks set disconnect reason 29, so attempting to reconnect with retry count = 1 and duration = 2 sec.
2022-03-10T10:11:59.484Z In(05) crtbora crt::win32::MainMKSWindow::SetLockedDPI: Customized DPI :0 is set.
2022-03-10T10:11:59.485Z In(05) crtbora crt::win32::MainMKSWindow::SetLockedDPI: Customized DPI :0 is set.
2022-03-10T10:12:04.567Z In(05) crtbora crt::common::MKS::GetConnectionStateReason(): remote mks disconnect reason code is 29.
2022-03-10T10:12:04.567Z In(05) crtbora crt::common::MKS::SetConnectionState: MKS connection state changes from 1 to 1.
2022-03-10T10:12:04.567Z In(05) crtbora crt::common::MKS::OnConnectionStateChanged: remote mks set disconnect reason 29, so attempting to reconnect with retry count = 2 and duration = 4 sec.
2022-03-10T10:12:11.674Z In(05) crtbora crt::common::MKS::GetConnectionStateReason(): remote mks disconnect reason code is 29.
2022-03-10T10:12:11.674Z In(05) crtbora crt::common::MKS::SetConnectionState: MKS connection state changes from 1 to 1.
2022-03-10T10:12:11.674Z In(05) crtbora crt::common::MKS::OnConnectionStateChanged: remote mks set disconnect reason 29, so attempting to reconnect with retry count = 3 and duration = 8 sec.
2022-03-10T10:12:17.812Z In(05) crtbora crt::win32::MainMKSWindow::SetLockedDPI: Customized DPI :0 is set.
2022-03-10T10:12:22.772Z In(05) crtbora crt::common::MKS::GetConnectionStateReason(): remote mks disconnect reason code is 29.
2022-03-10T10:12:22.772Z In(05) crtbora crt::common::MKS::SetConnectionState: MKS connection state changes from 1 to 1.
2022-03-10T10:12:22.772Z In(05) crtbora crt::common::MKS::OnConnectionStateChanged: remote mks set disconnect reason 29, so attempting to reconnect with retry count = 4 and duration = 8 sec.
2022-03-10T10:12:32.795Z In(05) crtbora crt::win32::MainMKSWindow::SetLockedDPI: Customized DPI :0 is set.
2022-03-10T10:12:33.878Z In(05) crtbora crt::common::MKS::GetConnectionStateReason(): remote mks disconnect reason code is 29.
2022-03-10T10:12:33.878Z In(05) crtbora crt::common::MKS::SetConnectionState: MKS connection state changes from 1 to 1.
2022-03-10T10:12:33.878Z In(05) crtbora crt::common::MKS::OnConnectionStateChanged: remote mks set disconnect reason 29, so attempting to reconnect with retry count = 5 and duration = 8 sec.
2022-03-10T10:12:37.838Z In(05) crtbora crt::win32::MainMKSWindow::SetLockedDPI: Customized DPI :0 is set.
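
To tally these reconnect attempts across a longer client log, a minimal Python sketch along these lines may help. It assumes the line format shown in the excerpt above; the regex and the `parse_retries` name are my own illustration and not part of any VMware tooling, and it does not interpret what reason code 29 means.

```python
import re
from datetime import datetime

# Pattern inferred from the log excerpt above (an assumption, not a documented format).
RETRY_RE = re.compile(
    r"^(?P<ts>\S+) .*remote mks set disconnect reason (?P<reason>\d+), "
    r"so attempting to reconnect with retry count = (?P<retry>\d+) "
    r"and duration = (?P<duration>\d+) sec"
)


def parse_retries(lines):
    """Return (timestamp, reason, retry_count, duration_sec) per reconnect attempt."""
    events = []
    for line in lines:
        m = RETRY_RE.search(line.strip())
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%fZ")
            events.append((ts, int(m.group("reason")),
                           int(m.group("retry")), int(m.group("duration"))))
    return events


# Three lines copied from the excerpt above; the first is not a retry line.
sample = [
    "2022-03-10T10:11:59.464Z In(05) crtbora crt::common::MKS::SetConnectionState: MKS connection state changes from 2 to 1.",
    "2022-03-10T10:11:59.469Z In(05) crtbora crt::common::MKS::OnConnectionStateChanged: remote mks set disconnect reason 29, so attempting to reconnect with retry count = 1 and duration = 2 sec.",
    "2022-03-10T10:12:04.567Z In(05) crtbora crt::common::MKS::OnConnectionStateChanged: remote mks set disconnect reason 29, so attempting to reconnect with retry count = 2 and duration = 4 sec.",
]

events = parse_retries(sample)
for ts, reason, retry, duration in events:
    print(ts.isoformat(), "reason", reason, "retry", retry, "wait", duration, "s")
```

Feeding it the whole file (for example `parse_retries(open(path, encoding="utf-8", errors="replace"))`) gives one row per attempt, which makes it easier to line the client-side retries up against timestamps in the agent logs.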

kvmw2130
VMware Employee

The logs are from the 1st of Feb. I see there were network drops, and the ping timer expired, causing the disconnect:

Ping Timer Expiry

Line 886: 2022-02-01T07:38:58.590-05:00> LVL:1 RC: 0 SERVER :InputDevTap_GetKeyboardState @ timer: LEDs = 0x00 ==> 0x02
Line 1680: 2022-02-01T08:04:51.322-05:00> LVL:2 RC:-500 MGMT_IMG :Imaging Timer expiry.
Line 1696: 2022-02-01T08:05:19.489-05:00> LVL:1 RC:-504 MGMT_PCOIP_DATA :Unable to communicate with peer on PCoIP media channels (data manager ping timer expired)

Network Drops: 

Line 1352: 2022-02-01T07:50:52.085-05:00> LVL:1 RC: 0 VGMAC :Stat frms: R=000000/000000/020468 T=002160/029226/007581 (A/I/O) Loss=0.00%/0.17% (R/T)
Line 1379: 2022-02-01T07:51:52.348-05:00> LVL:1 RC: 0 VGMAC :Stat frms: R=000000/000000/022954 T=002160/032741/008292 (A/I/O) Loss=0.00%/0.09% (R/T)
Line 1414: 2022-02-01T07:52:52.563-05:00> LVL:1 RC: 0 VGMAC :Stat frms: R=000000/000000/025104 T=002160/035998/008817 (A/I/O) Loss=0.00%/0.74% (R/T)

Line 1618: 2022-02-01T08:01:53.461-05:00> LVL:1 RC: 0 VGMAC :Stat frms: R=000000/000000/043349 T=002445/053790/014877 (A/I/O) Loss=0.00%/0.10% (R/T)
Line 1639: 2022-02-01T08:02:53.671-05:00> LVL:1 RC: 0 VGMAC :Stat frms: R=000000/000000/045438 T=002445/058048/015507 (A/I/O) Loss=0.00%/0.06% (R/T)
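
If it helps to see the loss figures for a whole session rather than a few hand-picked lines, here is a rough Python sketch that pulls the `Loss=` values out of the VGMAC stat lines. The regex is inferred from the lines quoted above, the `loss_samples` name is hypothetical, and reading `(R/T)` as receive/transmit is an assumption based on the label in the log.

```python
import re

# Pattern inferred from the VGMAC lines quoted above (an assumption about the format).
LOSS_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T[\d:.]+[+-]\d{2}:\d{2})>.*VGMAC\s*:Stat frms:.*"
    r"Loss=(?P<rx>[\d.]+)%/(?P<tx>[\d.]+)%\s*\(R/T\)"
)


def loss_samples(lines):
    """Return (timestamp, first_loss_pct, second_loss_pct) for each VGMAC stat line."""
    out = []
    for line in lines:
        m = LOSS_RE.search(line)
        if m:
            out.append((m.group("ts"), float(m.group("rx")), float(m.group("tx"))))
    return out


# Lines copied from the excerpt above.
sample = [
    "Line 1352: 2022-02-01T07:50:52.085-05:00> LVL:1 RC: 0 VGMAC :Stat frms: R=000000/000000/020468 T=002160/029226/007581 (A/I/O) Loss=0.00%/0.17% (R/T)",
    "Line 1379: 2022-02-01T07:51:52.348-05:00> LVL:1 RC: 0 VGMAC :Stat frms: R=000000/000000/022954 T=002160/032741/008292 (A/I/O) Loss=0.00%/0.09% (R/T)",
    "Line 1414: 2022-02-01T07:52:52.563-05:00> LVL:1 RC: 0 VGMAC :Stat frms: R=000000/000000/025104 T=002160/035998/008817 (A/I/O) Loss=0.00%/0.74% (R/T)",
]

samples = loss_samples(sample)
peak = max(samples, key=lambda s: s[2])  # sample with the highest second (T) value
print("samples:", len(samples))
print("peak T loss:", peak[2], "% at", peak[0])
```

Plotting or sorting the full series should show whether the loss rises in the minutes before each disconnect or stays flat.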

beachITguy
Contributor

Thank you for the reply.

Is there anything that can be done to correct this?
As I have said, this only happens to a few people when there are multiple people connected, and there is no way for me to replicate the issue. I have looked at our network configs, and there does not appear to be anything we can change on the network gear that would correct this.

TechMassey
Hot Shot

Based on the logs, you have two very nice 4k monitors ;). 

Due to Log4J, many companies, including my own, had to rush to Horizon 2111. The first issues we encountered were graphical in nature, typically due to older Horizon 5.x clients.

I agree that this isn't a networking issue; the PCoIP logs indicate no high RTT latency or packet loss. The behavior, though, can indicate that the VM itself is freezing in vSphere, due to either a large VM CPU spike or constrained vSphere cluster resources.

However, I actually faced this exact issue a few months ago. New versions of Horizon and the Horizon Client just don't offer any love for multiple 4k monitors. In the logs, you will see multiple entries for "unsupported display types/resolution." As a workaround, uninstall Horizon Client 2111 and drop in 2103.

Should be smooth going from there unless it is resource constraints in the datacenter.


Please help out! If you find this post helpful and/or the correct answer, mark it! It helps recognize contributions to the VMTN community and, well, me too 🙂
beachITguy
Contributor

Thank you for the reply.

I will have a select few repeat offenders downgrade their client and test, and will let you know in a few days.

beachITguy
Contributor

One of my users who downgraded their client just got back to me to say that they were just disconnected.

I got their pcoip_server log file and have attached it. Again, I only redacted the FQDN and IP.

Circling back to what you said about resource constraints: how would I go about finding that out? As I said, I do not have access to the VMware server or connection server, so I would have to relay this information to the team that does. They say that the servers are configured properly and that it is not their issue, which is why I am trying to track it down.

Also, circling back to what another person told me in a reply above: a ping timer expired, and that is the reason the session was dropped. Is there a way to increase the timeout? The log entry is below:

2022-03-24T13:44:25.554-04:00> LVL:1 RC:-504 MGMT_PCOIP_DATA :Unable to communicate with peer on PCoIP media channels (data manager ping timer expired)

Is this controlled via settings on the server itself, or is it network related (switch, FW, router)?
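
For reference, this is roughly how the ping-timer expiry entries can be pulled out of a pcoip_server log so their times can be compared across users. It is only a sketch: the regex is inferred from the two entries quoted in this thread, and the `ping_timer_expiries` name is hypothetical. It does not change or read any timeout setting.

```python
import re
from datetime import datetime

# Pattern inferred from the RC:-504 entries quoted in this thread (an assumption).
EXPIRY_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{2}:\d{2})>"
    r".*RC:\s*-504\b.*ping timer expired"
)


def ping_timer_expiries(lines):
    """Return timezone-aware datetimes of each 'ping timer expired' entry."""
    return [datetime.fromisoformat(m.group("ts"))
            for m in map(EXPIRY_RE.search, lines) if m]


# Entries copied from this thread; the first (RC:-500) should be ignored.
sample = [
    "Line 1680: 2022-02-01T08:04:51.322-05:00> LVL:2 RC:-500 MGMT_IMG :Imaging Timer expiry.",
    "Line 1696: 2022-02-01T08:05:19.489-05:00> LVL:1 RC:-504 MGMT_PCOIP_DATA :Unable to communicate with peer on PCoIP media channels (data manager ping timer expired)",
    "2022-03-24T13:44:25.554-04:00> LVL:1 RC:-504 MGMT_PCOIP_DATA :Unable to communicate with peer on PCoIP media channels (data manager ping timer expired)",
]

expiries = ping_timer_expiries(sample)
for ts in expiries:
    print("ping timer expired at", ts.isoformat())
```

Because the timestamps keep their UTC offset, they can be compared directly against events recorded in other time zones.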

TechMassey
Hot Shot

It is unfortunate that the vSphere team won't at least provide an exported PNG graph of the cluster or virtual machine. One thing you can do is use perfmon to record basic resource metrics on the VM.

It is also unfortunate that the slightly older 2103 client did not help. The issue impacted both PCoIP and Blast in my recent experience.

I'm not familiar with the timeout setting, but as an alternate test you could try a non-Teradici device, if allowed, such as a company workstation/laptop with the Horizon Client or HTML Access.

One last item: there are valuable logs located on both the connection server and the VDI desktop. They are listed in this link and are great for correlating timestamps with the client logs.

VMware Horizon Client Log Locations - Location of Horizon (VDM) log files (1027744) (vmware.com)

One final item: I'm investigating additional symptoms around this issue that have occurred in the last 24 hours. It may be the same issue or a new variation; I'll post back here.


beachITguy
Contributor

Unfortunately, we are not able to try a non-Teradici device in the org.

I was able to get the debug logs from the connection server, though, and I have attached them here. Again, I have redacted the IP and FQDN; nothing else has been modified.

I want to thank you for the help you have given me, and I look forward to trying to figure this out. If there is anything else I can do or try, I am open to suggestions.

IboIboIbo
Contributor

We have a similar issue. Did you figure out the cause or find a solution?

skocatt
Contributor

We have an identical issue in our environment, and so far even VMware has not been able to find the root cause.

Have you guys found anything? Is there anything we could try?

Thanks

skocatt
Contributor

Guys, did any of you find a solution?

jmacdaddy
Enthusiast

Any chance that DRS is vMotioning the desktops and the stun duration is long enough to cause a disconnect? I have seen this in a number of my Horizon deployments. You should be able to check the VM's logs in vCenter and see whether a migration is occurring at the same time the user reports the disconnection.
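
Once you have both sets of timestamps, the comparison itself is easy to script. Below is a minimal Python sketch: the disconnect time is the one from the pcoip_server log posted earlier in this thread, but the migration times are HYPOTHETICAL placeholders made up to exercise the function, and the 2-minute window and the `near_migrations` name are arbitrary choices of mine. Real migration times would have to come from the VM's events in vCenter.

```python
from datetime import datetime, timedelta


def near_migrations(disconnects, migrations, window=timedelta(minutes=2)):
    """Pair each disconnect with any migration event within +/- window."""
    return [(d, m) for d in disconnects for m in migrations if abs(d - m) <= window]


# Disconnect time taken from the pcoip_server log earlier in this thread.
disconnects = [datetime.fromisoformat("2022-03-24T13:44:25.554-04:00")]

# HYPOTHETICAL migration times, invented purely for illustration.
# Replace with the times of real migration events for the affected VM.
migrations = [
    datetime.fromisoformat("2022-03-24T17:43:40+00:00"),
    datetime.fromisoformat("2022-03-24T09:00:00+00:00"),
]

matches = near_migrations(disconnects, migrations)
for d, m in matches:
    print(d.isoformat(), "is within the window of migration at", m.isoformat())
```

Using timezone-aware datetimes matters here, since the desktop logs carry a local offset and other sources may record times in UTC.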
