VMware Horizon Community
DarrenBull
Contributor

PCoIP packet 'loss' after upgrade to 5.1

Hi

A couple of weeks ago we upgraded from View 5.0.1 to 5.1 (and went through the delights of getting all the certificates to play nicely). Since then, however, we have had terrible PCoIP performance through our 5.1 security server.

First, we've been running View for years, and since PCoIP was first introduced we have never had any issue with it. It has always 'just worked' over the WAN. Now, after this latest upgrade, 'static' usage is generally OK, but if I wave a window around the screen, the traffic this creates causes the entire VM to hang for about 20 seconds, and then it 'catches up'.

General network performance for remote clients is good: no packet loss when connecting to our firewall's external interface, and RDP sessions work well. Only PCoIP is affected. I have tested with the Windows 5.2 client, but this affects everyone. Analysing the pcoip_server logs with the parser/log viewer shows that latency is always well within acceptable limits, as is throughput: there is plenty of bandwidth available and no major delay. However, we see massive PCoIP packet loss when we do our 'window waggle', with up to 80% dropped.

Our security server is in a DMZ (obviously), but to eliminate that I built a security server on the LAN and paired it with an internal broker. Performance is the same when connecting from the outside world, so DMZ networking is ruled out. LAN switching is fine, as the same pcoip_server logs from desktops in use locally on the LAN show no loss. So we're looking at the firewall or external networking/routing. However, none of this has changed since the upgrade... why would this suddenly be an issue? Has the protocol changed to such an extent in 5.1 that our firewall now handles it differently? We do have remote sites that come in over IPsec VPN into the same firewall and these also seem OK. It only seems to affect remote connections over SSL.

I have a call logged with VMware, who have passed it on to Teradici, but from what I've seen so far I think I'm going to be told 'it's your network'. Why then did it work without issue before?

If anyone can shed some light on this I would be very, very grateful. It's really getting on my nerves!

13 Replies
DarrenBull
Contributor

P.S. I think the most likely candidate is 'out of order' PCoIP packets arriving at the firewall...but then again, why now and not before?

kurtd
Enthusiast

After we upgraded to ESXi 5.1 and View 5.1, PCoIP stopped working completely; we just kept getting disconnected. I spent hours on the phone with support but ended up figuring out the issue myself. It turned out that, in our environment, PCoIP didn't work with VM hardware version 7. As soon as I upgraded to v9, it started working again. Try upgrading the VM hardware version on one of your virtual desktops. Just right-click on it while it's powered off and you should have the option to upgrade. Not sure if it'll help, but it's worth a try.

DarrenBull
Contributor

Hi, we can't go to version 9... I assume this is an ESXi 5.1 hardware level? We're still on 5.0 U1 and hardware level 8. Anyway, running some iperf tests seems to show major packet loss on our WAN link. Strange how this has only come to light with 5.1, which is supposed to work better over lossy networks, not worse! In any case it's opened a whole new can of network worms and we can concentrate on that instead of PCoIP, although, as I say, this does make me call into question the improvements in the protocol, as this issue was not noticeable at all in 5.0.1.

DarrenBull
Contributor

Hi, it's me again... and guess what: this isn't down to our network...

We have run some iPerf tests from a 'remote' ADSL connection into our Security Server, which sits on the end of a 20 Mbps leased line. The ADSL gives us 800 kbps at most. When running iPerf in 'unrestricted' mode, we see loads of packet loss: it is trying to send data faster than the link can support, so much of it is dropped. When we limit iPerf to 700 kbps, we see zero packet loss. There is nothing wrong with the network; we were just sending UDP faster than the underlying connection can support.
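The arithmetic behind that iPerf result can be sketched as below. This is only a rough steady-state model, not a measurement: it assumes the bottleneck simply drops whatever exceeds its capacity and ignores buffering and protocol overhead. The function name and the offered rates are made up for illustration; only the 800 kbps ceiling and the 700 kbps cap come from the test above.

```python
def expected_udp_loss(offered_kbps: float, link_kbps: float) -> float:
    """Fraction of UDP traffic dropped when the sender offers more than the link carries.

    Simplified model: anything above link capacity is lost, nothing else is.
    """
    if offered_kbps <= 0:
        return 0.0
    return max(0.0, 1.0 - link_kbps / offered_kbps)


LINK_KBPS = 800  # ADSL ceiling reported in the post

# Offered rates other than 700 are hypothetical examples.
for offered in (700, 800, 1600, 4000):
    loss = expected_udp_loss(offered, LINK_KBPS)
    print(f"offered {offered:>5} kbps -> ~{loss:.0%} loss")
```

In this model, capping the sender at or below the link rate gives zero loss, and the loss fraction climbs quickly once the sender overshoots.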

We are now almost sure this is also the problem with PCoIP. If we use pcoip.adm to configure the maximum bandwidth allocation as 700 kbps, guess what: the test user no longer gets frozen for 20 seconds at a time whilst 'major' screen events are occurring. Instead we have a very sluggish, but usable, connection.

So, it seems to us the recent 'improvements' in PCoIP have done nothing for us. Whereas in 5.0.1 the protocol adapted very well to whatever the underlying connection speed was, it no longer does so (or does not appear to). Instead we are having to mess with it to get it even close to usable, which we never had to do in 5.0.1. Also, if we limit to 700 kbps (for example), that helps people working remotely, but when they are in the office on the LAN the same limit makes their experience worse than before, when they should be able to take full advantage of being local.

For now, everyone is using RDP on the WAN as a workaround. Surely this can't be the 'improvement' we were promised - is no-one else having such issues?

Linjo
Leadership

What kind of link is this?

How many users?

What kind of latency?

Any QoS on the link?

Do you have any idea of what the congestion algorithm is on the routers? (Some routers have "taildrop" as default and that is very bad for PCoIP, "WRED" is a much better alternative)
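To illustrate why the queueing policy matters, here is a minimal sketch contrasting tail drop with the classic RED drop curve that WRED builds on. All thresholds are made-up illustrative values, not settings from any particular router. Real implementations work on a smoothed average queue depth, and WRED additionally keeps separate thresholds per traffic class; neither is modelled here.

```python
def tail_drop_probability(queue_len: int, queue_limit: int) -> float:
    """Tail drop: accept everything until the queue is full, then drop everything."""
    return 1.0 if queue_len >= queue_limit else 0.0


def red_drop_probability(avg_queue: float, min_th: float, max_th: float, max_p: float) -> float:
    """Classic RED curve: no drops below min_th, a linear ramp up to max_p
    between min_th and max_th, and forced drops at or above max_th."""
    if avg_queue < min_th:
        return 0.0
    if avg_queue >= max_th:
        return 1.0
    return max_p * (avg_queue - min_th) / (max_th - min_th)


# Hypothetical thresholds, in packets.
QUEUE_LIMIT, MIN_TH, MAX_TH, MAX_P = 40, 20, 40, 0.10

for depth in (10, 25, 35, 40):
    tail = tail_drop_probability(depth, QUEUE_LIMIT)
    red = red_drop_probability(depth, MIN_TH, MAX_TH, MAX_P)
    print(f"queue depth {depth:>2}: tail drop p={tail:.2f}, RED p={red:.3f}")
```

With tail drop the sender sees nothing until the queue overflows and then loses a whole burst at once; the RED-style ramp starts dropping a small fraction earlier, which gives an adaptive protocol a gentler signal to back off.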

// Linjo

DarrenBull
Contributor

Hi, the link is a 20 Mbps leased line at the HQ end. It is maybe 5% busy at most during testing. When testing, I can cause the issue with a single user connected from 'home' (this is a public ADSL connection we have in the office), coming into this 20 Mbps leased line. Latency is well within limits at around 30-50 ms on average, sometimes less. No QoS... but then there wasn't in 5.0.1 when it was fine. We have tinkered with prioritisation on the firewall during testing, following the Teradici WAN best practice guide, but nothing seems to make a difference. No idea about the routers, as they are not ours, but this is so frustrating because before this was a non-issue. I may take this up with the ISP, though I am not sure it will make a massive difference. Thanks for mentioning it.

I did test from home yesterday over PCoIP and everything was absolutely perfect; I could almost have run Flash video over the WAN, it was that good. And yet I can come into the office and use our test 'home connection', which was always fine on 5.0.1, connect externally to the 20 Mbps line and into our security server, and it's terrible. The average link speed on this connection is about 800 kbps. That is not the ideal 1 Mbps Teradici say is required for 'bursting' of screen data, accepted, but my god it's awful: 20-second screen freezes during a major screen refresh. Many users who live in remote locations with poor internet are complaining of the same and are all using RDP without issues. On 5.0.1, they were happy. From what I can see, it's just not as effective over restricted-bandwidth connections as it was.

I have a call logged with VMware and Teradici and am waiting to hear back. Hopefully I've missed a gotcha somewhere.

wstoffel1
Contributor

Any updates on this?  I have the EXACT same issue and don't know where to turn at this point.

DarrenBull
Contributor

Hi, yep, sorry to let this one lapse, but as users can use RDP with no problem we've pushed it down the priority list a bit. Not the way it's supposed to be, but hey, people can work (although not using the protocol of choice). We are still working with Teradici, who have been great at trying to sort this one out. It's taking so long because we're being pretty slow to get back to them, given the non-urgency.

It seems View 5.1 does work differently to View 5.0 in the way it handles latency. Although our normal latency on the test link is 25-30 ms, apparently the logs show spikes of 900 ms. When this happens PCoIP throttles right back, thus causing our issue (they say).

The current recommendation is to disable all the enhanced Windows display settings and so on, although again this was never an issue in 5.0. Also to set a bandwidth floor for the session. Counterintuitive maybe, but it's telling PCoIP "no matter how bad the link is, keep pushing data through at this rate", kind of ignoring the latency spikes as I see it. Also to mess about with frame rates, image quality and so on... again, we never had to do this before.

So, it seems there is a difference between versions that may cause what we are seeing, although as yet we have not managed to sort it out. Will post if we ever do... beginning to think View 5.2 may be a good first step.

wstoffel1
Contributor

That was fast, thank you.

We've been working with Teradici as well. Response on their part is a little slow, but I'm not going to complain since it's free :)

I do have a question regarding using RDP. Are you still using View, just with TCP 3389 instead of UDP 4172? Or are you actually using MSTSC to make the connection? I ask because when I pressed the VM guys about switching, even just for testing, they did the former and still used View, and we still had the performance issues. I'll have someone on site tomorrow to test whether they get the performance they are looking for when they actually use MSTSC to make the connection. I'm betting yes. What's your take on it?

Thanks again!

DarrenBull
Contributor

Actually our RDP connections are fine, using the View Client. Replying to this has spurred me on to go and do some more testing, and we do seem to be able to improve things quite a bit by setting a bandwidth floor of 750 kbps and disabling build to lossless. It seems much happier like this, with Windows visual features maxed out or not.

Other settings used:

Maximum initial image quality: 70

Minimum image quality: 40

Maximum FPS: 6

Build to lossless disabled.

Max audio bandwidth: 100 kbps

Session bandwidth floor: 750 kbps

Max session bandwidth: 4096 kbps

Now this works (as in, it improves the screen freeze quite a bit and the desktop is now usable) on our test ADSL link because we know the bandwidth available. If I deployed these settings to end users, I'm not sure how something like 'minimum bandwidth' would behave if they didn't have that much available to start with. Also, whilst this may work on the WAN, if this desktop was used on the LAN to view video the result would be extremely choppy. We've downloaded pcoip_config.exe, which allows users to switch PCoIP profiles, but unless I have to I'm not going to start confusing them with stuff like this; they can do without it. It helps for testing, though.
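As a sanity check on that concern, here is a small sketch that holds the values listed above in a profile and flags the obvious conflicts for a given link speed. The class, field and function names are hypothetical and are not the real pcoip.adm setting names; the max session value is assumed to be in kbps; the 512 kbps link is an invented example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PcoipProfile:
    # Field names are illustrative only, not actual pcoip.adm policy names.
    max_initial_quality: int
    min_image_quality: int
    max_fps: int
    build_to_lossless: bool
    audio_limit_kbps: int
    floor_kbps: int
    max_session_kbps: int  # assumed to be kbps


def check_profile(profile: PcoipProfile, link_kbps: int) -> List[str]:
    """Return human-readable warnings for settings that conflict with each other
    or with the bandwidth the user's link can actually deliver."""
    warnings = []
    if profile.min_image_quality > profile.max_initial_quality:
        warnings.append("minimum image quality exceeds maximum initial quality")
    if profile.floor_kbps > profile.max_session_kbps:
        warnings.append("bandwidth floor exceeds the session maximum")
    if profile.floor_kbps > link_kbps:
        warnings.append(
            f"bandwidth floor {profile.floor_kbps} kbps exceeds link capacity "
            f"{link_kbps} kbps, so the floor would push more than the link can carry"
        )
    return warnings


# Values from the list above.
WAN_PROFILE = PcoipProfile(70, 40, 6, False, 100, 750, 4096)

for link in (800, 512):  # 800 kbps is the test ADSL; 512 kbps is hypothetical
    issues = check_profile(WAN_PROFILE, link)
    print(f"{link} kbps link:", issues or "no conflicts found")
```

The point of the third check is that a fixed floor is only safe when every user's link can carry it, which is exactly what we don't know for home users.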

With a bit of tuning it can be improved, it seems. The question is, why did we never have to do this before? It seems to have got worse over slow or high-latency links, in my opinion.

wstoffel1
Contributor

I've been fighting with the "build to lossless" issue, in the sense that no one wants to touch it, not even for testing. For testing! The users are high-end graphics AutoCAD and Revit users, so apparently that's a big no-no.

You've now spurred me to push that!  I may have to take it upon myself and just set that group policy for my test users and see where it leads.

Thanks for the insights. My guess is that in the updated version someone somewhere knew the issues it would cause for a certain few of us, but for the vast majority the bug fixes and improvements were worth it. They just pushed the issues down to the network layer, figuring guys like us will figure it out eventually. Well, if application tweaks end up fixing this, then it's 100% a Layer 7 issue. It is fun tracking it down, I'll give 'em that.

DarrenJHron
Contributor

This also sounds very, very similar to what we have been running into. I just wanted to add that others are having inexplicable performance issues too. Looking forward to seeing if the suggestions work in our environment.

LarryBlanco2
Expert

Anyone come up with a solution? I have users on a 20 Mbps link and Tx packet loss gets pretty high. I've reviewed CPU, memory and datastore and cannot come up with concrete evidence as to why this occurs.
