Narrow down "bandwidth" issue

willmeister · ‎02-04-2013

I have a client that has two offices that connect into my datacenter to access their Windows 7 desktop VDI.

Stamford office connects with View client 5.0.1 build 640055 over their Time Warner 50/15 business class internet connection and has no problems.

Hartford office has a 20Mbps Layer2 Point to Point circuit to my datacenter and they have a whole host of performance issues. They get delays when typing, screen refreshes are horrible, horizontal banding in the display at times, very jerky mouse movements and so on. Same View version as Stamford.

If you connect to a "hartford vdi" from Stamford there are no problems. If you connect to a known working "stamford vdi" from Hartford you'll get the same horrible Hartford user experience.

Sounded initially like a circuit issue, but with all the intrusive testing by both the carrier and myself I'm dead convinced it's a full 20Mbps circuit. The kicker is that whenever the user experience is at its worst, the circuit is completely underutilized, maybe pushing 6 or 8 megs.

It's as if View itself is throttling down. I'm not a VM person but have been troubleshooting the network on this issue far too long, so I started looking at View log files. And I need help understanding them. I just collected the support logs and attached what I thought was appropriate. I'll gladly attach others if need be.

Is there a way to find the bottleneck in my network with these files? Please note also that at around 7:15pm in the logs I was running an iperf UDP test from the host with View installed to an iperf server in the datacenter.

thanks in advance.

Message was edited by: willmeister: Added the pcoip_server files I neglected originally. I included 2: one from the "working" site, one from the troublesome one.

CameronUBC · ‎02-04-2013

http://mindfluxinc.net/?p=195

If you can't get this program to work, use the PCoIP performance monitor data sources to visualize any potential problems. Look for any large fluctuations.

mittim12 · ‎02-05-2013

The tool listed in the other post is a great utility, but it also helps to know what it's looking at. Check out this link, http://myvirtualcloud.net/?p=751, and feel free to browse around the site, as Andre has some great information on troubleshooting PCoIP. Also, you want to include the pcoip_server file in any of your logs, as that is where you will find the bulk of the connection information.

Don't be afraid to open a ticket with http://www.teradici.com/ too. I have found that they have wonderful support and are willing to go the extra mile to get you working correctly.

mittim12 · ‎02-05-2013

I assume pcoip_server_2013_02_04_000011c0.txt is from the working site? Do you have another one from the troublesome site, as that one really didn't have any data in it?

willmeister · ‎02-05-2013

Yeah, that one was from the working site. This is the first time I'm seeing these files so I'm not quite sure what I'm looking at. Initially I thought I had mixed them up: with all the decreases in bandwidth I thought that file was from the problem site and it was being throttled down. In fact that client is working, probably because it is being throttled? And that site does have limited bandwidth.

My problem site may not have sufficient bandwidth, I suppose, but it's hard to rationalize that with the circuit barely being used.

Here are several more files from a single workstation that consistently has performance issues.

Linjo · ‎02-05-2013

Look for QOS rules and incorrectly configured routers on the way.

If a UDP QoS rule is applied (I have seen this in some university networks when they try to minimize the effect of UDP traffic like BitTorrent) then PCoIP will have the lowest priority.

On routers there are different algorithms to deal with congestion, TailDrop is bad for PCoIP, WRED is better.
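For anyone not familiar with the two algorithms, here is a rough sketch of the difference. It models the textbook RED drop curve that WRED is built on, not any vendor's implementation; the function names and the threshold values in the example are made up for illustration. Tail drop does nothing until the queue is full and then drops everything, while RED/WRED starts dropping a small fraction of packets early, once the average queue depth passes a minimum threshold.

```python
def tail_drop_probability(queue_len, queue_limit):
    """Tail drop: accept everything until the queue is full, then drop all arrivals."""
    return 1.0 if queue_len >= queue_limit else 0.0


def red_drop_probability(avg_queue, min_th, max_th, max_p):
    """Textbook RED curve (WRED applies one such curve per traffic class).

    Below min_th nothing is dropped; between min_th and max_th the drop
    probability rises linearly up to max_p; at or above max_th every
    arriving packet is dropped.
    """
    if avg_queue < min_th:
        return 0.0
    if avg_queue >= max_th:
        return 1.0
    return max_p * (avg_queue - min_th) / (max_th - min_th)


if __name__ == "__main__":
    # Hypothetical thresholds, in packets, chosen only for illustration.
    for depth in (10, 30, 39, 40):
        print(depth,
              tail_drop_probability(depth, queue_limit=40),
              red_drop_probability(depth, min_th=20, max_th=40, max_p=0.1))
```

The gradual curve spreads drops out over time instead of producing one burst of loss when the queue overflows.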

// Linjo

wstoffel1 · ‎02-06-2013

As far as the point to point circuit goes, it's a straight-up layer 2 circuit and the carrier swears up and down they are doing nothing. In fact from my perspective it's just a physical connection. My tests confirm that: iperf -c 192.168.183.10 -p 4172 -u -b 20M -t 300 floods the connection with UDP traffic, with no loss.

Client side are HP ProCurve switches, minimally configured. Essentially out of the box and plugged in. No QoS, or CoS for that matter.

We don't have QoS enabled on our side, though they are Alcatel switches.

However regardless if it's View or Iperf, UDP 4172 is UDP 4172. I am going to revisit the switches on our side though. Thank you.

I'd be much happier if I looked at the flows and saw the link between sites at capacity, which is what the user experience indicates, but that's not the case. In fact the carrier removed all rate limiting for a week, so we had a 100Mbps point to point circuit (limited only by the 100Full switch port at each end of the demarc), and there was zero performance improvement. The flows consistently showed between 4 and 8 megs of traffic, except when I ran iperf with -b 100M and was able to see 97 megs in both directions.

Any other thoughts?
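To put the numbers above in perspective, here is the arithmetic as a small sketch. The function name is just for illustration; the 4 to 8 Mbps flow figures and the 20 and 100 Mbps circuit rates are the ones quoted in this thread.

```python
def utilization_pct(observed_mbps, capacity_mbps):
    """Return link utilization as a percentage of circuit capacity."""
    return 100.0 * observed_mbps / capacity_mbps


if __name__ == "__main__":
    for capacity in (20, 100):  # rate-limited circuit, then the unthrottled week
        low = utilization_pct(4, capacity)
        high = utilization_pct(8, capacity)
        print(f"{capacity} Mbps circuit: {low:.0f}% to {high:.0f}% utilized")
```

Either way the PCoIP flows sit well under half of the circuit, which is why raw capacity looks like an unlikely culprit.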

mittim12 · ‎02-06-2013

I see the entry "entering out of order mode loss detection" in the log files a couple of times. I've never experienced this before, but I had a VMware tech tell me that this could point to some kind of packet manipulation that causes issues. He referenced Cisco devices running CEF when discussing the entry.

Maybe Linjo can go into more detail and tell us if this was correct information or give more detail.
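If it helps to see how often and when that entry shows up, here is a minimal sketch that scans a log for the phrase quoted above. It assumes nothing about the pcoip_server log format beyond it being plain text, one entry per line; the function names are my own, and you pass the log file paths on the command line.

```python
import sys
from pathlib import Path

# Phrase quoted from the log entry discussed in this thread.
PHRASE = "entering out of order mode loss detection"


def find_phrase(lines, phrase=PHRASE):
    """Return (line_number, text) for every line containing phrase, case-insensitively."""
    needle = phrase.lower()
    return [(n, line.rstrip("\n"))
            for n, line in enumerate(lines, start=1)
            if needle in line.lower()]


def scan_file(path, phrase=PHRASE):
    """Scan one plain-text log file and return its matching lines."""
    with Path(path).open(encoding="utf-8", errors="replace") as fh:
        return find_phrase(fh, phrase)


if __name__ == "__main__":
    for log_path in sys.argv[1:]:
        hits = scan_file(log_path)
        print(f"{log_path}: {len(hits)} match(es)")
        for number, text in hits:
            print(f"  line {number}: {text}")
```

Comparing the hit counts and line positions between the working site's log and the troublesome site's log should show whether the entry only appears on the Hartford side.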

wstoffel1 · ‎02-06-2013

Nope, no CEF inline with any PCoIP traffic. The internet routers run CEF, but that's a dedicated internet connection for the VDIs themselves and is on a separate segment from this point to point. And I would have to believe it would affect other customers if the issue lay there.

wstoffel1 · ‎02-07-2013

I started another thread for something interesting I found last night in vSphere while looking at the packet counts in Inventory>Networking>Ports

http://communities.vmware.com/message/2191274#2191274

Obviously my issues are tied to this, just trying to figure out why... but here's what I saw:
