This is a challenge I faced with Published Applications on VMware Horizon 7 in April 2017. The environment consisted of:
- VMware Horizon 7.0.3 (recently upgraded from 7.0.0)
- Multiple Manual RDS Farms
- App Volumes 2.9
- User Environment Manager 8.7
- vSphere 6.0
- 4 x Connection Servers, 2 x Security Servers, load balanced via F5 BIG-IP
The above platform had been running successfully for six months. One morning, all of the RDSH servers in one particular farm started blue screening. A number of Application Pools were published from that farm; however, they were all instances of one application delivered to the RDSH servers on an AppStack. All of the servers in the farm were impacted, but not at the same time. The blue screens began around 4 a.m. and stopped around 3 p.m. The following day, the same behaviour occurred.
Day 1: We confirmed only one farm was impacted, so investigation centered on the applications being delivered from that farm. The application had been recently upgraded, so we reverted to the previous AppStack; however, the blue screening continued. Attention then turned to one instance of the application that was being executed from a network share. This was modified to run from the AppStack, which appeared to resolve the issue.
Day 2: Servers begin to blue screen again just after 4 a.m. Troubleshooting turns to the version of App Volumes, with a suspected filter driver issue. This is ruled out, however, as only one farm is impacted; other farms have applications delivered by the same App Volumes servers and are not affected. Attention turns to the recent Horizon upgrade (7.0.0 to 7.0.3). Only the Connection and Security Servers have been upgraded; the Agent on the RDSH servers is still on 7.0.0. The rest of the day is taken up planning a rollback of the environment to 7.0.0. Blue screens stop around 3 p.m. Dump files are collected and uploaded to both VMware and Microsoft support.
Overnight, half of the Connection Servers and Security Servers are removed from the existing Horizon POD and rebuilt on version 7.0.0 in a separate POD. The load balancer is reconfigured to send all connections to the new nodes only. Each RDSH server is reprovisioned from an updated template, registered with the new POD, and has its applications provisioned from App Volumes. Access to the new environment is made available at 5 a.m.
Day 3: RDSH servers in the freshly built farm start blue screening shortly after 5 a.m. Another support call with Microsoft and VMware is started. During this call, while looking through another set of uploaded crash dump files, engineers discover that the RDSH display information suggests a connection is being established with four monitors but a NULL primary display. They start investigating connections to the environment using Splunk logging of Horizon Volatile Environment Variables, and find multiple connections being established with four monitors and one with five. After the user with five monitors is contacted and asked to log out, all server crashes cease.
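The kind of check that surfaced the five-monitor session can be sketched in a few lines of Python. This is only an illustration, not the Splunk search we actually ran: it assumes the session data has already been exported as one record per connection, and the field names `user`, `host` and `monitors` are hypothetical placeholders for whatever your logging captures.

```python
def sessions_over_monitor_limit(records, limit=4):
    """Return (user, host, monitor_count) for sessions above `limit` monitors.

    `records` is an iterable of dicts. The keys "user", "host" and
    "monitors" are hypothetical names chosen for this sketch.
    """
    flagged = []
    for rec in records:
        try:
            count = int(rec.get("monitors", ""))
        except (TypeError, ValueError):
            # Skip records where the monitor count is missing or not numeric.
            continue
        if count > limit:
            flagged.append((rec.get("user"), rec.get("host"), count))
    # Highest monitor counts first, so the outlier is at the top.
    return sorted(flagged, key=lambda item: item[2], reverse=True)


if __name__ == "__main__":
    # Made-up sample data, for demonstration only.
    sample = [
        {"user": "alice", "host": "rdsh01", "monitors": "4"},
        {"user": "bob", "host": "rdsh02", "monitors": "5"},
        {"user": "carol", "host": "rdsh01", "monitors": None},
    ]
    for user, host, count in sessions_over_monitor_limit(sample):
        print(f"{user} on {host}: {count} monitors")
```

Sorting by monitor count puts the unusual session first, which is what mattered here: a single outlier connection rather than a widespread pattern.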
The issue was caused by incorrect handling of a connection with five monitors when the fifth monitor was set as the primary display on the endpoint. This was confirmed through rigorous testing. Only one farm was impacted because the user had access to only one published application within the environment. VMware provided an updated library file that correctly handles a NULL primary display, and the environment was returned to version 7.0.3.