Note 1: We use instant clones, with Horizon View 7.4 and vCenter 6.5.
Note 2: We had something similar a few weeks ago. During a firewall upgrade, both vCenters of our two VDI environments lost network connectivity, which put the VDI infra in a weird state. We had to reboot all the connection servers and vCenter servers to bring it back to normal.
Now, after a network wobble over the weekend, our monitoring picked up around 5 minutes of downtime on the vCenter of our VDI environment. In a normal environment you would see hosts disconnect and reconnect, nothing too bad. I expected Horizon View to be resilient enough to survive such a small issue.
I remembered what happened last time, so I went to check, and there it was: the same issue as before.
The issue is rather weird. There is connectivity between the CS and vCenter, but it seems "some" things don't work.
25-Jun-2018 09:09:10 CEST: Failed to delete VM - Caught exception deleting VM XXXXXXX. Timed out waiting for operation to complete. Total time waited 5 mins
Pairing state:In pairing...
Configured by:<connection server #1> <connection server #2> ...
Attempted theft by:
Automatic error recovery for Pool YYYYYY: attempting recovery for Machine XXXXXXX
Which of course does not succeed. I could clean it up with ViewDbChk, but that won't actually solve anything, so it's no use.
The creation of a new pool also fails.
As you can imagine, having a network wobble break the VDI infra and having to restart everything is not sustainable for production, especially as more and more users rely on it.
I'm new to Horizon View, so any ideas would be greatly appreciated.
"downtime of around 5 minutes"
That is a long time. In the back end, Horizon uses an LDAP database that tries to mimic the Horizon infrastructure. When you delete a VM, it's deleted from the LDAP database and also from vCenter. When that connection is interrupted, the connection servers are no longer synchronized, and the way to fix it is to restart the connection servers. I don't know of any application that can survive a 5 minute network outage. You should read this
We're experiencing similar issues since upgrading to Horizon 7.4 from 7.2.0 about a month ago.
We are seeing that if a vCenter goes down (for a maintenance reboot, Windows patching, etc.), View logs that it's unavailable for the duration that it's down, and reconnects when it's back up, but has severe issues until a Connection Server service restart takes place on one or all of the Connection Servers in the pod.
We have power-on policies on all of our VM pools (persistent, full clone VMs), where if a VM is shut down by a user, View tells vSphere to power it back up almost immediately. This operation stops working once the vSphere unavailability described above takes place. We also have provisioning problems (failed customization, "Error", "Missing", etc.).
This has happened at least 3 times over the last 2 weeks, and it's beginning to impact production (VMs powered off so users can't access them). We have had to constantly monitor the inventory and manually power them on, sometimes dozens at a time (we have a lot of VMs).
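For anyone wanting to script the "watch the inventory" part, here is a minimal sketch of the filtering step only. It assumes you already have an inventory export as a list of records with `name`, `pool` and `power_state` fields; that record layout and the function name are hypothetical, not something Horizon or vSphere produces in this shape. The `poweredOff` value follows the vSphere power state naming.

```python
from typing import Dict, Iterable, List, Set


def vms_needing_power_on(
    inventory: Iterable[Dict[str, str]],
    always_on_pools: Set[str],
) -> List[str]:
    """Return names of powered-off VMs that belong to pools with a power-on policy.

    The record layout (name / pool / power_state) is an assumption for this sketch.
    """
    return sorted(
        vm["name"]
        for vm in inventory
        if vm["pool"] in always_on_pools and vm["power_state"] == "poweredOff"
    )


if __name__ == "__main__":
    sample = [
        {"name": "vdi-001", "pool": "Finance", "power_state": "poweredOn"},
        {"name": "vdi-002", "pool": "Finance", "power_state": "poweredOff"},
        {"name": "vdi-003", "pool": "Lab", "power_state": "poweredOff"},
    ]
    # Only the "Finance" pool is treated as always-on in this example.
    print(vms_needing_power_on(sample, {"Finance"}))
```

The actual power-on call is left out on purpose, since how you reach vCenter (PowerCLI, the vSphere API, etc.) depends on your environment.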
It's been resolved for us each time by recycling the Connection Server services overnight when fewer users are connecting (or by taking them out of the load balancers one by one and recycling the services), but that is getting old and not sustainable.
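The one-by-one recycling described above could be sketched roughly as follows. This is only an outline of the ordering, not a tested procedure: the `drain`, `restart`, `is_healthy` and `restore` callables are hypothetical placeholders for whatever your load balancer and service tooling provide.

```python
from typing import Callable, Iterable, List


def rolling_restart(
    servers: Iterable[str],
    drain: Callable[[str], None],
    restart: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    restore: Callable[[str], None],
) -> List[str]:
    """Recycle servers one at a time and stop at the first failed health check.

    Returns the servers that were restarted and put back in rotation.
    """
    done: List[str] = []
    for server in servers:
        drain(server)    # hypothetical: take the server out of the load balancer
        restart(server)  # hypothetical: recycle the Connection Server service
        if not is_healthy(server):
            # Leave this server out of rotation and stop here, so the
            # untouched servers keep serving users while you investigate.
            break
        restore(server)  # hypothetical: put the server back in the load balancer
        done.append(server)
    return done


if __name__ == "__main__":
    finished = rolling_restart(
        ["cs1", "cs2"],
        drain=lambda s: print(f"draining {s}"),
        restart=lambda s: print(f"restarting service on {s}"),
        is_healthy=lambda s: True,
        restore=lambda s: print(f"restoring {s}"),
    )
    print("recycled:", finished)
```

Stopping at the first unhealthy server is a deliberate choice here: it limits the damage to one Connection Server instead of working through the whole pod blindly.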
We're opening a BCS case shortly, but I wanted to add that we're seeing the same lack of resiliency since we upgraded to Horizon 7.4, which we didn't see on previous 7.x releases.
vCenter Server 6.0.0 Build 7462484
Thanks for the follow-up. We were given the new installer as well, but we haven't pulled the trigger and applied it yet.
Edit: The exe we received was VMware-viewconnectionserver-x86_64-7.4.0-8215536.exe. Looks like there's a newer build (8741716) others have received.
We asked for additional info on it (what else changes, can it be applied manually, how many customers have applied it, any known issues, etc.). With multiple View pods and close to 2 dozen Connection Servers, this is not a simple activity for us so we need to understand the risk before moving forward.
Is anyone here aware of any issues with the unreleased build?
As far as I can tell it is a very minor version (that's bad English). The support guys refer to it as a "hotpatch", though I could not find out much about what exactly is in it...
The potential benefit made it easily worth the risk for me, as I can't tolerate a network blip taking down the whole thing and forcing me to ask everyone to log off so I can shut down all the connection servers. It is a major and concerning flaw in my opinion.
Check with support, but maybe you could apply the patch to the connection servers of one pod to start with and see how it behaves? The install does not require a reboot, but it will kill the services for a good 2 minutes, so mind that if you have servers in tunneling mode.
Let us know how it goes.
PS: The second time I had the issue (pre-hotfix), I only cycled the connection servers and left the vCenter alone. That worked OK.