I have had a ticket open with VMware since February regarding this issue. We have a newly created Windows 10 image for use with 2 automated pools (M10 and M60). On a daily basis, half of the M10 pool's desktops will be in an "agent unreachable" state. When this happens, a user logging in will see the message "desktops are currently not responding". The desktops in question will either have correct IP addresses but not be pingable, or have self-assigned IPs. The agent version is 7.0.3 on both pools. We turned off the firewalls, and the issue persists.
Some additional details:
1. The issue only happens on these 2 pools and a set of 3 GRID servers with dedicated graphics (so when this issue happens we cannot see the desktops through vSphere). The issue is not with every desktop; we may have 4 available and then have 4 agent unreachable machines.
2. A little over a month ago we disabled APIPA and then found that the machines that would get self-assigned IPs now get IPv6 addresses.
3. If we keep deleting machines, upwards of 10, we can get a set that are all available, but the unreachable ones will eventually come back as users log off.
4. Here is another new thing that has become noticeable within the past month. Sometimes machines will have correct IPs (ones that are on the correct VLAN), but when pinging the machine name, I get a different IP. When I check DHCP, the machine will sometimes have the same IP that shows in vSphere, and other times the IP cannot be found anywhere in our split scope linked to that machine. Again, we have 4 other pools that are Win 7 and on different servers that don't have any of these issues at all.
5. We created a new Windows 10 pool and the issue occurs with that pool as well.
Our DHCP has plenty of IP addresses available. We have our DNS set to scavenge every 8 hours (on one server in the scope). Any hints on any other places we can look? At this point, I'm willing to dismantle anything.
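For anyone triaging a similar list of desktop addresses, here is a minimal Python sketch that flags self-assigned addresses. It assumes the IPv6 addresses seen after disabling APIPA are link-local (fe80::/10), which the thread does not confirm; the function name and sample addresses are made up for illustration.

```python
import ipaddress

def classify_address(addr: str) -> str:
    """Label an address as APIPA, IPv6 link-local, or a normal IPv4/IPv6 address."""
    ip = ipaddress.ip_address(addr)
    if ip.is_link_local:
        # 169.254.0.0/16 is APIPA; fe80::/10 is the IPv6 equivalent.
        return "apipa" if ip.version == 4 else "ipv6-link-local"
    return "ipv4" if ip.version == 4 else "ipv6"

# Hypothetical sample addresses, not taken from the environment in this thread.
for sample in ["10.20.30.40", "169.254.10.20", "fe80::1c2d:3e4f:5a6b:7c8d"]:
    print(sample, classify_address(sample))
```

A desktop that shows up as "apipa" or "ipv6-link-local" most likely never got a DHCP answer, which points at the network path rather than the agent.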
Which AntiVirus Solution are you using? Agent or Agentless?
We had a pretty similar case - the fix was to use this:
Antivirus executable exclusion list for VMware Horizon View 5.x (2082045) | VMware KB
Another thing to consider which we ran into was a specific financial application that required unique certificate installation using a vendor script. This was changing the permission on the crypto folder in the machine. This stores the R system files preventing the view agent from working properly. This was in Horizon 6.2.2 and Windows 10 LTSB.
Sorry, somehow the previous entry got posted before I could finish it. So the crypto folder stores the RSA token used by the View agent. Since it doesn't have permissions to it anymore, the agent is not able to start. We had to change the permissions to give Everyone read access to the folder and change access for the specific key. Another option is to re-install the View agent to generate a new key which has the appropriate access.
1. VDS. I have not attempted to change the port. They are definitely losing connectivity; sometimes instead of "agent unreachable" we get "Provisioning error occurred for Machine NewMerritt-05: Customization error due to no network communication between the View agent and Connection Server".
2. Some days there are multiple entries for some of the machines, but not today. I have 6 current unreachable and all of them have single entries in DNS and they match the DHCP table.
3. There are 3 hosts. The issue is across all 3.
4. Yes they are all correct.
5. Yes all are on the same version 6.0.0, 5050593.
We do not have an antivirus solution installed, nor do we use software with specific cert requirements. Thanks for all of your assistance in things to look into. I'm not sure what's different about the machines that work and the ones that don't. I do know that the ones that are working now will likely give agent unreachable in the morning (or at the next reboot).
Try changing the port for one of the desktops that is not in the network. I am assuming you are using VMware vDS and not a third party vDS such as Cisco 1000v. Does doing this bring it back in the network? Have had a similar issue but in 5.x version and changing port brought the VM back in the network.
If it's only happening on the Windows 10 desktops, have you tried building a new gold image from scratch (just for testing's sake)?
Try a different pool with the new Horizon 7.1 Agent installed - maybe there is something wrong with either the Agent or VMware Tools
Use this scheme:
1. Connect to the VM or template from the vCenter console
2. Uninstall VMware View Agent
3. Reboot the machine
4. Uninstall VMware Tools
5. Reboot the machine
6. Delete HKEY_LOCAL_MACHINE\SOFTWARE\VMware, Inc.\VMware Tools
7. Delete HKEY_LOCAL_MACHINE\SOFTWARE\VMware, Inc.\VMware VDM
8. Delete C:\Program Files\VMware\VMware View\*.* (might not be necessary, and the path depends on the guest OS)
9. Delete C:\Program Files\VMware\VMware Tools\*.*
10. Delete C:\Documents and Settings\All Users\Application Data\VMware\VDM\*.*
11. Reboot the machine
12. Install VMware Tools (deselect the SVGA driver)
13. Reboot the machine
14. Install VMware View Agent (7.1)
15. Reboot the machine
What does the desktop screen look like when you open it using the vCenter client?
If it is a blue screen (which usually means Windows crashed), then the memory dump needs to be analyzed to find the cause.
If it is just a black screen (this is normal if you use PCoIP or Blast), then the desktop may be hung, and you can dump the desktop by suspending it.
Raise a support ticket along with the memory dump of the BSOD or the .vmss (and .vmem, if it exists) file. You can find MEMORY.DMP in the Windows folder and the .vmss file in your virtual machine directory.
I have seen similar behavior like this and it is usually a mismatch between what the DHCP scope identifies as an available IP address and what the Active Directory DNS zone has registered. Try the following test.
1. Modify the VM name for the pool slightly, for instance add a -DNS to the name. (This will force new unique DNS records)
2. Set the pool to refresh on logoff only. (The VMs will not be deleted and thus will never need to re-use names.)
3. Rebuild the pool.
I would be very curious if any VM goes offline at some point in the future after that setup. This is only a test to verify the issue and does not need to be a permanent solution.
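To look for that kind of mismatch directly, here is a minimal Python sketch. It assumes you have already exported the DHCP leases and the DNS A records into two dictionaries keyed by hostname; how you export them depends on your servers and is not shown. The function name, hostnames, and addresses are hypothetical.

```python
from collections import defaultdict

def find_dns_dhcp_conflicts(dhcp_leases, dns_records):
    """Compare DHCP leases {hostname: ip} with DNS A records {hostname: [ip, ...]}.

    Returns (stale, shared):
      stale  - hosts whose DNS records are not exactly the leased IP
      shared - IPs that more than one hostname claims in DNS
    """
    stale = {}
    for host, leased_ip in dhcp_leases.items():
        dns_ips = dns_records.get(host, [])
        if dns_ips != [leased_ip]:
            stale[host] = {"dhcp": leased_ip, "dns": list(dns_ips)}

    owners = defaultdict(set)
    for host, ips in dns_records.items():
        for ip in ips:
            owners[ip].add(host)
    shared = {ip: sorted(hosts) for ip, hosts in owners.items() if len(hosts) > 1}
    return stale, shared

# Hypothetical data: vdi-02 has a stale record that collides with an old VM name.
leases = {"vdi-01": "10.0.0.11", "vdi-02": "10.0.0.12"}
records = {"vdi-01": ["10.0.0.11"], "vdi-02": ["10.0.0.30"], "old-07": ["10.0.0.30"]}
stale, shared = find_dns_dhcp_conflicts(leases, records)
print("stale:", stale)
print("shared:", shared)
```

Any host in `stale`, or any IP in `shared`, is a candidate for the DHCP/DNS mismatch described above.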
Thanks for the replies and assistance with this issue guys.
h3nkY,
Good question. We are unable to see what is going on through vCenter because they use dedicated NVIDIA graphics. On working machines we use a combo of VNC Viewer and RDP, but with no network connection, both of those are off the table, so we are flying blind.
dineshgoundar,
When you say change the port, do you mean the port group? We have done that on the VMs and there is an 80% chance of it reconnecting within a minute of doing that. I have successfully pulled logs from th
SchwarzC,
We attempted to upgrade the agent, but for some reason in our environment, when logging into the machines, we are immediately logged out of them. This happens with physical/manual pools as well as automated ones. I'm guessing it's because our connection server is still 7.0.3. We will upgrade both, but it probably won't be until next week.
WarrenM01,
I'll try this today and report back with results.
Thanks again for everyone's help.
Hi Greg,
I meant editing the settings of the VM --> select network adapter --> Switch to advanced settings --> change the Port ID to an empty port.
To check for empty port, go to Networking view in vCenter, select the port group the VM is in, select Ports tab and choose a free port number.
You mentioned "80% chance of it reconnecting within a minute of doing that". Does that mean the VM comes back in the network and becomes "Available" in Horizon? What's the teaming setting in the vDS and the physical uplinks?
Ah. I see. Changing the port didn't get the machines back up and running. I was off on my original statement. When changing the port group to something else and changing it back to the VDI port group, the machine instantly comes back to life as an available machine with the same port number. Also, it comes back 100% of the time except under one condition: if View shows an IP for a machine and I ping by machine name, I get a different IP address. I initially thought 80% until I looked closer at what was different between machines.
The IP that it pings to is unreachable and does not appear in DNS or DHCP linked to the VM in question. The IP that shows in DHCP matches the one in vCenter. The one that it pings to in cmd is linked to a different machine and pool. So, my issue currently is trying to figure out why the machines come back when changing the NIC and where the mystery IPs come from. I'm checking on the VDS and physical uplinks now.
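A rough way to spot the machines where the name resolves to a different IP, as a minimal Python sketch. It assumes you have a list of machine names with the IP that View or vCenter shows for each; the expected addresses below are invented, and the demo uses a stub resolver so it runs without touching a real DNS server. Leave out the `resolver` argument to use the real lookup.

```python
import socket

def resolve_ipv4(hostname):
    """Return the IPv4 addresses the local resolver gives for hostname."""
    try:
        return sorted(set(socket.gethostbyname_ex(hostname)[2]))
    except OSError:
        # Covers lookup failures (socket.gaierror and socket.herror).
        return []

def check_name_vs_expected(expected, resolver=resolve_ipv4):
    """expected: {hostname: ip shown in View/vCenter}.

    Returns {hostname: resolved_ips} for every host whose name does not
    resolve to the expected IP.
    """
    mismatches = {}
    for host, expected_ip in expected.items():
        resolved = resolver(host)
        if expected_ip not in resolved:
            mismatches[host] = resolved
    return mismatches

# Demo with a stub resolver and hypothetical addresses.
stub_dns = {"NewMerritt-05": ["10.0.0.99"], "NewMerritt-06": ["10.0.0.16"]}
expected = {"NewMerritt-05": "10.0.0.15", "NewMerritt-06": "10.0.0.16"}
print(check_name_vs_expected(expected, resolver=lambda h: stub_dns.get(h, [])))
```

Note that this goes through the local resolver cache just like ping does, so a stale cached answer on the admin workstation would show up here too.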
Not sure if this is the issue, but we saw something similar in our environment and had to up the vGPU profile. You can view the vmware.log for the machine to see the error. When the memory was exhausted, the machine would lock up.
Here are the release notes.
Known product limitations for this release of NVIDIA GRID are described in the following sections.
vGPU profiles with 512 Mbytes or less of frame buffer support only 1 virtual display head on Windows 10
Description
To reduce the possibility of memory exhaustion, vGPU profiles with 512 Mbytes or less of frame buffer support only 1 virtual display head on a Windows 10 guest OS.
The following vGPU profiles have 512 Mbytes or less of frame buffer:
Workaround
Use a profile that supports more than 1 virtual display head and has at least 1 Gbyte of frame buffer.
I think we finally got it resolved after all of these months. Thanks for all the support on this. It appears that everything that was happening on the VMware side of things was just a symptom of the server NIC configuration. I opened a ticket with Dell (a five-hour call) and they found that the NIC configuration on the Emulex cards was wrong. There was a fixed NIC on each card and then 3 func#s that you could set to any type of protocol you wanted (NIC, iSCSI, or FCoE). We set all configurable func#s to none (except the FCoE), and the machines have been running as expected over the last 2 days for the first time. Everything that was unreachable instantly became available without deleting. Hopefully this helps someone.