This initially started as a thread in the UEM forum. After some work, though, I thought it might be a better fit here.
Just this morning I logged into a pool we believe is having an intermittent issue running GPOs on startup. Some further testing showed another pool having the issue. What these 2 pools have in common is that they live in the same VDI cluster of hosts, so now we're looking at the cluster level. In another pool we tested, in a separate cluster, we couldn't get the issue to happen. This testing really needs to be vetted more, but it's what we have now.
So, on the VM it happened on yesterday, below is a screenshot of the error in Event Viewer, and the text from the details tab is below that. You'll see that the error happened at 8:28AM. I didn't log in through View onto the VM until about 8:55AM. Our working theory right now goes something like this:
Later in the morning I did this test in our problem pool:
So we're at a bit of a loss. There seems to be nothing consistent. Any input is appreciated.
+ System
- Provider
[ Name] Microsoft-Windows-GroupPolicy
[ Guid] {AEA1B4FA-97D1-45F2-A64C-4D69FFFD92C9}
EventID 1096
Version 0
Level 2
Task 0
Opcode 1
Keywords 0x8000000000000000
- TimeCreated
[ SystemTime] 2017-09-13T12:28:34.256250000Z
EventRecordID 45808
- Correlation
[ ActivityID] {8ECF79A7-CAD3-4787-BF63-A1AFA9B125C4}
- Execution
[ ProcessID] 112
[ ThreadID] 1320
Channel System
Computer REDACTED
- Security
[ UserID] S-1-5-18
- EventData
SupportInfo1 2
SupportInfo2 1254
ProcessingMode 1
ProcessingTimeInMilliseconds 3188
ErrorCode 64
ErrorDescription The specified network name is no longer available.
DCName \\domain.controller.fqdn
GPOCNName LDAP://CN=Machine,cn={DE16CA21-9FDB-4B20-8FED-DC8297247855},cn=policies,cn=system,DC=redacted,DC=redacted,DC=redacted
FilePath \\domain\sysvol\domain\Policies\{DE16CA21-9FDB-4B20-8FED-DC8297247855}\Machine\registry.pol
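To line the event details up with the timeline in the post, here is a minimal Python sketch that converts the SystemTime value to local time and names the Win32 error code. The UTC-4 offset is an assumption (US Eastern Daylight Time, inferred from 12:28Z matching the 8:28AM quoted above), and the helper names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the event details above.
SYSTEM_TIME = "2017-09-13T12:28:34.256250000Z"
ERROR_CODE = 64

# Assumption: the site is on US Eastern Daylight Time (UTC-4) in September.
ASSUMED_LOCAL_OFFSET = timezone(timedelta(hours=-4))

# Small lookup of Win32 error codes; only 64 appears in this event.
WIN32_ERRORS = {64: "ERROR_NETNAME_DELETED"}


def parse_event_time(system_time: str) -> datetime:
    """Parse an Event Viewer SystemTime string (UTC, up to 9 fraction digits)."""
    stamp = system_time.rstrip("Z")
    date_part, _, fraction = stamp.partition(".")
    # datetime only keeps microseconds, so trim the fraction to 6 digits.
    micros = int(fraction[:6].ljust(6, "0")) if fraction else 0
    parsed = datetime.strptime(date_part, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(microsecond=micros, tzinfo=timezone.utc)


def describe_event(system_time: str, error_code: int) -> str:
    """Return the local timestamp and symbolic error name as one line."""
    local = parse_event_time(system_time).astimezone(ASSUMED_LOCAL_OFFSET)
    name = WIN32_ERRORS.get(error_code, "unknown")
    return f"{local:%Y-%m-%d %H:%M:%S} local, error {error_code} ({name})"


print(describe_event(SYSTEM_TIME, ERROR_CODE))
```

Under that offset assumption the event lands at 08:28:34 local, roughly 27 minutes before the 8:55AM View login, so the failure belongs to machine startup rather than to the user session.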
And I just came across this. It sounds similar and promising. I would love to know why it only seems to happen with 1 cluster, though.
https://www.reddit.com/r/vmware/comments/5fypps/linkedclones_and_group_policy_gpo_not_applying_at/
We implemented the 2-minute wait but didn't have any luck. The issue remained. We've now opened a ticket. One other failure we're seeing in the event logs is this:
Error | 9/14/2017 11:50 | Service Control Manager | 7026 | None | The following boot-start or system-start driver(s) failed to load: ftsjail vnetflt |
vnetflt, I believe, has to do with NSX introspection via VMware Tools. We are indeed an NSX shop with Deep Security 10. Interestingly, in our other VDI environment that isn't having the issue, we are still using vShield and DS 9.6. I'm wondering if something related to this upgrade/migration to NSX/DS10 is the culprit.
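If you want to check a whole batch of desktops for the same 7026 failure, a rough Python sketch like the one below pulls the driver names out of an exported row. The pipe-delimited column order (level, timestamp, source, event ID, task category, message) is an assumption based only on the single line quoted above, and `failed_drivers` is a made-up helper name.

```python
# The 7026 row quoted above, as exported from the System log.
EVENT_LINE = (
    "Error | 9/14/2017 11:50 | Service Control Manager | 7026 | None | "
    "The following boot-start or system-start driver(s) failed to load: "
    "ftsjail vnetflt |"
)


def failed_drivers(line: str) -> list[str]:
    """Return the driver names listed in a Service Control Manager 7026 row."""
    fields = [field.strip() for field in line.split("|")]
    # Assumed columns: level, timestamp, source, event ID, category, message.
    if len(fields) < 6 or fields[3] != "7026":
        return []
    # Everything after the fixed phrase is a space-separated driver list.
    _, _, names = fields[5].partition("failed to load:")
    return names.split()


print(failed_drivers(EVENT_LINE))
```

Run over the export from each VM, this would show whether vnetflt fails to load only on the desktops that also miss their GPO pull.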
Looks like it may be NSX related. When we disabled the NSX deployment on a cluster where the issue exists, I was able to refresh 50 VMs 3 times in a row, and 0 of the VMs had the issue. When I deployed NSX guest introspection back to the cluster, 1 refresh brought back 7 bad VMs and another refresh brought back 11 bad ones. I've updated my ticket with support and am waiting to hear back.
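For anyone comparing their own numbers, the counts above work out as in this small Python sketch. It assumes the two refreshes after redeploying guest introspection covered the same 50 VMs, which the post implies but does not state. The `RefreshRun` class and its labels are only illustrative.

```python
from dataclasses import dataclass


@dataclass
class RefreshRun:
    label: str
    refreshed: int  # number of VM refreshes in the run
    bad: int        # VMs that came back without their GPOs applied

    @property
    def failure_rate(self) -> float:
        return self.bad / self.refreshed


# Counts taken from the post; the pool size of 50 for the last two runs is assumed.
RUNS = [
    RefreshRun("NSX disabled, 3 refreshes of 50 VMs", 150, 0),
    RefreshRun("Guest introspection redeployed, refresh 1", 50, 7),
    RefreshRun("Guest introspection redeployed, refresh 2", 50, 11),
]

for run in RUNS:
    print(f"{run.label}: {run.bad}/{run.refreshed} = {run.failure_rate:.0%}")
```

That puts the two bad refreshes at 14% and 22% of the pool, against 0 failures in 150 refreshes with NSX disabled.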
aaaaaaaaaaand I may have spoken too soon. Seems the issue might be with our anti-malware product, Deep Security 10. With NSX deployed, but the Deep Security appliances disabled on our test cluster, the issue doesn't happen. I deploy and refresh the pool over and over, and nope, no issue. We upgraded our VMware Tools to 10.1.10 and then brought back NSX. When we didn't see the issue, we said, ah, that's it: 10.1.10 needs to be there to fix NSX. But then we activated Deep Security.......
Once I activated the Deep Security appliances, the issue reared its head. What's odd is the way it comes back. Upon an initial deployment, all goes well. The 50 VMs spit out and they activate fine. It's when I refresh them that the desktops randomly experience the issue. And it's never the same ones each refresh. Some have the issue, some don't. Some on this host, some on that.
I have to think it's how DS handles the refreshing. Upon desktop deployment, when a new VM is created, DS is told to do nothing until 10 minutes pass, then activate the VM for protection. Upon refresh, they're not considered newly created VMs. They were there already, just refreshed. I assumed DS would treat it as if the VM was rebooted. It doesn't know the refresh happened; it just sees a VM that was off and now it's back. But I guess something happens when they first come back that causes a network blip, and they miss their GPO pull.
Anyway, I have a ticket into Trend Micro now as well. Figured between them and View/NSX, we should be able to nail this. Maybe someone else will have experienced something similar and reply, so I can stop talking to myself.
Little bit more testing confirmed my previous post.
If I restart these problem desktops (NOT refresh, just a normal restart), they come up fine. It's something about refreshing and Deep Security.
So, it looks like Trend Micro will be where I start, with some VMware assistance as well.
If by chance anyone sees this issue again, we found the culprit. Can't explain why, but, we found it.
It turns out we needed to exclude the file C:\Windows\ntbtlog.txt from being scanned by Deep Security. If we do that, on refresh the VMs no longer seem to hit the random network blip that caused the initial GPO pull to fail. I can't find a single KB or best-practice guide that says anything about this file at all, let alone about excluding it.
If someone is aware of what kind of relationship ntbtlog.txt has to the Refresh process, I'd love to hear about it. Again, on a full recompose or a fresh deployment, the issue never happened. Upon refresh, about 1 in 4 or 5 VMs would experience it. Once we excluded the file from being scanned, we were able to refresh over and over without the issue.
Thanks.
Hello,
I'm facing the same problem, but I'm not using NSX or any endpoint, so it's hard to judge this issue. I think the optimization tool could be a likely cause,
because it never happened with the golden image before it was optimized. Maybe that helps.
Thanks