VMware Horizon Community
epa80
Hot Shot

Defender On Instant Clones

Was looking for some feedback from anyone out there using Microsoft Defender on Instant Clones. I'm going off this KB here from Microsoft, just trying to put the pieces together.

 

Onboard non-persistent virtual desktop infrastructure (VDI) devices | Microsoft Docs

 

This part specifically has me a little confused. I'm waiting for some feedback from Microsoft on what they mean by "single entry" or "multiple entries":

Depending on the method you'd like to implement, follow the appropriate steps:

  • For single entry for each device:

    Select the PowerShell Scripts tab, then click Add (Windows Explorer will open directly in the path where you copied the onboarding script earlier). Navigate to onboarding PowerShell script Onboard-NonPersistentMachine.ps1. There's no need to specify the other file, as it will be triggered automatically.

  • For multiple entries for each device:

    Select the Scripts tab, then click Add (Windows Explorer will open directly in the path where you copied the onboarding script earlier). Navigate to the onboarding bash script WindowsDefenderATPOnboardingScript.cmd.

 

We have our gold image squared away and ready to go (Defender enabled, up to date, etc), just kind of hung up at this script part. If anyone has gone through this already and has some tips, it would be much appreciated to hear about them.

38 Replies
Jubish-Jose
Hot Shot

We deployed Defender on instant clones a month ago.

Single entry and multiple entry simply refer to how the machines appear in the Defender portal. If you choose single entry, you will have a single machine entry irrespective of the instant clone lifecycle (refresh/resync, which is essentially delete and recreate). But if you choose multiple entry, you will see two entries after a delete and recreate operation. I'm not sure what the use case for the second one is, though.
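
To illustrate the difference, here is a minimal Python sketch of how the two modes could play out in the portal. It is only a toy model, not how Defender is implemented: the function name `simulate_portal` is made up, and it assumes that single-entry mode keys an entry by host name while multiple-entry mode registers a new device ID on every fresh onboarding.

```python
import uuid

def simulate_portal(hostnames_over_time, single_entry):
    """Return the portal entries left after a sequence of clone (re)creations.

    Assumption: in single-entry mode an entry is keyed by host name, so a
    recreated clone reuses it; in multiple-entry mode every fresh onboarding
    registers under a new device ID.
    """
    entries = {}
    for hostname in hostnames_over_time:
        key = hostname if single_entry else str(uuid.uuid4())
        entries[key] = hostname
    return entries

# The same clone name is created, then deleted and recreated once.
lifecycle = ["VDI-001", "VDI-001"]
single = simulate_portal(lifecycle, single_entry=True)
multiple = simulate_portal(lifecycle, single_entry=False)
print(len(single), len(multiple))  # prints "1 2"
```

With a large pool that refreshes on every logoff, the multiple-entry count grows with each recreate, which is why single entry tends to be the more manageable choice for instant clones.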

This blog explains it in detail: https://techcommunity.microsoft.com/t5/microsoft-defender-for-endpoint/onboarding-and-servicing-non-... 

 


-- If you find this reply helpful, please consider accepting it as a solution.
epa80
Hot Shot

Thanks for the reply. We're going with the Single entry method. Our Defender admin seems to not be seeing what he EXPECTED to see in the Defender console, so he's opening up a ticket with MS. Pretty sure we've followed all the steps outlined here:

 

Onboard non-persistent virtual desktop infrastructure (VDI) devices | Microsoft Docs

 

Did you have to create any script tasks on pool creation, like in the Horizon console itself?

 

Thanks.

Jubish-Jose
Hot Shot

Are you seeing more than one entry in Defender?

I think the method outlined by MS may not work for Horizon. When the onboarding script is kept in the gold image, the script will run when the cp-template and cp-replica VMs are created from the snapshot. The SenseID and SenseGUID will be populated in these internal VMs, and all the clones will have the same SenseID, which is not what we want.

I created a simple bat script which calls the onboarding script and added it as a post-synchronisation script in the pool settings, and this seems to work for us so far.
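
The reasoning above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Horizon or Defender code: the `VM` class and the `onboard`, `clone`, and `provision` helpers are hypothetical, and the model assumes onboarding keeps an existing ID rather than generating a new one.

```python
import uuid
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class VM:
    name: str
    sense_guid: Optional[str] = None

def onboard(vm):
    """Model onboarding: assign an ID only if the VM does not already have one."""
    if vm.sense_guid is not None:
        return vm
    return replace(vm, sense_guid=str(uuid.uuid4()))

def clone(parent, name):
    """Model cloning: the child inherits whatever state the parent holds."""
    return replace(parent, name=name)

def provision(names, onboard_in_template):
    template = VM("cp-template")
    if onboard_in_template:
        # The startup script from the gold image fires inside the internal VM.
        template = onboard(template)
    clones = [clone(template, n) for n in names]
    # Onboarding then runs on each clone (for example as a post-sync script).
    return [onboard(c) for c in clones]

names = ["VDI-001", "VDI-002", "VDI-003"]
shared = {vm.sense_guid for vm in provision(names, onboard_in_template=True)}
unique = {vm.sense_guid for vm in provision(names, onboard_in_template=False)}
print(len(shared), len(unique))  # prints "1 3"
```

When the internal template is onboarded first, every clone ends up with the same ID; when onboarding only happens on the clones themselves, each one gets its own.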


rhawkins01
Contributor

We are also running into issues with the cp-template and cp-replica VMs onboarding during provisioning. Can you share details on the post-synchronization script you created? We've tried this same thing, but we get errors during provisioning that customization timed out. We are going to try a scheduled task that executes the onboarding.ps1 on a 20-minute delay, but I don't like that machines aren't immediately onboarded.

Jubish-Jose
Hot Shot

Check this blog. You need to increase the script timeout.

https://modernenduser.wordpress.com/2020/01/29/on-boarding-vmware-horizon-view-instant-clone-vdi-poo... 


epa80
Hot Shot

Much appreciate the replies. I lost track of this thread and just hopped back in.

After some trial and error, and the helpful info posted here, we're settled on the following.

  1. Because we put the onboarding script in the gold image as a startup script, none of the clones are onboarding properly. This is likely because of how ICs are provisioned. It makes sense, but we missed it on the initial push.
  2. We found on Technet this script to clear any existing onboarded info:
    epa80_0-1669052224635.png

     

  3. The script provided by the blog popvm is what we're going to use in conjunction with the above "deleting" script.

We're thinking we likely have to republish the pool and add the combined scripts as perhaps a post deployment task (or in the Horizon pool guest customization itself), but we're also looking into DEM or maybe a GPO to be more dynamic.

Any suggestions I'd love to hear. Again thanks for the replies as it's started to get us on the right track.

-Ed

epa80
Hot Shot

Feels like we're almost there. Manually performing the steps gets us where we want. Tristan's blog was a big help.

The one hiccup we're seeing, though: if we create the post-sync task like he details, the VMs in the pool error out. It looks like the post-sync task bombs after a timeout. We did try extending the timeout using the registry key modification he provided, but it still failed. Since these are Instant Clones and his key is worded like it applies to Composer, is there possibly a newer reg key to modify?

Thanks.

epa80
Hot Shot

After identifying some human error, we think we finally have the onboarding process figured out. As I stated in an earlier post, we cleared all onboarding info on the gold image and put together a script that deletes any Defender senseGUID (none should be there, but just in case) and then runs the VDI onboarding PowerShell script. We placed both steps in a .bat and made it a post-sync task. We also edited the timeout on custom tasks to be 60 seconds.
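
For anyone trying to picture the order of operations, here is a rough dry-run sketch in Python that only builds the two commands and prints them. It is not the actual .bat from this thread. The registry path in `SENSE_KEY`, the `C:\Scripts\Onboarding` folder, and the `build_post_sync_steps` helper are all assumptions for illustration; only the script name `Onboard-NonPersistentMachine.ps1` comes from the Microsoft article quoted earlier.

```python
from pathlib import PureWindowsPath

# Assumption: registry location of the Defender for Endpoint device ID.
SENSE_KEY = r"HKLM\SOFTWARE\Microsoft\Windows Advanced Threat Protection"

def build_post_sync_steps(script_dir):
    """Return the commands a post-sync wrapper would run, in order (dry run only)."""
    onboarding = PureWindowsPath(script_dir) / "Onboard-NonPersistentMachine.ps1"
    return [
        # 1. Remove any leftover device ID (expected to be absent on a clean image).
        ["reg", "delete", SENSE_KEY, "/v", "senseGuid", "/f"],
        # 2. Run the VDI onboarding script.
        ["powershell.exe", "-NoProfile", "-ExecutionPolicy", "Bypass",
         "-File", str(onboarding)],
    ]

# Hypothetical script folder; nothing is executed, the commands are only printed.
steps = build_post_sync_steps(r"C:\Scripts\Onboarding")
for step in steps:
    print(" ".join(step))
```

Keeping the cleanup before the onboarding call matters: if a stale ID survived in the image, onboarding would otherwise pick it up and every clone would report under the same device.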

Seems to run fine.

Kind of.

So, we tested it on a pool of 25 VMs, and all went great last week. No issues whatsoever, smooth, showed up as onboarded in the Defender console, tested logging off and deleting VMs to see they came back clean, all good. Tonight though, we republished a pool of 750 VMs. It had 150 sessions so about 600 did the push. Around 480 came back fine, looking good, but then all of a sudden the remaining ones hung on customizing forever, and eventually errored out. I tried cleaning them, but the only way I could get them to stop going into the error state, was removing the post sync task of running the script. So now about 480 are sitting in the Defender console running, but the remaining won't until I figured this part out.

All I could see that looked off was that the Antimalware Service Executable in Windows seems to be CRUSHING the VMs randomly, at least in disk usage. It sits at 100% for a few minutes, I'd say, and when it drops, it still hovers high, around 75%. It'll spike when I open new apps. EVENTUALLY, after I'd say 15 minutes, the system will cool off. I can only assume that's what the bottleneck was tonight. The first 480 provisioned and onboarded, but the disk hammering dominoed and became too much to overcome, so those other 100 or so just hung up and errored out.

As you can see the disk seems to be cooled off, memory a bit up there but I don't seem to feel it:

epa80_0-1669699942581.png

 

Any info on this would be greatly appreciated. The thread has definitely gotten us further, hopefully this last hurdle is indeed a last one.

epa80
Hot Shot

To give an example of how it looks when the issue happens:

epa80_0-1669729824684.png

 

epa80
Hot Shot

I did get a reply on a Microsoft thread, and the person there recommended we look at building out the shared intelligence server workflow. We were told by a Microsoft rep prior to our deployment that they didn't really think we needed the SIU server, and that Windows/Microsoft Update handling the intelligence pushes would be fine. It seems this tech community support member may disagree. They even specifically said:

Updating the SIU also requires the machine to use disk resources to download the update, extract, and apply it.

Disk resources being used sounds like what we're seeing. They did say they think this has nothing to do with Onboarding, but I'm skeptical. This disk spiking issue, we're not seeing it in other pools that have Defender active, but DO NOT have any onboarding happening.

rhawkins01
Contributor

Do you have any exclusions configured for Horizon, AppVolumes, etc.? If anything is attaching at login, ATP could be really curious why a 20GB virtual disk just popped up. We also configured a policy to set the average CPU load factor to 5% from its default. It doesn't prevent spikes, but averages out to be less than 5% CPU utilization.

If your post-sync scripts are still causing errors, check them and make sure they aren't blocked by any "mark of the web" tags. We were having the same issue no matter how long we extended the post-sync timer and then noticed a .ps1 was blocked. Ultimately, what we ended up doing today that has worked was building a scheduled task that is created and executed when the instant clone builds. This scheduled task runs the non-persistent machine onboarding.ps1 from Microsoft which then runs the onboarding.cmd. So far, this has worked successfully on 3 IC desktop pools and around 800 machines. Each instant clone has received a unique SenseGUID variable and is reporting correctly in the security console.
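
On the "mark of the web" point: Windows records it in an NTFS alternate data stream named `Zone.Identifier` attached to the file. Below is a small Python sketch that parses the text of such a stream and decides whether the file would count as blocked. The helper names `zone_id` and `is_blocked` are made up for illustration, and the sample stream content and URL are invented; treating zones 3 (Internet) and 4 (Restricted) as blocked is the assumption used here.

```python
import configparser

def zone_id(stream_text):
    """Parse the text of a Zone.Identifier stream; return the ZoneId or None."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(stream_text)
    if not parser.has_option("ZoneTransfer", "ZoneId"):
        return None
    return parser.getint("ZoneTransfer", "ZoneId")

def is_blocked(stream_text):
    """Assumption: zone 3 (Internet) and zone 4 (Restricted) count as blocked."""
    zid = zone_id(stream_text)
    return zid is not None and zid >= 3

# Invented sample of what a downloaded script's stream might contain.
downloaded = "[ZoneTransfer]\nZoneId=3\nHostUrl=https://example.com/onboarding.zip\n"
print(is_blocked(downloaded))  # prints "True"
print(is_blocked(""))          # prints "False" (no stream content, no mark)
```

On the gold image itself, the stream text would have to be read from the script file's `Zone.Identifier` stream; the PowerShell cmdlet `Unblock-File` removes the mark.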

paulmike3
Enthusiast

@rhawkins01- What trigger did you use for the scheduled task at build time on the instant clones? We tried a scheduled task that runs the MS onboarding script at user logon and it worked, but InfoSec doesn't like instant clones sitting there not onboarded until a user logs in. They want them onboarded as soon as the IC is online.

epa80
Hot Shot

We do have exclusions for Horizon, but that's really it for VMware products on our image. We do use DEM but I don't know that we ever had exclusions for them. It's been in place for years.

For now, we're going to build out the Shared Intelligence servers, turn back on the Onboarding Post-Sync task and see what we get. We also have a ticket opened with Microsoft. At one time a Microsoft rep told us not to worry about the SI servers, but the writer of the Onboarding Non-Persistent VDI article on Microsoft's site is saying to do it. We figure at least have it for the purposes of the ticket.

epa80
Hot Shot

@rhawkins01,

Thanks for the reply. I don't THINK we're quite seeing the same issue. I think our post-sync task, running the onboarding script on VM creation, *IS* working ok, but something about it is causing our disk usage to skyrocket. We're not sure if it's too many onboarding at once, or a scan happening we're not aware of, etc.

Jesse from Microsoft, who wrote the Onboarding guide, is pointing us at setting up the Security Intelligence server. He seems to think pulling defs from WU/MU may be our issue. We're going to move on to that next and see where we get.

I'm involved in this thread over there: Configuring Microsoft Defender Antivirus for non-persistent VDI machines - Microsoft Community Hub

epa80
Hot Shot

We're still seeing issues unfortunately. This is what I wrote to Microsoft.

 

Our environment is still seeing random incidents of large Horizon instant clone pools experiencing the high disk usage. This is even though we've completely removed the onboarding process on the gold image to help isolate the issue further. This seems to reinforce what you stated earlier, that the issue isn't onboarding related.

 

We still have not built a Security Intelligence server but we are looking to today. I feel like this is also not our issue honestly though, as the problematic pool we saw yesterday has been spun up for a while, and also was republished Sunday afternoon. If a SIU was going to crush it, would we not see it sooner? We went through Sunday, Monday, and all day Tuesday until 2PM Eastern when we saw it.

 

Right now my suspicion is on the Defender GPO we have applied to where these machines live. The policy being used there was not designed specifically for VDI, it was actually a server GPO that our Defender admins seemed fine copying and just adding VDI specific exclusions for, but I'm thinking this was probably not the way to go. We're seeing things in there such as:

epa80_0-1670423092221.png

 

 

 

I really feel perhaps there is a scan/randomized task that is happening on our pools, that causes that issue. From 2PM until maybe 9PM yesterday we were seeing the issue, then without really any kind of change, it stopped. Implying that Defender runs normal, until something happens that starts killing disk, that eventually does end.

 

Sorry if this is all over the place, we were up pretty late. In a nutshell: we're going to review that GPO and see if we can test modding it on a dev OU/Horizon pool. I'm in favor of just removing Defender's GPO outright and having plain vanilla settings to test for a bit, perhaps with Horizon exclusions only, but that's where we're starting. We still haven't had any luck getting Microsoft support on a call, but we're hoping to push that some.

 

Thanks in advance.

 

Edit: wanted to add, as a troubleshooting step last night, we cloned the problematic image to a separate data center/ESXi host hardware and spun up 2 spare pools. The provisioning on them was BRUTAL, the exact same behavior. Once we hit around 800 VMs, the remaining 400 to provision were terrible. We looked at it, and the disk usage was high. We took the image again and this time disabled Defender outright on the image by disabling these services: Microsoft Defender Antivirus Service, Security Center, Windows Defender Advanced Threat Protection, and Windows Security Service. After disabling them, snapping, and republishing the pool, it went smooth as silk.

Jubish-Jose
Hot Shot

Happy to hear that your onboarding part worked fine!

I would first make sure that the security intelligence updates are done via a server rather than each machine downloading and extracting them. Also, I hope you have run a scan in the gold image before sealing it for cloning.

Attaching our scan GPO settings if that helps in some way. We also have a bunch of scan exclusions configured.

We also hit an issue with 100% CPU when the user logs into the VMs and Advanced Threat Protection was causing it. We had a few discussions with Microsoft, but their support was poor and finally we ended up doubling the CPU (we were planning it for some time, so we were prepared). 

Good luck!


epa80
Hot Shot

Thanks so much for the reply/screenshot. We're going to compare against ours today.

 

We believe we've identified what was causing our VMs to see a disk usage crush inside Windows on deployment. One of our security admins was seeing inside the Defender/Sentinel console a task being kicked off on all the deployed VMs, but we couldn't find it on them. It was really driving us nuts. We decided to crack open the gold image and look there for something else, and we happened upon it. It's disabled now, but last night this task "Windows Defender Scheduled Scan" was enabled:

epa80_0-1670501579818.png

 

Again, on the deployed VMs OFF THIS SNAPSHOT, the task isn't there. We're still not sure what whacks it after deployment, but it certainly kicks off after deployment. This was the action he was seeing in the console that we were trying to track down, and we eventually traced it back to that task:

epa80_1-1670501698688.png

 

 

If anyone else has run into this, I'd love to hear any input/feedback on your experience. This has been a real doozy. We also were NOT running scans inside the gold image before last night. We ran a Quick Scan before sealing it this time, and it sounds like we'll make that part of our image process.

 

Thanks.

epa80
Hot Shot

And I spoke too soon. We saw similar behavior again today, even though we disabled that task. I don't know. I'm just getting lost.

epa80
Hot Shot

Want to include this in case it's a clue to anyone.

 

Let's isolate this to 3 pools we have on the same image/snapshot. On Monday we saw the high disk usage on all 3 pools. The exact times are a bit gray, but let's say all 3 pools are 900 VMs each. Pool 1 didn't SEEM to feel it. Pool 2 was created and the issue might have started across both pools, but it was not bad. Then we created pool 3 and everything went nuts. No users were here yet, just the pools created. It cooled down and users started using them. Again, same snapshot, same gold, on the same hardware.

 

Users get on, and at 1st (perhaps because the issue cooled off) we don't see any problems. As the usage grows and grows, it suddenly starts, lasts for let's say an hour or so, then stops.

 

Our 1st troubleshooting step was to revert pool #1 to the previous snapshot. We did this but let users logged on already stay on the current snap. So to recap:

 

Pool 1 = Half current snap/Half new

Pool 2 = All New

Pool 3 = All new

 

Suddenly a day later pool 2 and pool 3 are seeing the issue, and the HALF of Pool 1 on the new snap are seeing it as well. The half that got the old snap DO NOT feel it. This would eliminate an idea that because the hardware is getting hammered everyone sees it, but we clearly don't. This gets us back to the snap now being seemingly the issue. That or the clone parent that was spun up on provisioning that all the children clone off of.

 

I am less and less convinced the snapshot is the issue, as I really want to say no changes were made that I can imagine triggered this, specifically changes to Defender/Windows. There really were none between snap 1/snap 2. Really just our enterprise application (Epic) changed, that's all.

 

This gets longer, so bear with me. This same snapshot was used on other pools, on 2 different sets of hardware, that were provisioned at different times than pools 1-3. Pools 4 and 5 were on one set of hardware, pools 6-9 on another. These went with the same snap/changes as what we see as problematic in pools 1-3, but we have YET to see them have the issue. It just hasn't happened. So I cloned their gold, put it on new hardware, and created another 3 pools. We see the issue happening. Again: gold we haven't seen it on, cloned, and the issue seems to occur.

 

This brutally long post has me wondering 2 things:

  1. We still are not using a dedicated Security Intelligence Update server even though the Microsoft doc says to use it. Are we punishing ourselves and running in circles over this?
  2. Can a clone parent get created at a perfect storm time, a full/quick scan let's say kicks in, and when the children are cloned off of him, they repeat this same behavior causing our disk storm? If so, what is the possible way around this? The child VMs off snap 1 seeing the issue but the children on snap 2 NOT, even on same hardware, seems to point a bit this way.

 

Apologize for this convoluted post if you read through it, I know it's a spider web. If you got through and have an idea, still 100% all ears. Thanks again.
