jesseward1
Enthusiast

Virtual Domain Controllers Locking Up

Has anyone experienced issues with virtual domain controllers locking up in VMware?

Assume the following scenario:

Site 1 has two domain controllers on separate hosts in cluster 1.

Dcdiag tests confirm that, when up and running, the DCs are healthy.

I have a third domain controller in site 2, on cluster 2, served by the same vCenter Server.

The vCenter Server is also virtual and a member of the same cluster as the two DCs in site 1.

*All servers are Windows Server 2008 R2.

*All three domain controllers also serve DNS, DFS, and RADIUS.

*All hosts are ESXi 5.1.0 build 2000251.

Any suggestions would be greatly appreciated.

1 Solution

Accepted Solutions
jesseward1
Enthusiast

Looks like VMware support was right with their finding of a memory leak in the vShield driver of VMware Tools: VMware KB: The vShield Endpoint Thin Agent driver (vsepflt.sys) and the vShield Endpoint TDI manager...

After uninstalling the vShield component on the DCs, the issue has gone away. I also noticed that this was affecting my Veeam proxy servers: backup jobs would hang at 99% and lock up the proxy servers, forcing me to reset the guests.

Thanks again for all the help!

17 Replies
jesseward1
Enthusiast

Wanted to bump this with an update. So far I have been running stable for 24 hours after uninstalling VMware Tools from one of the domain controllers in site 1.

bspagna89
Hot Shot

Hi,

Welcome to the community. Can you be more specific when you say lock up? Is the VM accessible via the VMware console? If you go into the console, is it completely frozen or does it come back alive?

What hardware version are the VMs set to? What kind of network adapters are you using? We've all seen weird things when using the E1000, so please ensure you are using VMXNET3.

Look forward to hearing back from you.

New blog - https://virtualizeme.org/
jesseward1
Enthusiast

Thank you!

When the guest locks up, the VMware console goes black and is non-responsive to anything but a reset/shutdown. I have monitors on all critical services (ping, DNS, DFS, etc.), and they all report unavailable. Both guests are hardware version 8 and use the legacy E1000 driver. I have read that VMXNET3 is the recommended driver and have that on my list of changes to make as I troubleshoot the issue.

bspagna89
Hot Shot

Reinstall VMware Tools. I would recommend bumping the hardware version up to 9 (I believe 5.1 supports that). Also, swap the NICs out for VMXNET3 on all of them.

Let's try this and see if it cleans up the lockups.

Let us know how you do 🙂. Also, let me know if you have any questions about anything I've said regarding the how-to or process, etc.

New blog - https://virtualizeme.org/
jesseward1
Enthusiast

Thanks for all your suggestions. I am in the process of changing each of these individually and testing the results. I will post updates as I rule them out.

jesseward1
Enthusiast

Latest update in my troubleshooting of the issue:

The two domain controllers in site 1 have been running healthy now for 48 hours without VMware Tools installed. My previously unaffected domain controller 3 in site 2 encountered the same lockup 13 hours after installing the latest version of VMware Tools. This guest is identical to the others, with the exception that it is VM version vmx-09 instead of vmx-08.

bspagna89
Hot Shot

Does DC3 in site 2 have an E1000 NIC? If so, please change it to VMXNET3 and keep an eye on it.

New blog - https://virtualizeme.org/
jesseward1
Enthusiast

I am leaning towards a VMware Tools problem. DC1 ran all weekend without any issues. Yesterday I reinstalled VMware Tools, and the server locked up last night.

bspagna89
Hot Shot

Did you remove the E1000 NIC?

New blog - https://virtualizeme.org/
jesseward1
Enthusiast

Made the NIC change today. I will update tomorrow with any progress.

bspagna89
Hot Shot

Great. I am pretty confident that could be causing your issue.

New blog - https://virtualizeme.org/
Alistar
Expert

Howdy,

Seeing as this concerned the wrong type of vNIC, perhaps you'd like to scan your environment for any more culprits? Check out my script: Threaded Report Generator: VMs without a vmxnet3 vmnic | VMXP 🙂

Stop by my blog if you'd like 🙂 I dabble in vSphere troubleshooting, PowerCLI scripting and NetApp storage - and I share my journeys at http://vmxp.wordpress.com/
jesseward1
Enthusiast

I have gone 48 hours without any issues since making the NIC change. I'm not entirely sure why this was the solution; I have used the E1000 NIC in this environment for a few years.

So to sum up what I found:

Cause: VMware Tools update 9.0.12 build-1897911 (reproduced on multiple DCs)

Solution: Upgrade the NIC to the VMXNET3 virtual network adapter

@Alistar: Thanks for the comments; I will probably need that script in the future. At the moment our entire environment (150 VMs) is using the E1000 NIC, so I have some hardware cleanup to do.

UPDATE:

Spoke too soon; the issue recurred at the 52-hour mark.

grasshopper
Virtuoso

One thing you may consider: when the GOS in question fails, take a snapshot, then use vmss2core to create a memory.dmp from the .vmsn. Just make sure that when you right-click > Snapshot > Take Snapshot, the "Snapshot the virtual machine's memory" box is in fact checked (this creates the .vmsn). You can copy the .vmsn (it will be the size of the VM's memory, so a 4 GB VM produces a 4 GB file) to your desktop using WinSCP, then run the vmss2core tool against it to create the memory.dmp.

Reviewing the memory.dmp often gives some deeper insight into what's happening under the black screen. Share the memory.dmp with Microsoft Support (they are familiar with this vmss2core technique), or load the appropriate symbols and view it yourself with WinDbg. If this is production, I would share the memory.dmp with Microsoft to be safe.

If performing the above does not give any additional insight, you may need to run Perfmon until failure and review it later. It's possible you are running low on free PTEs or otherwise starving the OS via a memory leak somewhere (e.g., VMware Tools).
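A rough sketch of the copy-and-convert workflow described above (host name, datastore path, and snapshot file names are placeholders; the exact vmss2core flag depends on the guest OS, so verify against the tool's help output):

```shell
# Copy the memory snapshot files off the ESXi host (placeholder paths).
# Depending on the ESXi version, the guest memory may land in a separate
# .vmem file next to the .vmsn; copy both if present.
scp root@esx01:/vmfs/volumes/datastore1/DC1/DC1-Snapshot1.vmsn .
scp root@esx01:/vmfs/volumes/datastore1/DC1/DC1-Snapshot1.vmem .

# Convert the checkpoint state into a Windows memory.dmp for WinDbg.
# -W targets Windows guests (newer Windows releases may need -W8).
vmss2core -W DC1-Snapshot1.vmsn DC1-Snapshot1.vmem
```

The resulting memory.dmp can then be opened in WinDbg with the matching symbols, as suggested above.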

jesseward1
Enthusiast

The issue just occurred again and I took the snapshot as suggested. I look forward to seeing the contents of the memory dump.

On another note, I received a response from VMware support suggesting this KB:

VMware KB: The vShield Endpoint Thin Agent driver (vsepflt.sys) and the vShield Endpoint TDI manager...

I think a suitable test for this will be to reinstall VMware Tools without the vShield drivers and see if the issue still occurs.
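For reference, a silent Tools install that skips the vShield component can be run from the mounted VMware Tools ISO inside the guest. This is a sketch only: the component name "VShield" and the switches below are typical of Tools installers of that era and should be verified against the KB for the specific Tools build before use.

```shell
:: Run inside the 64-bit Windows guest from the mounted VMware Tools ISO.
:: /S /v pass properties through to the MSI; REMOVE=VShield (component
:: name assumed -- check the KB) skips the vShield Endpoint thin agent;
:: REBOOT=R suppresses the automatic reboot.
setup64.exe /S /v "/qn ADDLOCAL=ALL REMOVE=VShield REBOOT=R"
```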

markzz
Enthusiast

I'll add my 2 bobs worth.

It's very unlikely the issue is caused by the NIC driver, particularly as the E1000 is natively supported by the OS.

Although VMware's best practice is to use the VMXNET3 driver on OSes that support it, it is not a requirement. The E1000 works fine, and although it has a marginally higher resource overhead in the VM world than the VMXNET3, the E1000 is in some ways preferable (a lengthy conversation not to be had now).

I have seen VMs lock up but have not attributed the issue to VMware Tools; the issue has always been OS-centric.

BUT

You have identified that the issue is very likely related to VMware Tools; your trial and error has proven this.

What do you see in your OS event logs at or near the time the OS hangs?

VMware Support could be on the money. The Endpoint agent would not be installed if VMware Tools were not installed. Could anyone comment on whether the Endpoint agents are installed with a typical install of VMware Tools?

Jess, could you share your event logs (System and Application) from the affected OS at or near the time of the hang?
