VMware Cloud Community
njcmdrx
Contributor

Windows 2008 R2 domain controller sometimes goes to 100% CPU during a snapshot or vMotion

I have an environment with roughly 17 ESX hosts, 1000 VMs, 100 SQL servers, 500 Citrix servers and 20 or so domain controllers covering just as many different domains.

For some years now, we have been getting random issues with domain controllers going to 100% CPU in the middle of the night. A few boxes seem more likely to hit 100% than others, and some never do it at all, but it's completely unpredictable.

I have identified some 'triggers' over time:

1. Initially, we noticed that during late-night reboots of several hundred Citrix servers, DRS would kick in, do the snap > move > unsnap process, and cause a DC to go to max CPU. At that point we can't log in anymore and are forced to hard-boot the DC to get it working again.

2. I have seen occasional cases where there was no DRS vMotion or snapshot and a DC still went to 100% CPU, but I can't rule out the snap/unsnap processes of backup and DRS either.

3. Last year we added a SAN-based backup system to back up our virtual images directly from the SAN overnight. Since then, the snap/unsnap process of that backup has started to trigger this 100% CPU problem as well. NONE, as in zero, of our other VMs go to 100% CPU after a snapshot, but once or twice a week a group of DCs has random instances of this 100% CPU hang.

I definitely see some correlation between the 100% CPU incidents and the snap/unsnap activity that takes place during a vMotion, DRS move or snapshot. I see no event log entries in Windows giving any clues. In fact, the event log stops logging at the moment the DC goes to 100% CPU. We could just isolate these boxes and never DRS, vMotion or snapshot them, but that's not completely practical either. I want to understand what is happening so I can mitigate it or stop it from occurring.
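One way to put numbers on that correlation is to line up the incident times against the snapshot and vMotion times per VM. The following is only a minimal sketch: the VM names, timestamps and the 15-minute window are invented for illustration, and in practice the two lists would have to be filled in by hand or exported from vCenter task history and the monitoring system.

```python
from datetime import datetime, timedelta

def correlate(incidents, operations, window=timedelta(minutes=15)):
    """For each incident, list operations on the same VM that started
    no more than `window` before the incident."""
    results = []
    for vm, t_incident in incidents:
        preceding = [
            (kind, t_op)
            for op_vm, kind, t_op in operations
            if op_vm == vm and timedelta(0) <= t_incident - t_op <= window
        ]
        results.append((vm, t_incident, preceding))
    return results

# Hypothetical sample data: (vm, operation kind, start time)
operations = [
    ("DC01", "snapshot-remove", datetime(2016, 3, 1, 2, 10)),
    ("DC02", "vmotion", datetime(2016, 3, 1, 1, 0)),
]
# Hypothetical sample data: (vm, time the DC was found at 100% CPU)
incidents = [
    ("DC01", datetime(2016, 3, 1, 2, 14)),
    ("DC02", datetime(2016, 3, 1, 3, 30)),
]

for vm, when, ops in correlate(incidents, operations):
    label = ", ".join(kind for kind, _ in ops) or "no operation in window"
    print(f"{vm} {when:%Y-%m-%d %H:%M}: {label}")
```

Incidents that come back with "no operation in window" would be the ones described in trigger 2, where no snapshot or vMotion seems to be involved.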

We followed the best practices for virtualizing domain controllers when we implemented this, and I am now going back and re-reading them to see if anything was missed.

I wanted to post here to see if anyone else has had to battle this problem before and if they had any fixes or thoughts.

2 Replies
jpsider
Expert

Have you reviewed the event logs to see if anything funny is going on? Also, it's possible VMware Tools is sucking up the CPU during the underlying VMware operations.

Have you ever been able to check Task Manager to see the offending process?
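Since the box can't be logged into once it is pegged, one option is to capture the process list on a schedule and compare the last two samples after the fact. Here is a rough sketch, assuming the samples are saved output of `tasklist /v /fo csv` on an English-language Windows install (the column names "Image Name", "PID" and "CPU Time" are what that command prints there; localized systems will differ). The function names are made up for this example, and the sample rows are invented and trimmed to three columns.

```python
import csv
import io

def parse_cpu_time(text):
    """Convert tasklist's H:MM:SS cumulative CPU time into seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def parse_tasklist(csv_text):
    """Map PID -> (image name, cumulative CPU seconds) from tasklist CSV text."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return {
        row["PID"]: (row["Image Name"], parse_cpu_time(row["CPU Time"]))
        for row in rows
    }

def top_cpu_consumers(before, after, limit=5):
    """Rank processes by CPU seconds used between two samples."""
    deltas = []
    for pid, (name, cpu_after) in after.items():
        # A PID missing from `before` started between samples,
        # so all of its CPU time counts toward the delta.
        cpu_before = before.get(pid, (name, 0))[1]
        deltas.append((cpu_after - cpu_before, name, pid))
    return sorted(deltas, reverse=True)[:limit]

# Invented sample data, not taken from a real domain controller.
HEADER = '"Image Name","PID","CPU Time"\n'
before = parse_tasklist(
    HEADER + '"lsass.exe","548","0:10:00"\n"vmtoolsd.exe","1320","0:01:00"\n'
)
after = parse_tasklist(
    HEADER + '"lsass.exe","548","0:10:55"\n"vmtoolsd.exe","1320","0:01:02"\n'
)

for seconds, name, pid in top_cpu_consumers(before, after):
    print(f"{name} (PID {pid}): {seconds}s CPU since previous sample")
```

The comparison keys on PID only, so a PID that was reused by a different process between samples would give a misleading delta. Short sampling intervals make that less likely.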

Have you increased the CPUs or memory to try to resolve it? (I doubt that would work, but why not ask!)

BoneTrader
Enthusiast

Are all cores at 100%?

I think we had a similar problem once. If I remember correctly, we updated the Windows kernel and everything went back to normal.
