VMware Communities
DMancini_XDS
Contributor

Sanity Check - VM Freezing, No Obvious Reason (to Me)

I have a situation where one of my VMs keeps freezing at random times for no obvious reason.  I can't touch the VM in Workstation at all, not even to "cut the power," and it vanishes from the network completely.  But the vmware-vmx.exe process is still running (and throws 'Access Denied' if I try to kill it).

 

SETUP:

I'm running Workstation Pro 17.5.1 build-23298084 on a Windows Server 2019 host, with 2 Windows Server 2019 guests.

Host and guest OSes are the same version: Windows Server 2019, 64-bit (Build 17763.5458) 10.0.17763.

 

HOST:

  • Dell PowerEdge R630 (iDRAC reports no hardware faults)
  • 2x 750W power supplies
  • 128 GB RAM @2400 MHz (4 x 32GB, 2 per Physical CPU), Multi-bit ECC
  • 2x 8-Core CPUs (Intel Xeon E5-2667v4 @ 3.20 GHz), 16 physical cores / 32 logical cores
  • OS Drive - RAID-1 (2x 465 GB SSD, Adaptive Read Ahead, Write Back)
  • VM Drive - RAID-10 (6x 223 GB SSD, Adaptive Read Ahead, Write Back)

 

2x GUESTS:

  • 1 processor (8 cores)
  • 32 GB RAM

 

Now, according to my arguably uneducated understanding of the way virtualization works, I don't *think* I've over-scheduled the resources on my host: I've left 64 GB of RAM on the table (there are virtually no apps running on the metal, just a file share); I've used half of the logical core count; and the VMs are stored entirely (.vmx and .vmdk files) on the second drive, not the host's OS drive.

 

Is there a problem with my setup that would cause a VM to simply go AWOL?  Have I failed to understand proper resource scheduling somehow?  Or should I consider that my hardware may simply be failing?

 

I've attached the log from today; something seems wrong just from looking at it, but I'm not versed enough in diagnosing these things to say exactly what.

8 Replies
DMancini_XDS
Contributor

I've just now completed a full ePSA Hardware Test, and all tests passed.

RDPetruska
Leadership

If you aren't using the virtual CD-ROM, leave it disconnected.

If you are running both of those guests at the same time, then you're using 16 cores, leaving nothing for your host OS to do any work.  Try reducing the number of cores.

bluefirestorm
Champion
(Accepted solution)

Is Hyper-V, or any component that uses Hyper-V (such as VBS or WSL2), enabled on the host? Hyper-V on the host results in the slower Microsoft Hypervisor API being used instead of native ring 0 Intel VT-x calls. Check msinfo32 on the host, or the vmware.log of any VM, and look for "Monitor Mode". The text "ULM" indicates the slower hypervisor path, while "CPL0" indicates ring 0 Intel VT-x.
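
If you want to check several VMs at once instead of opening each log by hand, a few lines of Python on the host will do it. This is only a quick sketch, not a VMware tool; the folder path is an assumption, so point it at wherever your VMs actually live.

# Scan every vmware.log under the VM folder and print its "Monitor Mode" line.
import pathlib

vm_root = pathlib.Path(r"D:\VMs")   # assumption: change to your own VM drive/folder

for log in vm_root.rglob("vmware.log"):
    with open(log, errors="ignore") as f:
        for line in f:
            if "monitor mode" in line.lower():
                # Expect something like "Monitor Mode: CPL0" (VT-x) or "ULM" (Hyper-V present)
                print(f"{log}: {line.strip()}")
                break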

Since the host hardware has two CPU sockets, "Node Interleaving" should be disabled in the host UEFI for better performance. Windows Server 2019 is NUMA-aware, so it is fine to disable node interleaving.

Next, set CPU affinity for the 2 VMs. This can be set either from Task Manager (on each vmware-vmx.exe process) or via the vmx configuration, so that each VM keeps RAM locality and retains as much L1/L2/L3 cache as possible. If a VM gets scheduled on one CPU socket and then swapped over to the other CPU the next time, whatever L1/L2/L3 cache the processes inside the VM had built up is wiped out, on top of the more expensive cross-CPU RAM access compared with "local" RAM.

One VM would have the following lines in its vmx configuration, while the other would have the TRUE and FALSE values flipped, so VM1 always runs on CPU0 and VM2 on CPU1. By extension, any VM on this specific host hardware should be configured with 16 vCPUs or fewer.

Processor0.use = "TRUE"
Processor1.use = "TRUE"
Processor2.use = "TRUE"
Processor3.use = "TRUE"
Processor4.use = "TRUE"
Processor5.use = "TRUE"
Processor6.use = "TRUE"
Processor7.use = "TRUE"
Processor8.use = "TRUE"
Processor9.use = "TRUE"
Processor10.use = "TRUE"
Processor11.use = "TRUE"
Processor12.use = "TRUE"
Processor13.use = "TRUE"
Processor14.use = "TRUE"
Processor15.use = "TRUE"
Processor16.use = "FALSE"
Processor17.use = "FALSE"
Processor18.use = "FALSE"
Processor19.use = "FALSE"
Processor20.use = "FALSE"
Processor21.use = "FALSE"
Processor22.use = "FALSE"
Processor23.use = "FALSE"
Processor24.use = "FALSE"
Processor25.use = "FALSE"
Processor26.use = "FALSE"
Processor27.use = "FALSE"
Processor28.use = "FALSE"
Processor29.use = "FALSE"
Processor30.use = "FALSE"
Processor31.use = "FALSE"
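
If you would rather not edit the vmx files, the Task Manager approach can also be scripted. The sketch below is only an illustration, not an official method: it assumes the psutil package is installed on the host and that logical processors 0-15 and 16-31 correspond to the two sockets, the same assumption as the vmx example above.

# Pin each running vmware-vmx.exe process to one socket (rough equivalent of setting affinity in Task Manager).
# Assumptions: psutil is installed, CPUs 0-15 = socket 0 and 16-31 = socket 1,
# and this may need to be run elevated to touch the vmware-vmx.exe processes.
import psutil

SOCKET0 = list(range(0, 16))
SOCKET1 = list(range(16, 32))

vmx_procs = [p for p in psutil.process_iter(["name"]) if p.info["name"] == "vmware-vmx.exe"]
for proc, cpus in zip(vmx_procs, [SOCKET0, SOCKET1]):
    proc.cpu_affinity(cpus)          # set the affinity mask for this VM's process
    print(proc.pid, proc.cpu_affinity())

Unlike the vmx lines, this has to be reapplied after every VM power-on, since the affinity is set on the running process; which VM lands on which socket also just follows process order here.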

DMancini_XDS
Contributor

So from my research, I'm getting absolutely contradictory information on this.

 

There seem to be two camps:

  1. You base your vCPU count off of the LOGICAL core count, and as long as loads aren't too high, you can even over-schedule.
  2. You base your vCPU count off of the PHYSICAL core count, and don't even THINK about over-scheduling under any circumstance.

 

What's going on here?

DMancini_XDS
Contributor

So far, the only significant things I've actually been able to change were (1) setting the processor affinity and (2) disabling the virtual CD-ROM in one VM (I had already disabled it in the other VM, the one that had been freezing).

 

The server BIOS already had node interleaving disabled, Hyper-V / WSL was never enabled, and vmware.log showed CPL0 for the Monitor Mode.

 

Furthermore, after verifying all of the things suggested above (EXCEPT changing my vCPU counts), I ran a GIMPS/Prime95 torture test inside both VMs simultaneously just to see what would happen to the host CPU usage.  With both VMs pegged at 100% CPU simultaneously, the host was reporting... 53% total utilization.  I kind of don't think I'm over-scheduling resources.  There was absolutely no instability anywhere during the test either.
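
(For what it's worth, that number roughly matches what simple arithmetic would predict, assuming each guest's 8 vCPUs land on distinct logical processors: 2 VMs x 8 vCPUs = 16 of the host's 32 logical processors at 100%, i.e. about 50%, with the remaining few percent presumably being host overhead.)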

bluefirestorm
Champion

Personally, I ignore #2. Intel VT-x has matured to the point that both the frequency of VM exits (a VM having to drop back to the hypervisor, much like an ordinary process context-switching with the OS) and the cost of those transitions (the number of clock cycles spent) are greatly reduced. Virtual RAM is also now managed by the CPU rather than through VMware software. And if you go back to the mid-2000s, when multicore CPUs were not so common and virtual RAM was managed in software, you still HAD to run multiple VMs, so that advice would have been useless; sure, it could sometimes be as slow as molasses, but what choice did you have?

And if you have Hyper-V involved, things will be slower and the chances of VM transitions go up, because VMware has to go through the Hyper-V API instead of running its monitor at ring 0.

It really boils down to the workload of the VM(s) and the host.

Anyway, the 16-vCPU limit for your R630 is about RAM locality and L1/L2/L3 cache. L1/L2 cache lives in each core, shared by its two hyperthreads, while L3 cache is shared across the cores in one socket. Once a process (whether on the host or in a VM) has to cross over to the other CPU, the L1/L2/L3 cache it built up is no longer available, and it has to go back to RAM, which this time is attached to the other CPU.

For example, you don't want a situation where you assign 24 vCPUs to a database/application server VM, it retrieves a large amount of data on CPU0, then gets scheduled on CPU1 to do the sorting, and now has to ask CPU0 for the data that was already sitting in CPU0's RAM. CPU0 keeps getting interrupted to fetch data from its RAM for CPU1 and can't get on with work for other processes (whether on the host or in a VM). This is just a simplistic illustration of the additional overhead when RAM locality and the L1/L2/L3 cache advantages are lost.

If the dual CPUs were 20c/40t each, you'd still be fine creating a monster 32-vCPU VM, as RAM locality could still be achieved and there would still be a good chance of retaining the L1/L2/L3 cache.

 

DMancini_XDS
Contributor

I'll have to continue to monitor carefully, but it appears that setting the processor affinity for each VM made all the issues I was having disappear, including the "untouchable" black screens in Workstation (an issue about which I started a separate thread) and the freezing.

DMancini_XDS
Contributor

Just wanted to put one final update here — since I manually set the processor affinity for each virtual machine in their respective .vmx files, neither of my VMs has frozen even once in the last two weeks.

 

Thank you, again, blue.
