VMware Cloud Community
Memnarch
Enthusiast

CPU scheduler "automatic relation removal" causing VM freeze, ESXi 6.7U2, Threadripper, GPU

Interesting problem; I can't find another example of it listed anywhere.

Four Windows 10 VMs on an ESXi 6.7U2 host, each with a GPU passed through and therefore reserved memory. NUMA is enabled in the BIOS. 8 vCPUs per VM on a 16-core Threadripper 1950X.

Each VM gets occasional freezes (every few hours of use) where the running application becomes unresponsive. Task Manager is still accessible, but you can't actually do anything from it. If the VM is power cycled from the host, it reboots normally. The host does not freeze or crash, and the other VMs are unaffected.

The host error log always gives something like this:

2019-06-26T23:51:42.491Z cpu26:2097396)WARNING: CpuSched: 996: Automatic relation removal from 2100712(vmx-vcpu-0:VM3, VM1) to 2100713(PVSCSI-2100577:1)

2019-06-26T23:51:51.491Z cpu22:2097396)WARNING: CpuSched: 996: Automatic relation removal from 2100566(vmx-vcpu-0:VM2, VM1) to 2100567(LSI-2100430:0)

where "VM1" is the name of a vm.

VM1 is always involved; others may or may not be. The second entry is always PVSCSI or LSI (which are virtual SCSI adapters). The freezes can happen on ANY of the 4 VMs, but VM1 is the least stable. Freezes only occur when the VM that freezes is under load, not at idle. Load on VM1 doesn't seem to freeze the other VMs, only itself, but with freezes this intermittent it's hard to be sure.
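For anyone who wants to check which VMs and adapters show up in these warnings and exactly when, here is a rough Python sketch that pulls the fields out of a line. The regex is modeled only on the two lines quoted above, so other ESXi builds may format the message differently. The function name and the field names (`pcpu`, `names`, `adapter`) are my own labels, not VMware terminology, and splitting the names on the comma is an assumption about how that field is laid out.

```python
import re
from datetime import datetime

# Pattern based solely on the two warning lines quoted in this post.
RELATION_RE = re.compile(
    r"^(?P<ts>\S+)\s+cpu(?P<cpu>\d+):(?P<world>\d+)\)WARNING: CpuSched: \d+: "
    r"Automatic relation removal from (?P<src_id>\d+)\((?P<src>[^)]*)\) "
    r"to (?P<dst_id>\d+)\((?P<dst>[^)]*)\)"
)


def parse_relation_warning(line):
    """Return a dict of fields from one warning line, or None if no match."""
    m = RELATION_RE.match(line.strip())
    if m is None:
        return None
    # "vmx-vcpu-0:VM3, VM1" -> world "vmx-vcpu-0", names ["VM3", "VM1"]
    source_world, _, name_part = m.group("src").partition(":")
    names = [n.strip() for n in name_part.split(",") if n.strip()]
    return {
        "time": datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%fZ"),
        "pcpu": int(m.group("cpu")),
        "source_world": source_world,
        "names": names,
        "target": m.group("dst"),
        # Text before the first "-" in the target, e.g. "PVSCSI" or "LSI".
        "adapter": m.group("dst").split("-", 1)[0],
    }


SAMPLE_LINES = [
    "2019-06-26T23:51:42.491Z cpu26:2097396)WARNING: CpuSched: 996: "
    "Automatic relation removal from 2100712(vmx-vcpu-0:VM3, VM1) "
    "to 2100713(PVSCSI-2100577:1)",
    "2019-06-26T23:51:51.491Z cpu22:2097396)WARNING: CpuSched: 996: "
    "Automatic relation removal from 2100566(vmx-vcpu-0:VM2, VM1) "
    "to 2100567(LSI-2100430:0)",
]

if __name__ == "__main__":
    for line in SAMPLE_LINES:
        rec = parse_relation_warning(line)
        print(rec["time"].isoformat(), rec["names"], rec["adapter"])
```

Comparing the parsed timestamps against the time the VM actually froze and the time it was power cycled should show whether the warnings line up with the freeze or with the reset.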

I haven't changed the default vm cpu scheduler. Latest BIOS / AGESA.

Since these warnings always accompany a freeze I assume they are related, but I can't find any documentation of what they mean let alone how to resolve them.

Anyone have a clue?

Thanks,

LT

1 Reply
Memnarch
Enthusiast

Hi all-- in case anyone else ever runs into this--

The VM in question had a bad USB driver that was causing it to hang. The VMware warnings above were logged when rebooting the hung VM, not at the time of the hang itself.
