VMW 17 + Change vCPU on Win Server 2022 Guest = Guest BSOD

jeremym · ‎05-08-2023

Lashup of my issue:

1. Host is Windows 10 - latest - all patches, yada yada. Host running VMW 17.0.2.

2. Guest is Server 2022.

3. Guest has been running perfectly for months with 4 total processor cores (2 processors x 2 cores each).

4. I need a little boost. So I power off the guest machine. Up the processors to 4 processors x 2 cores each = 8 total processor cores. Start VM. Result = BSOD (see below): INACCESSIBLE_BOOT_DEVICE.

5. Kicker is that it also wrecks the disk chain. See pic 2.

6. No way to recover the guest except from nicely made backup. Zero snapshots work after this. If there are Zero snapshots it doesn't matter, the guest is dead anyway; no recovery.

7. 100% reproducible again and again on this VM. Reproducible when guest has snapshots OR guest has ZERO snapshots.

7b. Reproducible with ANY change to virtual CPU. Not just the desired increase.

8. Can change memory OK... but not processors.

9. NOTE: I am NOT trying to LIVE change the processors. Guest is nicely DOWN; THEN processors change.

Let me know what you'd like me to attach; and I'll do it for analysis. Thanks Team !

Technogeezer · ‎05-08-2023

Can you share info on the Windows 10 host hardware? How many physical (not hyper threaded) cores do you have?

- Paul (Technogeezer)
Editor of the Unofficial Fusion Companion Guides

jeremym · ‎05-09-2023

Host hardware is Lenovo P1. Processor is Intel Core i9 (yes, 9), model i9-10885H. Product sheet says 8 total cores. https://ark.intel.com/content/www/us/en/ark/products/203682/intel-core-i910885h-processor-16m-cache-...

RDPetruska · ‎05-09-2023

You can't starve the host... you need to keep AT LEAST 1 unused physical core for the host.
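For anyone checking their own numbers against that rule of thumb, here is a minimal sketch of the arithmetic. The function names and the default reserve of 1 core are illustrative assumptions, not anything VMware ships, and the physical core count has to come from the CPU's product sheet (hyper-threaded logical CPUs don't count).

```python
def max_guest_vcpus(physical_cores: int, reserved_for_host: int = 1) -> int:
    """Largest vCPU count that still leaves `reserved_for_host` physical cores unused."""
    if physical_cores < 1 or reserved_for_host < 0:
        raise ValueError("physical_cores must be >= 1 and reserved_for_host >= 0")
    return max(physical_cores - reserved_for_host, 0)


def leaves_room_for_host(requested_vcpus: int, physical_cores: int,
                         reserved_for_host: int = 1) -> bool:
    """True if the requested vCPU count respects the keep-one-core-free rule."""
    return requested_vcpus <= max_guest_vcpus(physical_cores, reserved_for_host)
```

On an 8-core host, `max_guest_vcpus(8)` gives 7, so a guest with 8 total processor cores would break the rule while one with 6 would not.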

jeremym · ‎05-09-2023

1. Same VM fails even if I increase from 4 to 6 total (and not utilize 8 like I desire). (See fail-vm1.png)

2. Different VM succeeds even if I increase from 4 to 8 total. (See success-vm1.png.)

Mikero · ‎05-11-2023

I've let our platform team know, but they're going to want to see logs.

For instance, we have 2 very different very low-level modes that we work under. One uses our entire API stack, one uses a combination of ours and Microsoft's Windows Hypervisor Platform. They have very different behavior. It's unclear which mode this failure scenario is under.

A 'vmware.log' file copied out when the VM has been powered off after the BSOD would be most helpful.
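If you want to peek at the log yourself before uploading it, a rough sketch along these lines can pull out matching lines. The helper name is made up for illustration, and the search string you pass in (for example "Monitor Mode") is an assumption on my part about what the log contains, not something confirmed in this thread.

```python
from typing import Iterable, List


def find_log_lines(lines: Iterable[str], needle: str) -> List[str]:
    """Return the log lines containing `needle` (case-insensitive), newlines stripped."""
    needle_lower = needle.lower()
    return [line.rstrip("\r\n") for line in lines if needle_lower in line.lower()]
```

Typical use would be to open the copied-out vmware.log, pass the file object as `lines`, and print whatever comes back.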

-
Michael Roy - Product Marketing Engineer: VCF

jeremym · ‎05-11-2023

You got it. Steps were:

1. Change processors.

2. Powered on.

3. Wait for Blue screen.

4. Power off.

5. Upload logs.

Thanks !

joshiga · ‎05-15-2023

@jeremym If possible can you please try editing your VM's configuration file

for the VM where you see BSOD. Add below flag:

monitor_control.disable_apichv = "TRUE"

and see if that helps.

Let us know how it goes.
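For those who would rather script the edit than open the .vmx in a text editor, here is a minimal sketch, assuming the file is plain `key = "value"` text and that the VM is powered off and the .vmx backed up first. The helper name `set_vmx_option` is hypothetical, and the match on the key is case-sensitive.

```python
import re


def set_vmx_option(vmx_text: str, key: str, value: str) -> str:
    """Return vmx_text with `key = "value"` set, replacing an existing line for that key."""
    new_line = f'{key} = "{value}"'
    # Match a whole line that assigns this key, without swallowing line endings.
    pattern = re.compile(rf'^[ \t]*{re.escape(key)}[ \t]*=[^\r\n]*', re.MULTILINE)
    if pattern.search(vmx_text):
        return pattern.sub(lambda _match: new_line, vmx_text)
    # Key not present: append it, adding a newline first if the text lacks one.
    separator = "" if (not vmx_text or vmx_text.endswith("\n")) else "\n"
    return f"{vmx_text}{separator}{new_line}\n"
```

Read the .vmx into a string, call `set_vmx_option(text, "monitor_control.disable_apichv", "TRUE")`, and write the result back to the same file.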

Thanks,

Gaurav

Mikero · ‎05-18-2023

As a quick update... the team (which @joshiga is a vital part of) thinks that there might be multiple bugs in play here.

That said, checking if the issue persists with virtualized APIC disabled is important because it helps us divide the line between different bugs. It's a fun combination of us, CPU and MS bugs all coming together for a big crash party, so we're trying to lay out the best path forward without pointing too many fingers or having to wait for OS or CPU Microcode updates.

monitor_control.disable_apichv = "TRUE"

-
Michael Roy - Product Marketing Engineer: VCF

jeremym · ‎05-18-2023

You got it.

So add that line to the VMX .. wait for crash again and upload logs?

Just making sure I understand the directive. Thx !

PS: I upgraded the machine to VMW17-VM level today just for fun to see if it helped. It didn't 🙂

-JM

Mikero · ‎05-18-2023

@jeremym wrote:

So add that line to the VMX .. wait for crash again and upload logs?

Precisely 🙂

-
Michael Roy - Product Marketing Engineer: VCF

jeremym · ‎05-21-2023

Ohhhhkay. Well that made it NOT CRASH. See results below. Also see attached LOG file as requested. Sorry for the little delay. Thanks Team !

PS: Can I keep running like this or is there some reason not to / performance impact? I would say coming up / booting appeared slower than usual, but I didn't make any scientific A:B test. LMK. Thx.

- Jeremy

jeremym · ‎05-25-2023

<bump>

Just a quick reminder I did as asked with the .VMX entry and it worked / didn't crash.

What does that tell you / does this help you?

And, is it safe to continue to run this way (until permanent fix comes around?)

joshiga · ‎05-30-2023

Hey @jeremym

Response inline for the queries:

What does that tell you / does this help you?

> The BSOD was caused by a Hyper-V interrupt delivery issue at the host level. We have reported this and MSFT folks have been working on the fix.

And, is it safe to continue to run this way (until permanent fix comes around?)

> Setting monitor_control.disable_apichv = "TRUE" disables the interrupt system performance optimization. It is safe to run with the workaround. There could be a performance degradation, which I think you have already noticed.
