VMware Communities
jeremym
Enthusiast

VMW 17 + Change vCPU on Win Server 2022 Guest = Guest BSOD

Summary of my issue:

1. Host is Windows 10 - latest - all patches, yada yada. Host running VMW 17.0.2.

2. Guest is Server 2022.

3. Guest has been running perfectly for months with 2 processors x 2 cores per processor = 4 total processor cores.

4. I need a little boost. So I power off the guest machine and raise the setting to 4 processors x 2 cores per processor = 8 total processor cores. Start VM. Result = BSOD (see below): INACCESSIBLE_BOOT_DEVICE.

5. Kicker is that it also wrecks the disk chain. See pic 2. 

6. No way to recover the guest except from a nicely made backup. No snapshot works after this. If the guest has no snapshots, it doesn't matter; the guest is dead anyway, with no recovery.

7. 100% reproducible, again and again, on this VM. Reproducible whether the guest has snapshots OR has ZERO snapshots.

7b. Reproducible with ANY change to the virtual CPU settings, not just the desired increase.

8. Can change memory OK... but not processors.

9. NOTE: I am NOT trying to LIVE change the processors. Guest is nicely DOWN; THEN processors change.

Let me know what you'd like me to attach; and I'll do it for analysis. Thanks Team !

[Attached image 1: jeremym_0-1683598750292.png]

[Attached image 2: jeremym_1-1683598793209.png]

13 Replies
Technogeezer
Immortal

Can you share info on the Windows 10 host hardware? How many physical (not hyper threaded) cores do you have? 

- Paul (Technogeezer)
Editor of the Unofficial Fusion Companion Guides
jeremym
Enthusiast

Host hardware is a Lenovo P1. Processor is an Intel Core i9 (yes, 9) i9-10885H CPU. Product sheet says 8 total cores. https://ark.intel.com/content/www/us/en/ark/products/203682/intel-core-i910885h-processor-16m-cache-... 

RDPetruska
Leadership

You can't starve the host... you need to keep AT LEAST 1 unused physical core for the host.
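To put numbers on that rule of thumb, here is a minimal Python sketch. The function names are made up for illustration, and the one-core reserve is only the guideline stated above, not a limit the product is known to enforce.

```python
def total_vcpus(processors: int, cores_per_processor: int) -> int:
    """Total processor cores a guest sees: processors x cores per processor."""
    return processors * cores_per_processor


def leaves_room_for_host(processors: int, cores_per_processor: int,
                         host_physical_cores: int,
                         reserved_for_host: int = 1) -> bool:
    """True if the guest's vCPU count leaves the reserved physical cores free.

    reserved_for_host defaults to 1, following the "keep at least 1 unused
    physical core" guideline; it is an assumption, not a hard rule.
    """
    guest = total_vcpus(processors, cores_per_processor)
    return guest <= host_physical_cores - reserved_for_host


if __name__ == "__main__":
    # 8 physical cores on the host, as reported for this laptop.
    for procs, cores in [(2, 2), (3, 2), (4, 2)]:
        ok = leaves_room_for_host(procs, cores, host_physical_cores=8)
        print(f"{procs} x {cores} = {total_vcpus(procs, cores)} vCPUs, "
              f"leaves room for host: {ok}")
```

By this check, 4 x 2 = 8 vCPUs on an 8-core host leaves nothing for the host, while 6 vCPUs still does.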

jeremym
Enthusiast

1. Same VM fails even if I increase from 4 to 6 total (rather than the 8 I want). (See fail-vm1.png)

2. Different VM succeeds even if I increase from 4 to 8 total. (See success-vm1.png.)

 

 

Mikero
Community Manager

I've let our platform team know, but they're going to want to see logs.

For instance, we have 2 very different very low-level modes that we work under. One uses our entire API stack, one uses a combination of ours and Microsoft's Windows Hypervisor Platform. They have very different behavior. It's unclear which mode this failure scenario is under.

A 'vmware.log' file copied out when the VM has been powered off after the BSOD would be most helpful.
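If it helps, copying the log out can be scripted. This is only a sketch: it assumes the log sits in the VM's own directory under the name vmware.log, and the function name and the timestamped naming scheme are my own choices, not anything the product requires.

```python
import shutil
import time
from pathlib import Path


def copy_vmware_log(vm_dir, dest_dir, log_name="vmware.log"):
    """Copy the VM's log to dest_dir under a timestamped name.

    Run this after the VM has been powered off following the BSOD, so the
    copied log covers the failed boot. Returns the path of the copy.
    """
    src = Path(vm_dir) / log_name
    if not src.is_file():
        raise FileNotFoundError(f"{src} not found; is this the VM's directory?")

    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Timestamp the copy so repeated crash tests do not overwrite each other.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = dest / f"{src.stem}-{stamp}{src.suffix}"
    shutil.copy2(src, target)
    return target
```

Call it with the folder that holds the VM's .vmx file and any destination folder you like; both paths are yours to fill in.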

-
Michael Roy - Product Marketing Engineer: VCF
jeremym
Enthusiast

You got it. Steps were:

1. Change processors.

2. Powered on.

3. Wait for Blue screen.

4. Power off.

5. Upload logs.

Thanks !

joshiga
Enthusiast

@jeremym If possible, can you please try editing the configuration file of the VM where you see the BSOD? Add the flag below:

  monitor_control.disable_apichv = "TRUE"

and see if that helps.

Let us know how it goes. 
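For anyone who would rather script the edit than open the .vmx file in a text editor, here is a minimal Python sketch. It assumes the file is plain text with one `key = "value"` entry per line; the helper name is hypothetical, and the sample content is only there to show the call. Make the change with the VM powered off, and keep a backup copy of the original file.

```python
import re


def set_vmx_option(vmx_text: str, key: str, value: str) -> str:
    """Return vmx_text with `key = "value"` set.

    An existing entry for the key is replaced in place; otherwise the
    entry is appended as a new last line.
    """
    new_line = f'{key} = "{value}"'
    # Match the whole line that assigns this key (leading spaces/tabs allowed).
    pattern = re.compile(rf'^[ \t]*{re.escape(key)}[ \t]*=.*$', re.MULTILINE)
    if pattern.search(vmx_text):
        # A function replacement avoids backslash processing in new_line.
        return pattern.sub(lambda _match: new_line, vmx_text)
    if vmx_text and not vmx_text.endswith("\n"):
        vmx_text += "\n"
    return vmx_text + new_line + "\n"


if __name__ == "__main__":
    sample = 'displayName = "Server 2022"\n'  # illustrative content only
    print(set_vmx_option(sample, "monitor_control.disable_apichv", "TRUE"))
```

To apply it to a real file, read the .vmx contents into a string, pass them through `set_vmx_option`, and write the result back.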

Thanks,

Gaurav

Mikero
Community Manager

As a quick update... the team (which @joshiga is a vital part of) thinks that there might be multiple bugs in play here. 

That said, checking whether the issue persists with virtualized APIC disabled is important because it helps us draw the line between the different bugs. It's a fun combination of our bugs, CPU bugs, and MS bugs all coming together for a big crash party, so we're trying to lay out the best path forward without pointing too many fingers or having to wait for OS or CPU microcode updates.

monitor_control.disable_apichv = "TRUE"

 

-
Michael Roy - Product Marketing Engineer: VCF
jeremym
Enthusiast

You got it.

So add that line to the VMX .. wait for crash again and upload logs?

Just making sure I understand the directive. Thx !

PS: I upgraded the machine to VMW17-VM level today just for fun to see if it helped. It didn't 🙂

-JM

Mikero
Community Manager


@jeremym wrote:

So add that line to the VMX .. wait for crash again and upload logs?


Precisely 🙂

-
Michael Roy - Product Marketing Engineer: VCF
jeremym
Enthusiast

Ohhhhkay. Well, that made it NOT CRASH. See results below. Also see the attached LOG file as requested. Sorry for the little delay. Thanks Team !

PS: Can I keep running like this, or is there some reason not to, such as a performance impact? Booting appeared slower than usual, but I didn't run any scientific A/B test. LMK. Thx. - Jeremy

[Attached image: jeremym_0-1684682903398.png]

jeremym
Enthusiast

<bump>

Just a quick reminder I did as asked with the .VMX entry and it worked / didn't crash.

What does that tell you / does this help you?

And, is it safe to continue to run this way (until permanent fix comes around?)

joshiga
Enthusiast

Hey @jeremym 

 

Response inline for the queries:

What does that tell you / does this help you?

> The BSOD was caused by a Hyper-V interrupt delivery issue at the host level. We have reported this, and the MSFT folks have been working on the fix.

And, is it safe to continue to run this way (until permanent fix comes around?)

> Setting monitor_control.disable_apichv = "TRUE" disables the interrupt system performance optimization (virtualized APIC). It is safe to run with the workaround. There could be a performance degradation, which I think you have already noticed.

 
