VMware Cloud Community
kolibri76
Contributor

Upgrade ESXi 7.0.1 - Fatal CPU mismatch

I tried to upgrade my homelab host from ESXi 7.0.0 (Build 16324942) to ESXi 7.0.1 (Build 17168206) and got the attached purple screen after rebooting the host. The CPU is an Intel Atom C2750, which should still be supported according to the HCL.

Any ideas?

[Attachment: Screenshot 2020-11-30 at 18.30.26.png]

 

best regards

Martin

 

asajm
Expert

Hi @kolibri76 

Kindly check the VMware Compatibility Guide.

If you think your query has been answered, please mark this response as "Solution" or give it a "Kudo".
ASAJM
ashilkrishnan
VMware Employee

Hi @kolibri76 ,

The CPU appears to be compatible with 7.0 and 7.0.1 per the VMware HCL. The failure could be due to other system hardware or a CPU feature that is not supported. It's worth checking the BIOS for any such features.

If a rollback option is offered, try rolling back to the original version and then upgrading to 7.0 instead of 7.0.1 to see whether it reports a similar issue.

kolibri76
Contributor

Hi @ashilkrishnan 

Thank you for your help! The automatic rollback to 7.0 works (it just requires a reboot), and 7.0 runs without a problem.

Which exact feature in the BIOS should I look for? I must admit I have no idea what to check or where I could see whether there is an incompatibility.

I look forward to your response.

bluefirestorm
Champion

From the error it looks like cores 1 through 7 returned a different value for a register than core 0 did. Most likely this is a CPUID instruction call or a read of an MSR for CPU features. I don't know why core 0 would have a different reading from cores 1 through 7.

Fatal CPU mismatch on feature "Intel processor platform type identifier"; cpu7 value = 0x1004195c, but cpu0 value = 0x10041a5c

Basically two bits were flipped: the third hex digit from the right is 0x9 (binary 1001) on cpu7 but 0xa (binary 1010) on cpu0.
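To make the comparison concrete, here is a minimal Python sketch that XORs the two values quoted in the PSOD message and lists the bit positions that differ. It is only an illustration of the arithmetic, not anything ESXi runs; the helper name `differing_bits` is my own.

```python
# Values copied from the PSOD message quoted above.
CPU0_VALUE = 0x10041A5C
CPU7_VALUE = 0x1004195C


def differing_bits(a, b):
    """Return the bit positions (0 = least significant) where a and b differ."""
    diff = a ^ b  # XOR leaves a 1 exactly where the two inputs disagree
    return [pos for pos in range(diff.bit_length()) if (diff >> pos) & 1]


if __name__ == "__main__":
    print(hex(CPU0_VALUE ^ CPU7_VALUE))           # 0x300
    print(differing_bits(CPU0_VALUE, CPU7_VALUE))  # [8, 9]
```

So the two values differ only in bits 8 and 9; everything else in the reported value is identical.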

After rolling back to ESXi 7.0, is the microcode still version 0x12d? It might be a microcode/BIOS update problem where the update somehow didn't cover all cores.

I think you can check the microcode level from the ESXi command line, or open the vmware.log of any VM and look for "ucode".

kolibri76
Contributor

@bluefirestorm 

Yes, the current microcode version is 0x12d for all cores. But there is a difference between cpu0 and cpu1-7:

[root@xxxxx:~] vsish -e cat /hardware/cpu/cpuList/0 | grep -i -E 'family|model|stepping|microcode|revision'
   Family:0x06
   Model:0x4d
   Stepping:0x08
   Number of microcode updates:1
   Original Revision:0x00000121
   Current Revision:0x0000012d
[root@xxxxx:~] vsish -e cat /hardware/cpu/cpuList/1 | grep -i -E 'family|model|stepping|microcode|revision'
   Family:0x06
   Model:0x4d
   Stepping:0x08
   Number of microcode updates:0
   Original Revision:0x0000012d
   Current Revision:0x0000012d
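
For anyone who wants to compare more than two cores, here is a minimal Python sketch that diffs the fields of two outputs in the format shown above. It assumes the `vsish` output has been copied off the host and pasted in as text; the function names `parse_fields` and `mismatched_fields` are hypothetical helpers, not part of any VMware tool.

```python
# Output lines pasted from the two vsish commands above (prompt lines omitted).
CPU0_OUTPUT = """
   Family:0x06
   Model:0x4d
   Stepping:0x08
   Number of microcode updates:1
   Original Revision:0x00000121
   Current Revision:0x0000012d
"""

CPU1_OUTPUT = """
   Family:0x06
   Model:0x4d
   Stepping:0x08
   Number of microcode updates:0
   Original Revision:0x0000012d
   Current Revision:0x0000012d
"""


def parse_fields(text):
    """Parse 'Key:value' lines into a dict, skipping lines without a colon."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def mismatched_fields(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


if __name__ == "__main__":
    diff = mismatched_fields(parse_fields(CPU0_OUTPUT), parse_fields(CPU1_OUTPUT))
    for key, (cpu0, cpu1) in diff.items():
        print(f"{key}: cpu0={cpu0} cpu1={cpu1}")
```

With the two outputs above, only "Number of microcode updates" and "Original Revision" differ; the current revision is the same on both cores.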

DESteffen
Contributor

Any news about this fatal CPU mismatch?

I have the same problem with my Atom homelab, and the Atom CPU is on the HCL.

Regards, Steffen

cjnot
Contributor

I am experiencing the same issue - Intel Atom C2758 - Verified on HCL - 7.0 has been working fine - Upgrade to 7.0.1 via vCenter fails with similar purple screen. 

 

DESteffen
Contributor

I have opened a ticket with Supermicro too. They will look at the microcode, but maybe the problem lies in the ESXi query; so far nobody from VMware has explained anything about this message ...

cjnot
Contributor

I tried taking the latest BIOS from Supermicro on a test machine in the lab, but the microcode still differs between core 0 and 1-7. 

Every C2700 series machine I've checked also has this microcode behavior, so I'm guessing that it's going to get acknowledged as a bug. This processor is extremely common and on the HCL for vSphere 7, so I'm guessing we will see a patch soon.

DESteffen
Contributor

Update ESXi-7.0U1d-17551050 is out with microcode updates. Has anybody tested it against an Atom C2700 series CPU?

vbondzio
VMware Employee

I'm _fairly_ sure this is a BIOS issue, and it correlates with changes where we tightened up the test between GA and U1 ... Can you try booting with "microcodeUpdate=FALSE"? I can't check right now whether we try that before the HW test.

cjnot
Contributor

I found two things out related to this issue:

1.  The Supermicro systems we are running (i.e. SYS-5018A-FTN4) have an HCL-supported processor (Atom C2758), but the motherboard that comes with this system is apparently now only supported up to vSphere 6 (https://www.vmware.com/resources/compatibility/search.php?deviceCategory=server&productid=43366&devi...)

2.  Supermicro doesn't seem to see this as an issue and won't be releasing a firmware update to fix it.

Is there a way to disable the microcode version check at boot time for these systems? 

 

DESteffen
Contributor

Yes, shame on Supermicro.

I opened a ticket too, but Supermicro will not update the BIOS to get the board working again with ESXi 7.0U1!

vbondzio
VMware Employee

This isn't just a microcode check, it tests the uniformity of all CPUs and I don't see a way to disable that. You might get lucky downgrading the BIOS to a version that doesn't update the CPU microcode (since ucode is always layered on power on, you can't "write" on silicon). If that fails, run the latest version that works and newer ones as nested instances ...

cjnot
Contributor

I didn't think I would have another update on this issue, but I stumbled on something related to it.

I have been running the 7.0.0 Build 16324942 code for some months now on these machines with no issue. 

Under this release, the microcode versions are correct and consistent on all cores, as shown in the output below. Only when I try to move to 7.0.1 does the microcode version become inconsistent. This now leads me to believe that the problem is not BIOS related, but related to the microcode update done by ESXi.

I tried catching one at the first boot after upgrade from 7.0.0->7.0.1 and inserting "microcodeUpdate=FALSE" at the end of the boot string, but I am still getting a purple screen showing microcode mismatch.

Is there another way to prevent/disable the microcode update at boot time?

[root@esxi3:~] vmware -l
VMware ESXi 7.0 GA
[root@esxi3:~] vsish -e cat /hardware/cpu/cpuList/0 | grep -i -E 'family|model|stepping|microcode|revision'
   Family:0x06 
   Model:0x4d 
   Stepping:0x08 
   Number of microcode updates:0
   Original Revision:0x0000012d
   Current Revision:0x0000012d
[root@esxi3:~] vsish -e cat /hardware/cpu/cpuList/1 | grep -i -E 'family|model|stepping|microcode|revision'
   Family:0x06 
   Model:0x4d 
   Stepping:0x08 
   Number of microcode updates:0
   Original Revision:0x0000012d
   Current Revision:0x0000012d
[root@esxi3:~] vsish -e cat /hardware/cpu/cpuList/2 | grep -i -E 'family|model|stepping|microcode|revision'
   Family:0x06 
   Model:0x4d 
   Stepping:0x08 
   Number of microcode updates:0
   Original Revision:0x0000012d
   Current Revision:0x0000012d
[root@esxi3:~] vsish -e cat /hardware/cpu/cpuList/3 | grep -i -E 'family|model|stepping|microcode|revision'
   Family:0x06 
   Model:0x4d 
   Stepping:0x08 
   Number of microcode updates:0
   Original Revision:0x0000012d
   Current Revision:0x0000012d
[root@esxi3:~] vsish -e cat /hardware/cpu/cpuList/4 | grep -i -E 'family|model|stepping|microcode|revision'
   Family:0x06 
   Model:0x4d 
   Stepping:0x08 
   Number of microcode updates:0
   Original Revision:0x0000012d
   Current Revision:0x0000012d
[root@esxi3:~] vsish -e cat /hardware/cpu/cpuList/5 | grep -i -E 'family|model|stepping|microcode|revision'
   Family:0x06 
   Model:0x4d 
   Stepping:0x08 
   Number of microcode updates:0
   Original Revision:0x0000012d
   Current Revision:0x0000012d
[root@esxi3:~] vsish -e cat /hardware/cpu/cpuList/6 | grep -i -E 'family|model|stepping|microcode|revision'
   Family:0x06 
   Model:0x4d 
   Stepping:0x08 
   Number of microcode updates:0
   Original Revision:0x0000012d
   Current Revision:0x0000012d
[root@esxi3:~] vsish -e cat /hardware/cpu/cpuList/7 | grep -i -E 'family|model|stepping|microcode|revision'
   Family:0x06 
   Model:0x4d 
   Stepping:0x08 
   Number of microcode updates:0
   Original Revision:0x0000012d
   Current Revision:0x0000012d

 

vbondzio
VMware Employee

Is that the same host that fails when updated to U1? If yes, are you currently booting it with "microcodeUpdate=FALSE"? Can you upload the /var/log/boot.gz from one boot with and one without on the pre U1 build?

mchaker
VMware Employee

I'm glad I found this thread!

I'm facing the exact same issue.

cat /proc/cpuinfo (on Linux, which works) shows:

vendor_id       : GenuineIntel
cpu family      : 6
model           : 76
model name      : Intel(R) Atom(TM) x5-Z8350  CPU @ 1.44GHz
stepping        : 4
microcode       : 0x411
cpu MHz         : 1066.913
cache size      : 1024 KB
physical id     : 0
siblings        : 4
core id         : 3
cpu cores       : 4
apicid          : 6
initial apicid  : 6
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 movbe popcnt tsc_deadline_timer aes rdrand lahf_lm 3dnowprefetch epb pti ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid tsc_adjust smep erms dtherm ida arat md_clear
bugs            : cpu_meltdown spectre_v1 spectre_v2 mds msbds_only
bogomips        : 2880.00
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual

 

I've tried enabling "Limit CPUID Max" in the BIOS and the system wouldn't boot past the BIOS, so I had to reset the BIOS.

Seeing that the CPU works in Windows and Linux, I think this is an ESXi issue, as the thread also points to.

In my case, I can't even boot into the ESXi 7.0.1 installer, so I don't think I can get /var/log/boot.gz.

Is there other information that would help?

vbondzio
VMware Employee

> Seeing that the CPU works in Windows and Linux, I think this is an ESXi issue (...)

One could also argue that ESXi is more thorough in comparing and testing on boot, given that this started to show in 7.0 U1 when we tightened up the checks. I'm not saying there isn't a chance that this is indeed ESXi misbehaving but I guess my default assumption is innocent until proven guilty 🙂

Given our shared employer, I'll ping you internally for logs that could help.

TimMann
VMware Employee

This is actually a bug. Intel specifies only bits 52:50 of MSR 0x17 to be architectural. The other bits are reserved, so the vmkernel shouldn't insist that those bits be the same across all cpus. In the PSOD screen, only some of the lower-order bits differ. Bits 52:50 are zero in both.

Note that MSR 0x17 is not the microcode revision. It's the platform ID; see Intel documentation for details.
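
As an illustration of the difference between comparing the whole register and comparing only the architectural field, here is a small Python sketch using the two values from the PSOD. It assumes the values printed on the PSOD are the complete MSR contents, and the names (`PLATFORM_ID_MASK`, `platform_id`, `uniform`) are made up for this example; this is not the vmkernel code.

```python
# Values copied from the PSOD message earlier in the thread.
CPU0_VALUE = 0x10041A5C
CPU7_VALUE = 0x1004195C

PLATFORM_ID_SHIFT = 50
PLATFORM_ID_MASK = 0b111 << PLATFORM_ID_SHIFT  # bits 52:50 of MSR 0x17
FULL_MASK = (1 << 64) - 1                      # every bit of the 64-bit MSR


def platform_id(msr_value):
    """Extract the 3-bit platform ID field (bits 52:50)."""
    return (msr_value & PLATFORM_ID_MASK) >> PLATFORM_ID_SHIFT


def uniform(values, mask):
    """True if all values agree on the bits selected by mask."""
    return len({v & mask for v in values}) == 1


if __name__ == "__main__":
    values = [CPU0_VALUE, CPU7_VALUE]
    print(platform_id(CPU0_VALUE), platform_id(CPU7_VALUE))  # 0 0
    print(uniform(values, FULL_MASK))         # False: reserved bits differ
    print(uniform(values, PLATFORM_ID_MASK))  # True: architectural bits match
```

Comparing all 64 bits reports a mismatch, while comparing only bits 52:50 does not.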

You can boot with cpuUniformityHardCheckPanic=FALSE to work around this.

Thanks to Valentin for alerting me to it.
