microlytix
Enthusiast

strange thermal issue after patching to ESXi 7.0b

Has anybody else experienced CPU thermal issues after patching to the latest ESXi 7.0b build (16324942)?

I'm using a Supermicro SYS-E300-9D-8CN8TP with a Xeon D-2146NT and BIOS 1.3.

The CPU temperature goes berserk during ESXi boot. The same host booted from an Ubuntu live CD is fine.

Four hosts are affected: three with a critical temperature above 98 °C and one above 80 °C. The problem occurred right after patching and rebooting the hosts. The cluster was fine on 7.0 GA.

I noticed there was a microcode update included.

Read details on findings here: https://www.elasticsky.de/en/2020/07/strange-thermal-issue-after-update-to-esxi-7-0b/

I guess it's system specific. Yet the E300-9D is HCL compliant.

Can anyone confirm?

TIA

Michael

blog: https://www.elasticsky.de/en
22 Replies
Kev_Johnson
Enthusiast

I've 2 hosts, running identical BIOS. One does this (hitting 99 degrees), the other seems to be fine. I've opened a ticket with Supermicro, let's see what they say...

Kev_Johnson
Enthusiast

Just to keep things in the public eye: Supermicro's initial response was to push back and state that they haven't validated this model for vSphere. However, I've politely pointed them to the VMware Compatibility Guide (System Search). My gut tells me that we need a CPU microcode fix via an updated BIOS, but I need to validate the version running on the host a little later today when I have some free time. I'll keep y'all updated here :)

microlytix
Enthusiast

Yes, the microcode patch is the hottest candidate 😈 to cause trouble.

BIOS version 1.3 is the latest available from Supermicro, and it's also the version required by the HCL.

Kev_Johnson
Enthusiast

The SG bulletin also includes the CPU microcode, so I suspect we'll see the same issues. I'm seeing microcode version 0x20000069 if I boot Ubuntu 20.04 livecd, but I'm not sure how to check if that is loaded by the BIOS or the kernel. Will try an older Ubuntu (maybe 16.04?) and check to see what the microcode shows as there.

Kev_Johnson
Enthusiast

Checking in an older version of Ubuntu shows the same microcode, so I guess it must be coming from the BIOS.
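
One way to tell the two cases apart on Linux, as a minimal sketch: the kernel logs a line when it applies a microcode update itself, so a boot log without such a line suggests the running revision came from the BIOS. The helper name `microcode_source` is made up for illustration, and the matched wording ("updated") is an assumption about the kernel's log messages; it may differ between kernel versions.

```shell
# Hypothetical helper: reads kernel log text on stdin and prints "kernel"
# if the kernel reports applying a microcode update, otherwise "bios".
microcode_source() {
    if grep -qi 'microcode.*updated'; then
        echo kernel
    else
        echo bios
    fi
}

# Typical use on the live system (reading the kernel log may require root):
#   dmesg | microcode_source
#   grep -m1 microcode /proc/cpuinfo   # revision currently running
```

If the result is "bios" on both an old and a new live CD, that matches the conclusion above that the revision is supplied by the BIOS.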

Kev_Johnson
Enthusiast

Further updates: I checked esxtop, and saw that all CPU cores were idle. CPU temp jumps from 66 degrees to 98 degrees as soon as the loadscreen flips from loading the modules to initialising VMkernel.

And now for the interesting part: I forgot to shut down at the end of the last boot, and went for lunch. On returning from lunch I saw that the CPU temps were more or less normal, so it seems that something very strange is happening - it was at 98 degrees for at least an hour before I went to lunch, so although esxtop shows no load, the CPU is still kicking out a lot of heat. Time to grab something to graph the temperature, I think.

microlytix
Enthusiast

That's an interesting observation!

Either there's a process that takes an hour to finish (once), or it's going to happen again after each reboot.

I didn't dare to roast my CPU that long. 🙂

peetz
Leadership

So is the CPU really getting that hot, or is the temperature sensor perhaps reporting wrong data?

Not sure how to check this ...

Twitter: @VFrontDe, @ESXiPatches | https://esxi-patches.v-front.de | https://vibsdepot.v-front.de
microlytix
Enthusiast

Good point.

I will try to verify this.

But I've noticed the fans rotating at max speed and blowing a lot of warm air out of the case. So I guess the temperature is real.

Also, the boot process with 7.0b is much slower than a regular boot with 7.0 GA. Maybe the system throttled the CPU clock speed to protect the CPU from damage.

Kev_Johnson
Enthusiast

Frustratingly, despite leaving the box running at that temperature for 3 hours, it doesn't seem to have calmed down. I'm putting that down to "one of those things".

I've rolled both boxes back to GA and all is fine for now. I'm now trying to figure out which SNMP MIBs are required to monitor the thermal sensors on these boxes; the Supermicro docs are not great.

RazvanConstanti
Contributor

Same issue with a Supermicro board X11SDV-8C-TP8F

Spent a lot of time debugging this to come to the same conclusion.

Update 16324942 messed something up.

Strangely, 90% of the time when the system starts to boot and the sensor on my Supermicro goes red, I just unplug the PSU for several seconds, plug it back and with a bit of luck, ESXi will load just fine.

Opened a ticket with Supermicro also.

later edit:

Got a reply from Supermicro pointing out that my system has not been tested with any 7.0 release and that they cannot help.

Which is true: VMware Compatibility Guide - System Search

Then again, it has been tested with 6.7 U3 VMware Compatibility Guide - System Search

Even later edit:

just realized SYS-E300-9D-8CN8TP is using the same motherboard as my SYS-5019D-FN8TP

If SYS-E300-9D-8CN8TP has been tested with 7.0 I wonder why my SYS-5019D-FN8TP is not listed as tested.

What are they testing? The server case? /s

Kev_Johnson
Enthusiast

In the interest of gathering all of the relevant info in one location, would you mind sharing your SM support ticket number, RazvanConstantin? DM would be fine :)

I'm trying to get the support team at Supermicro to see that this is a wider issue, not just my 2 boxes.

Han
Enthusiast

I too have the same issue with the same system and have opened a case with Supermicro.

Strange thing is I own 2 of these systems and only one has this issue.

RazvanConstanti
Contributor

In case you are still interested, I found the OID for the CPU temperature:

1.3.6.1.4.1.21317.1.3.1.2.1
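
A minimal sketch of how this OID could be polled to log the temperature over time, assuming the BMC answers SNMP v2c queries and the net-snmp `snmpget` tool is installed on the monitoring machine. The function names, the host name and the community string are placeholders, and whether the value comes back as plain degrees Celsius is an assumption to verify against the IPMI web UI.

```shell
CPU_TEMP_OID=1.3.6.1.4.1.21317.1.3.1.2.1   # OID quoted in the post above

# Hypothetical helper: print the bare sensor value for one BMC.
# -Oqv makes snmpget print only the value, without OID or type.
# Usage: read_cpu_temp <bmc-host> <community>
read_cpu_temp() {
    snmpget -v2c -c "$2" -Oqv "$1" "$CPU_TEMP_OID"
}

# Hypothetical helper: print "epoch-seconds,value" lines for later graphing.
# Usage: log_cpu_temp <bmc-host> <community> <interval-seconds> <samples>
log_cpu_temp() {
    n=0
    while [ "$n" -lt "$4" ]; do
        printf '%s,%s\n' "$(date +%s)" "$(read_cpu_temp "$1" "$2")"
        n=$((n + 1))
        # Pause between samples, but not after the last one.
        if [ "$n" -lt "$4" ]; then
            sleep "$3"
        fi
    done
}
```

For example, `log_cpu_temp bmc.example.lan public 10 360 > cpu_temp.csv` would record one sample every 10 seconds for roughly an hour, which is enough to cover a boot and the period afterwards.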

Kev_Johnson
Enthusiast

Thanks!

I've had a response from Supermicro support: they've now flagged this to the product manager, so hopefully we'll see some form of more helpful response soon.

TimMann
VMware Employee

Can you also file an SR with VMware if you're entitled to VMware support?

You can check the microcode version that ESXi loaded with this command from an ESXi console shell:

  vsish -e get /hardware/cpu/cpuList/0 | grep Revision

"Original Revision" is what ESXi saw at boot time (usually loaded by the BIOS), and "Current Revision" is what's running now, possibly loaded by ESXi.

I've looked and can't find any microcode issues documented by Intel with symptoms similar to what's described in this thread. For Skylake-D, I believe the latest microcode that's publicly released as of today is 0x02006906. ESXi 7.0p01 includes the slightly older 0x02006901. Version 0x02000069 is older still.
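
The ordering of those revisions can be checked by comparing them as plain hexadecimal numbers. A minimal sketch in POSIX shell; the function name `ucode_newer` is made up for illustration, and it assumes both values are written with a `0x` prefix.

```shell
# Hypothetical helper: succeeds (exit status 0) if revision $1 is newer than $2.
# Shell arithmetic accepts 0x-prefixed hexadecimal constants directly.
ucode_newer() {
    [ "$(( $1 ))" -gt "$(( $2 ))" ]
}

# The three Skylake-D revisions mentioned above, compared pairwise.
ucode_newer 0x02006906 0x02006901 && echo "0x02006906 is newer than 0x02006901"
ucode_newer 0x02006901 0x02000069 && echo "0x02006901 is newer than 0x02000069"
```

The same comparison could be applied to the "Original Revision" and "Current Revision" values reported by the vsish command: if the current one is newer, ESXi loaded its own microcode on top of what the BIOS provided.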

You can try having ESXi not load any microcode by running this command and then doing a clean shutdown/reboot:

  esxcli system settings kernel set -s microcodeUpdate -v false

Han
Enthusiast

After using "esxcli system settings kernel set -s microcodeUpdate -v false" and rebooting, the CPU temperature is normal.

Han
Enthusiast

Supermicro sent me a new pre-release BIOS and that solves the problem.

Microcode version in that BIOS is: 0x02006906

TimMann
VMware Employee

The next ESXi release will package 0x02006906. However, the fix from Supermicro might have been elsewhere in the BIOS -- it's possible that using 0x02006906 with an old BIOS might still cause the same problem, or it may not. That's unclear from what little information I have. So updating your BIOS to the latest from Supermicro is the best thing to do.