Did anybody else experience CPU thermal issues after patching to the latest ESXi 7.0b build (16324942)?
I'm using a Supermicro SYS-E300-9D-8CN8TP with a Xeon D-2146NT, BIOS 1.3.
CPU temperature goes berserk during ESXi boot. The same host booted from an Ubuntu live CD is fine.
4 hosts affected: 3 with critical temperature >98C and one >80C. The problem occurred right after patching and rebooting the hosts. The cluster was fine on 7.0 GA.
I noticed there was a microcode update included.
Read details on findings here: https://www.elasticsky.de/en/2020/07/strange-thermal-issue-after-update-to-esxi-7-0b/
I guess it's system specific. Yet the E300-9D is HCL compliant.
Can anyone confirm?
Just to keep things in the public eye: Supermicro's initial response was to push back and state that they haven't validated this model for vSphere; however, I've politely pointed them to the VMware Compatibility Guide - System Search. My gut tells me that we need a CPU microcode fix via an updated BIOS, but I need to validate the version running on the host a little later today when I have some free time. I'll keep y'all updated here.
Yes, the microcode patch is the hottest candidate 😈 to cause trouble.
BIOS version 1.3 is the latest available from Supermicro, and it's also the version required by the HCL.
The SG bulletin also includes the CPU microcode, so I suspect we'll see the same issues. I'm seeing microcode version 0x20000069 if I boot Ubuntu 20.04 livecd, but I'm not sure how to check if that is loaded by the BIOS or the kernel. Will try an older Ubuntu (maybe 16.04?) and check to see what the microcode shows as there.
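On the "BIOS or kernel" question, here is a minimal sketch of how one might check from the live CD. It assumes the kernel logs a line containing "microcode updated early" when it applies an update itself; the function names `microcode_source` and `current_microcode` are made up for illustration.

```shell
# Sketch: guess where the running microcode came from, based on kernel log
# text read from stdin. Assumption: the Linux early loader logs a line
# containing "microcode updated early" when it applies an update itself.
microcode_source() {
    if grep -qi 'microcode updated early'; then
        echo "kernel"
    else
        echo "bios-or-unknown"
    fi
}

# Print the microcode revision Linux reports for the first CPU.
# $1 is an optional path, defaulting to /proc/cpuinfo.
current_microcode() {
    grep -m1 '^microcode' "${1:-/proc/cpuinfo}" | awk '{print $NF}'
}
```

Usage would be something like `dmesg | microcode_source` followed by `current_microcode`. If the first prints "kernel", the revision in /proc/cpuinfo is probably not what the BIOS loaded. Booting the live CD with the `dis_ucode_ldr` kernel parameter should keep the kernel from loading microcode at all, so the revision shown then would be the BIOS one.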
Further updates: I checked esxtop, and saw that all CPU cores were idle. CPU temp jumps from 66 degrees to 98 degrees as soon as the loadscreen flips from loading the modules to initialising VMkernel.
And now for the interesting part: I forgot to shut down at the end of the last boot, and went for lunch. On returning from lunch I saw that the CPU temps were more or less normal, so it seems that something very strange is happening - it was at 98 degrees for at least an hour before I went to lunch, so although esxtop shows no load, the CPU is still kicking out a lot of heat. Time to grab something to graph the temperature, I think.
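For the graphing part, a rough sketch of a logger that could run on another machine and poll the BMC over IPMI. This is an assumption-heavy example, not something tested on these boxes: it assumes `ipmitool` is installed, that `ipmitool sdr type Temperature` prints pipe-separated rows with the reading in the fifth column, and that the sensor is named "CPU Temp". `READ_CMD`, `cpu_temp_from_sdr` and `log_temps` are hypothetical names.

```shell
# Command that prints the temperature sensors. Assumption: for a remote BMC
# you would extend this with ipmitool's -I lanplus -H <host> -U <user> options.
READ_CMD="${READ_CMD:-ipmitool sdr type Temperature}"

# Extract the numeric reading for one sensor from SDR-style rows on stdin,
# e.g. "CPU Temp | 01h | ok | 3.1 | 66 degrees C" (assumed format).
cpu_temp_from_sdr() {
    awk -F'|' -v name="${1:-CPU Temp}" '
        { gsub(/^ +| +$/, "", $1) }            # trim the sensor name
        $1 == name {
            split($5, f, " ")                  # "66 degrees C" -> f[1] = 66
            if (f[1] ~ /^[0-9.]+$/) print f[1] # skip "No Reading" rows
            exit
        }
    '
}

# Append "epoch_seconds,temp_c" rows to a CSV file.
# $1 = output file, $2 = interval in seconds (default 10),
# $3 = number of samples (default 0 = run until interrupted).
log_temps() {
    out="$1"; interval="${2:-10}"; count="${3:-0}"
    i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        t=$($READ_CMD | cpu_temp_from_sdr)
        echo "$(date +%s),$t" >> "$out"
        i=$((i + 1))
        sleep "$interval"
    done
}
```

Something like `log_temps boot-temps.csv 5` started just before the reboot would give a CSV that any spreadsheet or gnuplot can graph, which should show whether the jump really lines up with VMkernel initialisation.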
That's an interesting observation!
Either there's a process that takes an hour to finish (once), or it's going to happen again after each reboot.
I didn't dare to roast my CPU that long. 🙂
So is the CPU really getting that hot, or is maybe the temperature sensor reporting wrong data?
Not sure how to check this ...
I will try to verify this.
But I've noticed the fans rotating at max speed and blowing a lot of warm air out of the case. So I guess the temperature is real.
Also, the boot process with 7.0b is much slower than a regular boot with 7.0 GA. Maybe the system throttled the CPU clock speed to protect the CPU from damage.
Frustratingly, despite leaving the box running at that temperature for 3 hours, it doesn't seem to have calmed down. I'm putting that down to "one of those things".
I've rolled both boxes back to GA and all is fine for now. I'm trying to figure out the SNMP MIBs required to monitor the thermal sensors on these boxes, as the Supermicro docs are not great.
Same issue with a Supermicro board X11SDV-8C-TP8F
Spent a lot of time debugging this to come to the same conclusion.
Update 16324942 messed something up.
Strangely, 90% of the time when the system starts to boot and the sensor on my Supermicro goes red, I just unplug the PSU for several seconds, plug it back and with a bit of luck, ESXi will load just fine.
Opened a ticket with Supermicro also.
Got a reply from Supermicro pointing out that my system has not been tested with any 7.0 release and that they cannot help.
Which is true: VMware Compatibility Guide - System Search
Then again, it has been tested with 6.7 U3 VMware Compatibility Guide - System Search
Even later edit:
just realized SYS-E300-9D-8CN8TP is using the same motherboard as my SYS-5019D-FN8TP
If SYS-E300-9D-8CN8TP has been tested with 7.0 I wonder why my SYS-5019D-FN8TP is not listed as tested.
What are they testing? The server case? /s
In the interests of gathering all of the relevant info in one location would you mind sharing your SM support ticket number, RazvanConstantin ? DM would be fine
I'm trying to get the support team at Supermicro to see that this is a wider issue, not just my 2 boxes.
Can you also file an SR with VMware if you're entitled to VMware support?
You can check the microcode version that ESXi loaded with this command from an ESXi console shell:
vsish -e get /hardware/cpu/cpuList/0 | grep Revision
"Original Revision" is what ESXi saw at boot time (usually loaded by the BIOS), and "Current Revision" is what's running now, possibly loaded by ESXi.
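To make the comparison explicit, here is a small sketch that reads the output of the command above and reports whether the revision changed after boot. It assumes each line carries the label followed by a `0x...` value; the exact vsish output layout is an assumption, and `revision_of` and `microcode_report` are hypothetical helper names.

```shell
# Pull the first 0x... value that follows a given label.
# $1 = label text; input is read from stdin.
revision_of() {
    sed -n "s/.*$1[^0-9]*\(0x[0-9a-fA-F]*\).*/\1/p" | head -n 1
}

# Read "... Revision ..." lines on stdin and say whether the microcode
# running now differs from what was present at boot.
microcode_report() {
    text=$(cat)
    orig=$(printf '%s\n' "$text" | revision_of 'Original Revision')
    cur=$(printf '%s\n' "$text" | revision_of 'Current Revision')
    if [ -z "$orig" ] || [ -z "$cur" ]; then
        echo "could not parse revisions"
        return 1
    fi
    if [ "$orig" = "$cur" ]; then
        echo "unchanged since boot: $cur"
    else
        echo "updated after boot: $orig -> $cur"
    fi
}
```

On a host this would be run as `vsish -e get /hardware/cpu/cpuList/0 | grep Revision | microcode_report`. An "updated after boot" result suggests ESXi loaded newer microcode than the BIOS provided.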
I've looked and can't find any microcode issues documented by Intel with symptoms similar to what's described in this thread. For Skylake-D, I believe the latest microcode that's publicly released as of today is 0x02006906. ESXi 7.0p01 includes the slightly older 0x02006901. Version 0x02000069 is still a little older.
You can try having ESXi not load any microcode by running this command and then doing a clean shutdown/reboot:
esxcli system settings kernel set -s microcodeUpdate -v false
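If it helps, the same setting can be wrapped so it is easy to inspect and to revert once testing is done. This is only a sketch: the `ESXCLI` variable and the two function names are made up for illustration, and it assumes `esxcli system settings kernel list -o microcodeUpdate` shows the configured and runtime values on your build.

```shell
# Allow the esxcli binary to be overridden (e.g. for a dry run); hypothetical knob.
ESXCLI="${ESXCLI:-esxcli}"

# Show the current state of the microcodeUpdate kernel setting.
show_microcode_update() {
    "$ESXCLI" system settings kernel list -o microcodeUpdate
}

# Set microcodeUpdate to true or false. The change only takes effect
# after a clean shutdown/reboot.
set_microcode_update() {
    case "$1" in
        true|false)
            "$ESXCLI" system settings kernel set -s microcodeUpdate -v "$1"
            ;;
        *)
            echo "usage: set_microcode_update true|false" >&2
            return 2
            ;;
    esac
}
```

A test cycle would then be `set_microcode_update false`, reboot, watch the temperature, and afterwards `set_microcode_update true` to return to the default behaviour. Leaving microcode loading disabled permanently also skips the security fixes that the microcode carries, so treat it as a diagnostic step rather than a fix.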
The next ESXi release will package 0x02006906. However, the fix from Supermicro might have been elsewhere in the BIOS -- it's possible that using 0x02006906 with an old BIOS might still cause the same problem, or it may not. That's unclear from what little information I have. So updating your BIOS to the latest from Supermicro is the best thing to do.