I've 2 hosts, running identical BIOS. One does this (hitting 99 degrees), the other seems to be fine. I've opened a ticket with Supermicro, let's see what they say...
Just to keep things in the public eye: Supermicro's initial response was to push back and state that they haven't validated this model for vSphere. However, I've politely pointed them to the VMware Compatibility Guide - System Search. My gut tells me that we need a CPU microcode fix via an updated BIOS, but I need to validate the version running on the host a little later today when I have some free time. I'll keep y'all updated here.
Yes, the microcode patch is the hottest candidate to cause trouble.
BIOS version 1.3 is the latest available from Supermicro, and it's also the version required by the HCL.
The SG bulletin also includes the CPU microcode, so I suspect we'll see the same issues. I'm seeing microcode version 0x20000069 if I boot Ubuntu 20.04 livecd, but I'm not sure how to check if that is loaded by the BIOS or the kernel. Will try an older Ubuntu (maybe 16.04?) and check to see what the microcode shows as there.
Checking in an older version of Ubuntu shows that same microcode, so I guess it must be coming from the BIOS.
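For anyone wanting to repeat this check from a Linux live image, here is a minimal Python sketch that pulls the microcode revision out of /proc/cpuinfo. It assumes an x86 Linux kernel that exposes a "microcode" field there; the function name microcode_revisions is just an illustrative helper, not part of any tool. It only reports the revision currently running and cannot by itself say who loaded it. As a rough cross-check, the kernel log (dmesg) usually contains a microcode message when the kernel applied an update at boot, so seeing no such message, together with the same revision on two different images as above, points towards the BIOS.

```python
import re
from pathlib import Path


def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions found in /proc/cpuinfo-style text."""
    # Each logical CPU has its own "microcode : 0x..." line; collect distinct values.
    return set(re.findall(r"^microcode\s*:\s*(\S+)", cpuinfo_text, flags=re.MULTILINE))


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        # More than one value would mean the cores disagree, which is worth noting.
        print(sorted(microcode_revisions(cpuinfo.read_text())))
    else:
        print("/proc/cpuinfo not found (not a Linux system?)")
```

If every core reports the same revision, the set has a single entry, which makes it easy to compare the two hosts side by side.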
Further updates: I checked esxtop, and saw that all CPU cores were idle. CPU temp jumps from 66 degrees to 98 degrees as soon as the loadscreen flips from loading the modules to initialising VMkernel.
And now for the interesting part: I forgot to shut down at the end of the last boot, and went for lunch. On returning from lunch I saw that the CPU temps were more or less normal, so it seems that something very strange is happening - it was at 98 degrees for at least an hour before I went to lunch, so although esxtop shows no load, the CPU is still kicking out a lot of heat. Time to grab something to graph the temperature, I think.
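On graphing the temperature: a minimal Python sketch of a logger that appends timestamped readings to a CSV file, which any spreadsheet or plotting tool can then graph. How the temperature is actually read is left open on purpose: read_temp is a hypothetical callable you would have to supply yourself (for example wrapping an IPMI or SNMP query), since nothing in this thread pins down a working method yet.

```python
import csv
import time


def log_temperature(read_temp, path, samples, interval_s=10.0, sleep=time.sleep):
    """Append `samples` rows of (unix_time, temp_c) to a CSV file.

    read_temp is a caller-supplied function returning the current CPU
    temperature in degrees Celsius (hypothetical; not provided here).
    """
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        for i in range(samples):
            writer.writerow([int(time.time()), read_temp()])
            f.flush()  # keep data on disk even if the host is powered off hard
            if i < samples - 1:
                sleep(interval_s)
```

Running it from a second machine rather than on the affected host would keep the log going across reboots, so the jump at "initialising VMkernel" and any later drop back to normal would both show up in the same file.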
That's an interesting observation!
Either there's a process that takes an hour to finish (once), or it's going to happen again after each reboot.
I didn't dare to roast my CPU that long. :-)
So is the CPU really getting that hot, or is maybe the temperature sensor reporting wrong data?
Not sure how to check this ...
I will try to verify this.
But I've noticed the fans rotating at max speed and blowing a lot of warm air out of the case. So I guess the temperature is real.
Also, the boot process with 7.0b is much slower than a regular boot with 7.0 GA. Maybe the system throttled the CPU clock speed to protect the CPU from damage.
Frustratingly, despite leaving the box running at that temperature for 3 hours, it doesn't seem to have calmed down. I'm putting that down to "one of those things".
I've rolled both boxes back to GA and all is fine for now. I'm trying to figure out the SNMP MIBs required to monitor the thermal sensors on these boxes; the Supermicro docs are not great.
Same issue with a Supermicro board X11SDV-8C-TP8F
Spent a lot of time debugging this to come to the same conclusion.
Update 16324942 messed something up.
Strangely, about 90% of the time, when the system starts to boot and the sensor on my Supermicro goes red, I can just unplug the PSU for several seconds, plug it back in, and with a bit of luck ESXi will load just fine.
Opened a ticket with Supermicro also.
Got a reply from Supermicro pointing out that my system has not been tested with any 7.0 release and that they cannot help.
Which is true: VMware Compatibility Guide - System Search
Then again, it has been tested with 6.7 U3 VMware Compatibility Guide - System Search
Even later edit:
Just realized SYS-E300-9D-8CN8TP is using the same motherboard as my SYS-5019D-FN8TP.
If SYS-E300-9D-8CN8TP has been tested with 7.0 I wonder why my SYS-5019D-FN8TP is not listed as tested.
What are they testing? The server case? /s
In the interests of gathering all of the relevant info in one location, would you mind sharing your SM support ticket number, RazvanConstantin? DM would be fine.
I'm trying to get the support team at Supermicro to see that this is a wider issue, not just my 2 boxes.
I too have the same issue with the same system and have opened a case with Supermicro.
Strange thing is I own 2 of these systems and only one has this issue.
In case you are still interested, I found the OID for CPU temperature.