VMware Cloud Community
IamTHEvilONE
Immortal

ESXi 6.7 PSOD on AMD White Box Host

I did the best I could to capture the PSOD itself in the attached image. Just checking to see if anyone else has encountered this or something similar before.

This system is a Ryzen 1700 on an AB350-based motherboard. lspci shows this:

0000:00:00.0 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex
0000:00:00.2 Generic system peripheral: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit
0000:00:01.0 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
0000:00:01.1 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [PCIe RP[0000:00:01.1]]
0000:00:01.3 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [PCIe RP[0000:00:01.3]]
0000:00:02.0 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
0000:00:03.0 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
0000:00:03.1 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [PCIe RP[0000:00:03.1]]
0000:00:04.0 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
0000:00:07.0 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
0000:00:07.1 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [PCIe RP[0000:00:07.1]]
0000:00:08.0 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
0000:00:08.1 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [PCIe RP[0000:00:08.1]]
0000:00:14.0 Serial bus controller: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller
0000:00:14.3 Bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge
0000:00:18.0 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0
0000:00:18.1 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1
0000:00:18.2 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2
0000:00:18.3 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3
0000:00:18.4 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4
0000:00:18.5 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5
0000:00:18.6 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6
0000:00:18.7 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7
0000:01:00.0 Mass storage controller: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961 [vmhba1]
0000:02:00.0 Serial bus controller: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset USB 3.1 xHCI Controller
0000:02:00.1 Mass storage controller: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset SATA Controller [vmhba2]
0000:02:00.2 Bridge:
0000:03:00.0 Bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port
0000:03:01.0 Bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port
0000:03:04.0 Bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port
0000:04:00.0 Network controller: Realtek Semiconductor Co., Ltd. Onboard Ethernet
0000:06:00.0 Network controller: Intel Corporation PRO/1000 PT Dual Port Server Adapter [vmnic0]
0000:06:00.1 Network controller: Intel Corporation PRO/1000 PT Dual Port Server Adapter [vmnic1]
0000:07:00.0 Display controller: NVIDIA Corporation GK208B [GeForce GT 710]
0000:07:00.1 Multimedia controller: NVIDIA Corporation GK208 HDMI/DP Audio Controller
0000:08:00.0 :
0000:08:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor
0000:08:00.3 Serial bus controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) USB 3.0 Host Controller
0000:09:00.0 :
0000:09:00.2 Mass storage controller: Advanced Micro Devices Inc AMD FCH SATA Controller [AHCI Mode] [vmhba0]
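
For anyone comparing hardware, the listing above is easier to scan once the lines are split into address, device class, description, and the trailing vmhba/vmnic alias. The following is a minimal Python sketch that does this; it assumes the line layout matches the paste above. The regular expression, the function names (`parse_lspci`, `claimed`), and the `SAMPLE` constant are illustrative and not part of any VMware tool. The sample lines are copied from the listing.

```python
import re

# One line of the pasted listing looks like:
#   <segment:bus:slot.function> <class>: <description> [optional vmhbaN/vmnicN alias]
# This pattern is an assumption based on the paste, not an official format spec.
LINE_RE = re.compile(
    r"^(?P<addr>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+"
    r"(?P<cls>[^:]*):\s*"
    r"(?P<desc>.*?)"
    r"(?:\s*\[(?P<alias>vm(?:hba|nic)\d+)\])?\s*$"
)

# A few lines copied from the listing above.
SAMPLE = """\
0000:01:00.0 Mass storage controller: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961 [vmhba1]
0000:02:00.2 Bridge:
0000:04:00.0 Network controller: Realtek Semiconductor Co., Ltd. Onboard Ethernet
0000:06:00.0 Network controller: Intel Corporation PRO/1000 PT Dual Port Server Adapter [vmnic0]
0000:09:00.2 Mass storage controller: Advanced Micro Devices Inc AMD FCH SATA Controller [AHCI Mode] [vmhba0]
"""


def parse_lspci(text):
    """Return one dict per recognizable line: addr, cls, desc, alias (or None)."""
    devices = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            devices.append(match.groupdict())
    return devices


def claimed(devices):
    """Map each vmhba/vmnic alias to the PCI address it is attached to."""
    return {d["alias"]: d["addr"] for d in devices if d["alias"]}


if __name__ == "__main__":
    for alias, addr in sorted(claimed(parse_lspci(SAMPLE)).items()):
        print(alias, addr)
```

In this sample the Realtek onboard NIC carries no vmnic alias, while both ports of the Intel adapter in the full listing do. That probably means no driver claimed the Realtek device, though this is an inference from the listing rather than something confirmed in the thread.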

I do have a Crucial BX300 SSD on the SATA connections. The odd thing is that it shows latency on the order of 10-15 ms even with minimal I/O while in use. This system was stable under 6.5 with a similar load and similar guest OSs (nested ESXi hosts), but that was strictly using the Samsung NVMe SSD plus NAS storage (the Crucial drive was never connected). I'm going to offload the Crucial disk and use just the Samsung NVMe and NAS again for a bit to see if that stabilizes the system for more than a week.

It took about 2-3 weeks of uptime for the host to PSOD again (the second time), so I could finally capture data.

IamTHEvilONE
Immortal

After asking around, it seems this may be due to AMD's style of Hyperthreading (SMT). There were issues with this in ESXi 6.5 not too long ago. I've disabled SMT and am seeing how that goes.

Sinorama
Enthusiast

The problem may be related to the motherboard. Which AB350 board is it exactly? I have had bad results with an ASRock AB350M Pro 4, a Ryzen 1700, and ESXi. The setup worked for the first month and then started to PSOD, reboot, or freeze while idle, but never while active.

IamTHEvilONE
Immortal

The motherboard is a Gigabyte AB350-M Gaming3.

Disabling SMT has returned stability to the ESXi host. Current uptime is over 24 days.

I think that narrows the issue to something in the CPU, the motherboard, or code that uses that feature.

I also know that ESXi 6.5 had issues with AMD SMT for a long time, where it would PSOD with a different error message. Recent patches fixed that, and this same server ran ESXi 6.5 with SMT on for more than a few weeks, though granted I upgraded to 6.7 rather quickly after that.
