I did the best I could to capture the PSOD itself in the attached image. Just checking to see if anyone else may have encountered this or similar before.
This system is a Ryzen 1700 on an AB350 based motherboard. lspci shows this:
0000:00:00.0 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex
0000:00:00.2 Generic system peripheral: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit
0000:00:01.0 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
0000:00:01.1 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [PCIe RP[0000:00:01.1]]
0000:00:01.3 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [PCIe RP[0000:00:01.3]]
0000:00:02.0 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
0000:00:03.0 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
0000:00:03.1 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [PCIe RP[0000:00:03.1]]
0000:00:04.0 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
0000:00:07.0 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
0000:00:07.1 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [PCIe RP[0000:00:07.1]]
0000:00:08.0 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
0000:00:08.1 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [PCIe RP[0000:00:08.1]]
0000:00:14.0 Serial bus controller: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller
0000:00:14.3 Bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge
0000:00:18.0 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0
0000:00:18.1 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1
0000:00:18.2 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2
0000:00:18.3 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3
0000:00:18.4 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4
0000:00:18.5 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5
0000:00:18.6 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6
0000:00:18.7 Bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7
0000:01:00.0 Mass storage controller: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961 [vmhba1]
0000:02:00.0 Serial bus controller: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset USB 3.1 xHCI Controller
0000:02:00.1 Mass storage controller: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset SATA Controller [vmhba2]
0000:02:00.2 Bridge:
0000:03:00.0 Bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port
0000:03:01.0 Bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port
0000:03:04.0 Bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port
0000:04:00.0 Network controller: Realtek Semiconductor Co., Ltd. Onboard Ethernet
0000:06:00.0 Network controller: Intel Corporation PRO/1000 PT Dual Port Server Adapter [vmnic0]
0000:06:00.1 Network controller: Intel Corporation PRO/1000 PT Dual Port Server Adapter [vmnic1]
0000:07:00.0 Display controller: NVIDIA Corporation GK208B [GeForce GT 710]
0000:07:00.1 Multimedia controller: NVIDIA Corporation GK208 HDMI/DP Audio Controller
0000:08:00.0 :
0000:08:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor
0000:08:00.3 Serial bus controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) USB 3.0 Host Controller
0000:09:00.0 :
0000:09:00.2 Mass storage controller: Advanced Micro Devices Inc AMD FCH SATA Controller [AHCI Mode] [vmhba0]
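For anyone comparing their own hardware against the listing above, here is a minimal Python sketch that picks out which PCI addresses carry a vmhba/vmnic alias and which entries came back with no description. The function name parse_lspci and the SAMPLE excerpt are my own illustration, not part of any ESXi tooling, and the regex assumes the "address class: description [alias]" layout seen in this output.

```python
import re

# Assumed line layout, based on the listing in this thread:
#   "<domain:bus:dev.fn> <class>: <description> [optional alias]"
LINE_RE = re.compile(
    r"^(?P<addr>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\s+"
    r"(?P<cls>[^:]*):\s*(?P<desc>.*)$"
)
# ESXi appends aliases such as [vmhba0] or [vmnic1] at the end of the line.
ALIAS_RE = re.compile(r"\[(vmhba\d+|vmnic\d+)\]\s*$")


def parse_lspci(text):
    """Hypothetical helper: turn lspci text into a list of dicts."""
    devices = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a device line
        desc = match.group("desc").strip()
        alias = ALIAS_RE.search(desc)
        devices.append({
            "addr": match.group("addr"),
            "cls": match.group("cls").strip(),
            "desc": desc,
            "alias": alias.group(1) if alias else None,
        })
    return devices


# A few lines copied from the listing above.
SAMPLE = """\
0000:01:00.0 Mass storage controller: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961 [vmhba1]
0000:02:00.2 Bridge:
0000:06:00.0 Network controller: Intel Corporation PRO/1000 PT Dual Port Server Adapter [vmnic0]
0000:08:00.0 :
0000:09:00.2 Mass storage controller: Advanced Micro Devices Inc AMD FCH SATA Controller [AHCI Mode] [vmhba0]
"""

devices = parse_lspci(SAMPLE)
# Map each ESXi alias to its PCI address.
aliased = {d["alias"]: d["addr"] for d in devices if d["alias"]}
# Addresses whose description is empty in the output.
unnamed = [d["addr"] for d in devices if not d["desc"]]

print(aliased)
print(unnamed)
```

This only organizes the text. It does not say anything about why some entries have no description.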
I do have a Crucial BX300 SSD on the SATA connections. The odd thing is that it shows latency on the order of 10-15 ms even with minimal I/O to it. This system was stable under 6.5 with a similar load and guest OSs (nested ESXi hosts), but that was strictly using the Samsung NVMe SSD plus NAS storage (the Crucial drive was never connected). I'm going to offload the Crucial disk and use only the Samsung NVMe and the NAS again for a while, to see whether the system stays stable for more than a week.
It took about 2-3 weeks of uptime to finally PSOD and reproduce the issue (for the second time) so I could actually capture data.
After asking around, it seems this may be due to AMD's implementation of hyper-threading (SMT). There were issues with this in ESXi 6.5 not too long ago. I've disabled SMT and will see how that goes.
The problem may be related to the motherboard. Which AB350 is it exactly? I have had bad results with an ASRock AB350M Pro 4 with a Ryzen 1700 and ESXi. The setup worked for the first month and then started to PSOD, reboot, or freeze while idle, but never under load.
The motherboard is a Gigabyte AB350-M Gaming3.
Disabling SMT has returned stability to the ESXi host. Current uptime is over 24 days.
I think that narrows the issue to something in the CPU, the motherboard, or code that uses that feature.
I also know that ESXi 6.5 had issues with AMD SMT for a long time, as it would PSOD with a different error message. Recent patches fixed that, and this same server ran ESXi 6.5 with SMT on for more than a few weeks, though granted, I upgraded to 6.7 rather quickly after that.