What changed on the host? Any new hardware added?
No, we didn't change anything. It's a completely new Dell server. The system came preinstalled with 6.7.0, and we just set up our VMs on it and installed it at our customer's office. Now it crashes weekly with this error.
Double check and make sure that server has all the latest BIOS, firmware, etc. Depending on when it was built it could be missing some important microcode.
BIOS, firmware and drivers are already checked!
Next thing is to patch that ESXi host up to the latest available build. Based on the PSOD, you're on the GA release from back in April and there are two later builds which address issues. Highly recommend patching up to latest ASAP.
Going by the functions reported in the PSOD stack, looks like excessive logging is causing the host to crash. I'm assuming vmw_ahci is the driver catering to the local datastore on which the scratch partition is configured. Check if there is any debug/verbose level logging enabled for any of the components of the host such as hostd, vpxa, NIC/HBA driver, etc. If yes, change the logging level to Info/Warning.
Hey, thanks for your answer. Do you know how I can change the logging level?
Got the same error on one out of 5 identical Dell T440 servers with the following CPUs (both sockets are used)
Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz Model 85 Stepping 4
Been in contact with Dell Support - this happens on ESXi 6.5 U1, U2, 6.7 GA and 6.7 U1 (it doesn't matter whether I install with the original VMware ISO or use the customized Dell image).
One CPU has been replaced, but so far that has not resolved the issue.
Of course, the server's IPMI shows that everything is OK.
It doesn't matter whether there is load on the server or anything is happening on it. The crash is reproducible and happens after at least 2 days.
Did you come up with a resolution?
Meoli - were you able to find a resolution to this issue? We are having the same purple screen message with the following CPUs
Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz Model 85 Stepping 4
We need to look at vmkernel.log from just before the server fails with the PSOD and look for any errors. If you see entries or errors from "vmw_ahci", check which controller uses vmw_ahci. If it's the Intel SATA AHCI controller, find out whether it is used by a CD-ROM device, using the command esxcli storage core device list and also the KB: VMware Knowledge Base
If it is used for a CD-ROM, the most likely cause is a failed CD-ROM device producing data transfer errors between the CD-ROM and the AHCI controller; the driver then prints too many log messages, which causes the PCPU lockup.
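To make the log check above less tedious, here is a rough Python sketch that pulls out the lines mentioning vmw_ahci from a copy of vmkernel.log. It is only an illustration: the function names are made up for this post, it assumes you have copied the log (usually /var/log/vmkernel.log) off the host, and it does a plain substring match without parsing timestamps.

```python
from collections import deque
from typing import Iterable, List


def count_driver_lines(lines: Iterable[str], driver: str = "vmw_ahci") -> int:
    """Count log lines that mention the given driver name (substring match)."""
    return sum(1 for line in lines if driver in line)


def last_driver_lines(
    lines: Iterable[str], driver: str = "vmw_ahci", keep: int = 20
) -> List[str]:
    """Return the last `keep` lines that mention the driver, oldest first.

    A bounded deque keeps memory use flat even for a very large log.
    """
    recent = deque(maxlen=keep)
    for line in lines:
        if driver in line:
            recent.append(line.rstrip("\n"))
    return list(recent)
```

Pass either function an open file object for the copied log (opened with errors="replace" in case the file was cut off by the crash). A very high count, with the last matches sitting right before the log ends, would fit the excessive-logging theory, but it does not prove it.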
Unfortunately, replying by e-mail did not work.
In our case it could only be resolved by replacing the motherboard, so it may be a socket fault or something on the mainboard itself. It is strange, though, that you find so many entries if you search for PSODs with these CPUs...