VMware Cloud Community
LucaObermeier
Contributor

Strange error on ESXi 6.7.0

Hey guys,

Our server crashes once or twice a week, and then this error comes up. Can someone explain it to me? I've never seen it before.

CPU: 2x Intel Xeon Silver 4110 2.1G, 8C/16T, 9.6GT/s, 11M Cache

11 Replies
EricChigoz
Enthusiast

Hello Luca,

What changed on the host? Was any new hardware added?

Find this helpful? Please award points. Thank you !
LucaObermeier
Contributor

No, we didn't change anything. It's a completely new Dell server. The system came preinstalled with 6.7.0; we just got our VMs running on it and set it up at our customer's office. Now it crashes weekly with this error.

daphnissov
Immortal

Double check and make sure that server has all the latest BIOS, firmware, etc. Depending on when it was built it could be missing some important microcode.

LucaObermeier
Contributor

BIOS, firmware and drivers are already checked!

daphnissov
Immortal

Next thing is to patch that ESXi host up to the latest available build. Based on the PSOD, you're on the GA release from back in April and there are two later builds which address issues. Highly recommend patching up to latest ASAP.

SupreetK
Commander

Going by the functions reported in the PSOD stack, it looks like excessive logging is causing the host to crash. I'm assuming vmw_ahci is the driver serving the local datastore on which the scratch partition is configured. Check whether debug/verbose-level logging is enabled for any of the host's components, such as hostd, vpxa, or the NIC/HBA drivers. If so, change the logging level to Info/Warning.
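
For anyone who wants to check this across several hosts, here is a minimal Python sketch that flags components logging above Info. It assumes the levels have already been collected by hand into a mapping. The setting names `Config.HostAgent.log.level` and `Vpx.Vpxa.config.log.level` and the list of level names are my assumptions about the advanced options, so verify them on your build. The helper `noisy_settings` and the example values are hypothetical.

```python
# Level names ordered from quietest to noisiest. This ordering and these
# names are an assumption about hostd/vpxa log levels, not from the thread.
LEVELS = ["none", "error", "warning", "info", "verbose", "trivia"]


def noisy_settings(settings, threshold="info"):
    """Return the settings whose log level is more verbose than `threshold`."""
    limit = LEVELS.index(threshold)
    noisy = {}
    for name, level in settings.items():
        level = level.strip().lower()
        # Unknown level names are skipped rather than guessed at.
        if level in LEVELS and LEVELS.index(level) > limit:
            noisy[name] = level
    return noisy


# Hypothetical values, typed in by hand from a host's advanced settings.
example = {
    "Config.HostAgent.log.level": "verbose",
    "Vpx.Vpxa.config.log.level": "info",
}
print(noisy_settings(example))
```

Anything the function returns is a candidate for being set back to Info or Warning.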

Cheers,

Supreet

LucaObermeier
Contributor

Hey, thanks for your answer. Do you know how I can change the logging level?

meoli
Enthusiast

I got the same error on one out of five identical Dell T440 servers with the following CPUs (both sockets are populated):

Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz, Model 85, Stepping 4

I have been in contact with Dell Support. This happens on ESXi 6.5 U1, 6.5 U2, 6.7 GA, and 6.7 U1, and it doesn't matter whether I install from the original VMware ISO or the customized Dell image.

One CPU has been replaced, but so far that has not resolved the case.

Of course, the server's IPMI shows that everything is OK.

It doesn't matter whether the server is under load or idle. The crash is reproducible and happens after at least 2 days.

LucaObermeier, did you come up with a resolution?

PCGuyLLC
Contributor

meoli, were you able to find a resolution to this issue? We are getting the same purple screen message with the following CPUs:

Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz, Model 85, Stepping 4

devakumar
VMware Employee

We need to look at vmkernel.log just before the server fails with the PSOD and check for any errors. If you see entries or errors from "vmw_ahci", check which controller uses vmw_ahci. If it is the Intel SATA AHCI controller, find out whether it is used by a CD-ROM device, using the command esxcli storage core device list and also the KB: VMware Knowledge Base
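
As a rough aid for the device check, the Python sketch below picks CD-ROM devices out of saved `esxcli storage core device list` output. It assumes each device block starts with an unindented device ID followed by indented `Field: value` lines, including a `Device Type:` field. The sample text is made up for illustration and is not from any host in this thread, and `cdrom_devices` is a hypothetical helper name.

```python
def cdrom_devices(esxcli_output):
    """Return the IDs of devices whose 'Device Type' field is CD-ROM."""
    found = []
    current = None
    for line in esxcli_output.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # An unindented line starts a new device block (the device ID).
            current = line.strip()
        elif current and line.strip().startswith("Device Type:"):
            dev_type = line.split(":", 1)[1].strip()
            if dev_type == "CD-ROM":
                found.append(current)
    return found


# Made-up sample in the assumed layout; real output has many more fields.
SAMPLE = """\
mpx.vmhba1:C0:T0:L0
   Display Name: Local CD-ROM (mpx.vmhba1:C0:T0:L0)
   Device Type: CD-ROM
naa.600508b1001c0000
   Display Name: Local Disk (naa.600508b1001c0000)
   Device Type: Direct-Access
"""
print(cdrom_devices(SAMPLE))
```

The adapter name embedded in the device ID (vmhba1 in the made-up sample) can then be compared against the adapter that is bound to vmw_ahci.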

If it is used by a CD-ROM device, the most likely cause is a failed CD-ROM drive that produces data transfer errors between the CD-ROM and the AHCI controller. The driver then prints too many log messages, which causes the PCPU lockup.
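
To see whether vmw_ahci really floods the log before a crash, a small Python sketch like the one below can count matching lines per minute in a saved copy of vmkernel.log. It assumes every line begins with an ISO-style timestamp such as `2018-10-01T02:14:07.101Z`. The sample lines are invented placeholders, not real driver messages, and `ahci_lines_per_minute` is a hypothetical helper name.

```python
from collections import Counter


def ahci_lines_per_minute(log_text, needle="vmw_ahci"):
    """Count log lines containing `needle`, grouped by minute."""
    counts = Counter()
    for line in log_text.splitlines():
        if needle in line:
            # The first 16 characters are "YYYY-MM-DDTHH:MM" under the
            # assumed timestamp format.
            counts[line[:16]] += 1
    return counts


# Invented placeholder lines in the assumed format.
SAMPLE_LOG = """\
2018-10-01T02:14:07.101Z cpu3:2097500)vmw_ahci[0000001f]: example message A
2018-10-01T02:14:07.102Z cpu3:2097500)vmw_ahci[0000001f]: example message B
2018-10-01T02:14:59.900Z cpu5:2097321)some other component: example
2018-10-01T02:15:00.001Z cpu3:2097500)vmw_ahci[0000001f]: example message C
"""
for minute, n in sorted(ahci_lines_per_minute(SAMPLE_LOG).items()):
    print(minute, n)
```

A sharp jump in the per-minute count shortly before the PSOD timestamp would be consistent with the excessive-logging explanation.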

Thanks

meoli
Enthusiast

Unfortunately, replying by e-mail did not work :(

In our case, it could only be resolved by replacing the motherboard, so it may be a socket fault or something else on the mainboard itself. It is strange, though, that you find so many results when you search for PSODs with these CPUs...

Best regards!

meoli
