VMware Cloud Community
seanmcg182
Contributor

PF Exception 14 in world

I hope I'm posting this in the right place...

Been running a server for a few months, so I'm very new to this whole ESXi thing, and just got my first PSOD... rebooted, and got another less than an hour later...

I was running ESXi 6.5U3, modified with drivers for Realtek and, I think, a 6.2 NVMe driver, as my original NVMe wasn't being recognized with the 6.5 drivers... maybe 3 weeks ago, I swapped that NVMe with a newer one (I'll come back to that later)...

ESXi itself is installed on a USB drive; the NVMe is just a datastore for the VMs.

After a few PSODs, I decided to try reinstalling with ESXi 7U2, hoping that it would solve my issues... it did not.

All these pictures are from when 6.5U3 was still installed.

 

I'm guessing it's a hardware problem, but I don't know where to begin diagnosing which piece of hardware... everything in the case is less than 4 months old, and it was running fine for 3 months...

 

The only things newer than everything else are 2 sticks of the RAM, 4 HDDs for a VM-hosted NAS (passthrough), an Intel NIC, and the NVMe datastore...

I can't imagine the 4 HDDs would cause an ESXi crash when they are just passed through to a VM for the NAS. ESXi does not read from them at all, and the SAS card they are connected to has been there since the beginning.

The NVMe is a datastore for my 8 VMs. It's a Samsung 970 EVO, and it's been in there for 3 weeks.

I originally started out with 2x16GB of RAM, and upgraded to 4x16GB 3 weeks ago. I'm currently running a Memtest, but I feel like if there was a problem, it would have popped up a lot sooner than 3 weeks.

 

The Intel NIC I only installed when I moved to 7U2, which was already after the PSODs started.

Any ideas for diagnosis/what to replace first?

image2.jpg

image0.jpg

4 Replies
vbondzio
VMware Employee

Looks "hardwarey". Is SMT enabled? (i.e. are CPU 2 / 3 the same core) Do you have more screenshots to check for where the recursive panic is happening? Does the host have two different sockets? If yes, can you switch them and does the issue follow?

Redhatcc
Enthusiast

When you reimaged did you use the vendor specific ISO? Such as if it was a Dell R730, you would use the Dell EMC custom ISO to initially install the OS, then minor patches to get it up to the latest and greatest. 

Ke4
Contributor

Hello.

Did you find the problem?

It looks like I have the same problem.

ESXi 6.7 with the Intel NVMe driver and the latest updates, installed on a USB flash drive.

RAID1 NVMe.

PSODs began to appear a month after installation.

seanmcg182
Contributor

Hi, So in my case, it was kinda a crapshoot… I went from one issue to the next without realizing they were different issues.

I BELIEVE my original issue was those 2 new RAM sticks I had added. Upon further spec research, they had different timings than the original 2. They were the same brand, series, and size… but apparently slightly different models.

It is also possible that my non-default drivers exacerbated the issue.

 

I was also encountering an issue where ESXi would partially brick the flash drive, essentially turning it read-only, which caused a near-identical PSOD. It would boot fine and run for a while, then the PSODs would return… This was patched in 7.0U2c, I believe? Or maybe U3, I don't remember tbh.

 

Between a new Flash Drive, fixing my mismatched RAM timings, and using a fresh unmodified install of ESXi 7.0U3, I haven't had a PSOD in maybe 4 months now?

 

I’m unsure which update the USB issue first appeared in, but if your issue randomly appeared without any hardware changes, that might be where I’d start.

 

How long does the server run for (if at all) before crashing? If it's able to run for a few hours, try checking the Host > Monitor > Events section… The USB issue was presenting itself with a warning along the lines of “BOOTBANK has locked up” or something… Sorry, after all this time I don't remember the exact error.

 

If you are getting an error about a volume locking up, it sounds like it's the USB issue, for which the only solution I know of is a new USB drive and a fresh install… but unless you update to 7.0U3, it would reappear in time.
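
If scrolling through the Events list by hand is tedious, something like the following could filter an exported log or events text file for lines that hint at the USB/bootbank problem. This is only a rough sketch: the keyword list is my own guess (as said above, I don't remember the exact wording of the warning), and `find_suspect_lines` is a made-up helper name, not anything that ships with ESXi.

```python
import re
import sys
from typing import Iterable, List

# ASSUMPTION: these keywords are illustrative guesses, not exact ESXi message text.
KEYWORDS = ("bootbank", "locked", "read-only")


def find_suspect_lines(lines: Iterable[str], keywords=KEYWORDS) -> List[str]:
    """Return the lines that contain any keyword, ignoring case."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [line.rstrip("\n") for line in lines if pattern.search(line)]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scan_events.py exported_events.txt
    with open(sys.argv[1], errors="replace") as fh:
        for hit in find_suspect_lines(fh):
            print(hit)
```

Run it against whatever text export you have; if nothing matches, try adjusting the keywords to the wording you actually see in your Events tab.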

 

The good thing about ESXi is that if you have the VMs on the NVMe, re-importing them after an ESXi reinstall is easy.