We've had some new ESXi 6.0 Dell servers for the past month and a half. Suddenly one of them PSODed around 1:12 AM last night, toward the end of the Veeam backup run. It had an uptime of 33 days. Thankfully HA/DRS powered the VMs on on other hosts, so downtime was minimal.
Today it seemed OK, and chalking it up to maybe a heavy load during backups, we moved some light loads onto this host. It ran for maybe an hour or two and then PSODed again, and those light loads were powered on on other hosts.
In both instances it shows PF Exception 14.
At 1:12 AM it was PF Exception 14 in world 33433:vmnic0=pollW IP
Around 10:30 AM it's PF Exception 14 in world 45512:vmm0:adfs1 (adfs1 is the name of one of the VMs that was running on this host).
Any ideas? Any secret flags to ignore these kinds of conditions and force the kernel to keep "chugging along"?
OK, attached are two screenshots. One is from 1:12 AM, when I was using the LogMeIn iPhone app. The larger screenshot is from a few minutes ago.
I'm not sure how effective this would be, but I created a 128 GB VM on it and am running memtest86+ inside it. Though like I said, it's a VM, so I'm not sure how effectively it's really checking memory.
The server itself has 256 GB of memory, as do 6 of ours (2 have 128 GB). I didn't want to create a 256 GB memtest86+ VM in case the host PSODs and HA moves the VM and powers it on another host.
Ok it just happened again.
My VM, HirensBootCD, which boots that ISO (I selected memtest86+ from its menu), has been running for 20 minutes or so.
Just PSOD again...
PF Exception 14 in world 36803:vmx-mks:Hire IP 0x41801800464c4 addr 0x5b4
I will try to boot an actual memtest86+ USB stick on bare metal and test the memory outside of ESXi.
The server was working fine for 33 days. No changes or anything.
Ran memtest86 for 24 hours. No errors.
I agree it could be an ESXi bug. However, I have 3 other Dell FC640 blades running in the same chassis, all identically configured: same memory, networking, ESXi 6.0.0 build 7967664, same BIOS, same firmware (at least I think so; it all shipped together).
KNOCK ON WOOD... none of my other 3 Dell blades has experienced this yet. Because of that, I wanted to rule out hardware. ESXi runs off a microSD card, and I just finished setting up the core dump collector service pointing at our vCenter server... so assuming the crash doesn't happen on the host vCenter is running on, hopefully we get a real dump.
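For anyone wanting to do the same, this is roughly the esxcli sequence for pointing a host's core dumps at a network dump collector. The vmkernel interface name and collector IP below are placeholders for your environment; 6500 is the collector's default port:

```shell
# Hedged sketch: send VMkernel core dumps over the network to a dump collector.
# vmk0 and 192.0.2.10 are placeholders -- use your management vmkernel port
# and the address of the host running the dump collector service.
esxcli system coredump network set --interface-name vmk0 \
    --server-ipv4 192.0.2.10 --server-port 6500
esxcli system coredump network set --enable true

# Confirm the collector is reachable before trusting it with the next PSOD.
esxcli system coredump network check
```

With a microSD boot device this matters more than usual, since there may be no local diagnostic partition big enough for a full dump.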
I sent VMware screenshots of the memtest86 runs. I kept them updated throughout the process, but support, at least online or via email, is dreadfully SLOW. I will likely have to get on the phone later today.
They have the following:
Broadcom Corporation QLogic 57840 10 Gigabit Ethernet Adapter (quad port)
Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet (quad port)
256 GB DDR4-2666
Intel Xeon Gold 6126 2.6G 12C/24T, 10.4GT/s, 19.25M Cache, Turbo, HT (125W)
Performance BIOS settings.
All 4 housed in a Dell FX2 2U Chassis.
I brought the server back up with no VMs running, just long enough to configure core dump to the dump collector and to grab a copy of the existing dump. It was up for just over an hour, idle, doing nothing, not running anything, and yet we got another Exception 14.
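In case it helps anyone following along, this is roughly how I pulled the vmkernel log out of the saved dump. The filename below is just an example, not an actual path from this host, and you should check `vmkdump_extract` on your own build before relying on it:

```shell
# Hedged sketch: extract the vmkernel log from a saved core dump (zdump).
# The path below is an example placeholder, not a real path from this host.
vmkdump_extract -l /vmfs/volumes/datastore1/vmkernel-zdump.1
# This should produce a vmkernel-log.* file next to the zdump containing
# the backtrace leading up to the PSOD -- useful to attach to the SR.
```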
New one at 12:02 AM EST today, 4/28/2018: DF Exception 8.
There was no load on this host. Given its instability, we just boot it and let it run without any VM loads. So it was sitting idle.
I do have one open. They are just extremely slow to respond. I understand they have a lot of information to go through. I have a zdump of practically every PSOD uploaded, as well as system logs extracted via the client, along with all of the screenshots. memtest86 ran cleanly for 24 hours, so I don't think it's memory-related, though it does look like hardware to me. I have 3 other FC640 blades in this chassis, identically configured, all running production VM loads, no issues ::KNOCK ON WOOD::
The chassis and the 4 blades were all ordered together at the beginning of this year, so it all arrived in one piece.
So this case is still open, numerous PSODs and dumps later. Replaced both NICs. Still happens. They suggested reinstalling. Booted from the Dell ESXi installer; it got about 10% of the way through the install and BOOM, PSOD.
Can't wait to hear what they tell me next.
This could be a system board issue as well.
You can do the following: start the host with a minimal configuration (for example, leave one CPU and one memory module), then check whether the issue persists.
After further testing it actually appears to be the second CPU.
If the second CPU is out of the system, all memtest86 7.5 checks pass; specifically, test 6 will show you in seconds whether it's bad or not.
Put that second CPU in CPU slot 1 and then memtest86 7.5 test 6 fails immediately.
I reinstalled ESXi, configured it, added it to vCenter, and updated it, all on one CPU, and with a bunch of stress VMs running it's rock solid.
Dell is going to RMA the CPU.
I also have multi-bit memory errors on one DIMM. So we're looking at one CPU and one DIMM. The errors follow the DIMM (put it in another slot, and the BIOS and iDRAC errors follow the DIMM).
The CPU is an Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz.
PF Exception 14 is a page fault, which can be caused by both software and hardware problems. Since your other blades have the same configuration and don't crash, I think it has more to do with your hardware.
You need to raise an SR to get the core dumps diagnosed, but you may want to try replacing the RAM in the blade, as this could be one of the causes.
This article explains PF Exceptions 13 and 14 further.