We've had some new ESXi 6.0 Dell servers for the past month and a half. Suddenly one of them PSODed around 1:12 AM last night, toward the end of the Veeam backup run. It had an uptime of 33 days. Thankfully HA/DRS powered the VMs on on other hosts, so downtime was minimal.
Today it seemed OK, and chalking it up to maybe a heavy load during backups, we moved some light loads onto this host. It ran for maybe an hour or two and then PSODed again, and those light loads were powered on on other hosts.
In both instances it shows PF Exception 14.
At 1:12 AM it was PF Exception 14 in world 33433:vmnic0=pollW IP
Around 10:30 AM it's PF Exception 14 in world 45512:vmm0:adfs1 (adfs1 is the name of one of the VMs that was running on this host).
Any ideas? Any secret flags to ignore these kinds of conditions and force the kernel to keep "chugging along"?
OK, attached are two screenshots. One is from 1:12 AM, when I was using the LogMeIn iPhone app. The larger screenshot is from a few minutes ago.
I'm not sure how effective this would be, but I created a 128 GB VM on it and am running memtest86+ inside it. Though like I said, it's a VM, so I'm not sure how effectively it's really checking memory.
The server itself has 256 GB of memory, as do 6 of ours (2 have 128 GB). I didn't want to create a 256 GB memtest86+ VM in case the host PSODs and HA moves the VM and powers it on another host.
Ok it just happened again.
My VM, HirensBootCD, which boots that ISO (I selected memtest86+ from its menu), has been running for 20 minutes or so.
Just PSOD again...
PF Exception 14 in world 36803:vmx-mks:Hire IP 0x41801800464c4 addr 0x5b4
I will try to boot an actual memtest86+ USB stick on bare metal and test the memory outside of ESXi.
The server was working fine for 33 days. No changes or anything.
Ran memtest86 for 24 hours. No errors.
I agree it could be an ESXi bug. However, I have 3 other Dell FC640 blades running in the same chassis, all identically configured: same memory, networking, ESXi 6.0.0 build 7967664, same BIOS, same firmware (at least I think so; it all shipped together).
KNOCK ON WOOD... none of my other 3 Dell blades has experienced this yet. Because of that, I wanted to rule out hardware. ESXi runs off a microSD card, and I just finished setting up the core dump collector service pointing at our vCenter server... so assuming the crash doesn't happen on the host vCenter is running on, hopefully we get a real dump.
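For anyone wanting to do the same, this is roughly the esxcli sequence for pointing a host's core dumps at a network dump collector. The vmkernel interface name and collector IP below are placeholders for your environment; 6500 is the collector's default port:

```shell
# Hedged sketch: send VMkernel core dumps over the network to a dump collector.
# vmk0 and 192.0.2.10 are placeholders -- use your management vmkernel port
# and the address of the host running the dump collector service.
esxcli system coredump network set --interface-name vmk0 \
    --server-ipv4 192.0.2.10 --server-port 6500
esxcli system coredump network set --enable true

# Confirm the collector is reachable before trusting it with the next PSOD.
esxcli system coredump network check
```

With a microSD boot device this matters more than usual, since there may be no local diagnostic partition big enough for a full dump.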
I sent VMware screenshots of the memtest86 runs. I kept them updated throughout the process, but support, at least online or via email, is dreadfully SLOW. I will likely have to get on the phone later today.
They have the following:
Broadcom Corporation QLogic 57840 10 Gigabit Ethernet Adapter (quad port)
Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet (quad port)
256 GB DDR4-2666
Intel Xeon Gold 6126 2.6G 12C/24T, 10.4GT/s, 19.25M Cache, Turbo, HT (125W)
Performance BIOS settings.
All 4 housed in a Dell FX2 2U Chassis.
I brought the server back up with no VMs running, just long enough to configure core dump to the dump collector and to grab a copy of the existing dump. It was up for just over an hour, idle, doing nothing, not running anything, and yet we got another Exception 14.
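In case it helps anyone following along, this is roughly how I pulled the vmkernel log out of the saved dump. The filename below is just an example, not an actual path from this host, and you should check `vmkdump_extract` on your own build before relying on it:

```shell
# Hedged sketch: extract the vmkernel log from a saved core dump (zdump).
# The path below is an example placeholder, not a real path from this host.
vmkdump_extract -l /vmfs/volumes/datastore1/vmkernel-zdump.1
# This should produce a vmkernel-log.* file next to the zdump containing
# the backtrace leading up to the PSOD -- useful to attach to the SR.
```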
New one at 12:02 AM EST today, 4/28/2018: DF Exception 8.
There was no load on this host. Given its instability, we just boot it and let it run without any VM loads. So it was sitting idle.
I do have one open. They are just extremely slow to respond. I understand they have a lot of information to go through. I have a zdump of practically every PSOD uploaded, as well as system logs extracted via the client, along with all of the screenshots. memtest86 ran cleanly for 24 hours, so I don't think it's memory-related, though it does look like hardware to me. I have 3 other FC640 blades in this chassis, identically configured, all running production VM loads, no issues ::KNOCK ON WOOD::
The chassis and the 4 blades were all ordered together at the beginning of this year, so it all arrived in one piece.
So this case is still open, numerous PSODs and dumps later. Replaced both NICs. Still happens. They suggested reinstalling. Booted from the Dell ESXi installer; it got about 10% of the way through the install and BOOM, PSOD.
Can't wait to hear what they tell me next.
This could be a system board issue as well.
You can do the following: start the host with a minimal configuration (for example, leave one CPU and one memory module), then check whether the issue persists.
After further testing it actually appears to be the second CPU.
If the second CPU is out of the system, all memtest86 7.5 checks pass; specifically, test 6 will show you in seconds whether it's bad or not.
Put that second CPU in CPU slot 1 and then memtest86 7.5 test 6 fails immediately.
I reinstalled ESXi, configured it, added it to vCenter, and updated it, all on one CPU, and with a bunch of stress VMs running it's rock solid.
Dell is going to RMA the CPU.
I also have multi-bit memory errors on one DIMM. So we're looking at one CPU and one DIMM. The errors follow the DIMM (put it in another slot, and the BIOS and iDRAC errors follow the DIMM).
The CPU is an Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz.
PF Exception 14 is a page fault, which can be caused by both software and hardware problems. Since your other blades have the same configuration and don't crash, I think it has more to do with your hardware.
You need to raise an SR to get the core dumps diagnosed, but you may want to try replacing the RAM in the blade, as this could be one of the causes.
This article explains PF Exceptions 13 and 14 further.