VSAN Node Crashed - R730xd - PF Exception 14 in wo...

justinbennett · ‎11-02-2015

Anyone had a similar issue?

Host has a PERC H730p controller. Looks like the disks were resetting prior to the crash according to the system's lifecycle controller.

On call with Dell Support. Planning on a VMware support call too.

Thanks in advance!

Justin

justinbennett · ‎11-03-2015

In case anyone wants some more details on the node's firmware versions.

I've also taken care of the LSI Firmware and Driver issue, mentioned in the KB below, on 9/10 - after it had crashed on 9/9.

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=210966...

depping · ‎11-03-2015

Please file a SR as well with VMware. If you have all the latest firmware/driver then this should not be happening. I have already pointed VMware people to your thread, please post your SR number here as well as that makes it easier for VMware people if they want to help.

justinbennett · ‎11-03-2015

Dell Support sent out a backplane and PERC cable. The replacement backplane was faulty. Waiting on another backplane.

Also opened the VMware case for good measure. SR 15793487011

Thank you!

Justin

justinbennett · ‎11-03-2015

More info - software VIB's loaded...

JohnNicholsonVM · ‎11-03-2015

I know Dell has been advising specifically on their Intel re-branded drives (3610, 3710) to update the firmware on them. Has this happened yet?

justinbennett · ‎11-03-2015

I'm running Seagate 10K SAS and Toshiba SSD SAS drives.

justinbennett · ‎11-03-2015

UPDATE:

I continued to get voltage errors on the second backplane Dell sent. The system wouldn't boot at all with the newer backplane cards.

They believe the power cable may have been the issue. Just in case, I got the kitchen sink full of parts - including the motherboard.

Well, it ended up being the motherboard causing the voltage error. It probably also caused the drives to reset. I suspect that the earlier-revision backplane tolerated the incorrect voltage better.

I'll let you know if this resurfaces.

I haven't heard back on the VMware SR yet. I have uploaded the logs.

Thank you all!

-Justin

justinbennett · ‎11-03-2015

Oh... here was the fun...

JohnNicholsonVM · ‎11-03-2015

I know another VSAN R730 customer who Dell did the same replacement set on. Curious if there was a bad batch or something.

I've also heard anecdotal references from large Dell resellers about power when using the mid-plane with more power hungry drives.

cdekter · ‎11-04-2015

Hi Justin,

Thanks for reporting this issue. I have discussed this issue with the VSAN hardware certification team. There is an updated version of the driver that the vendor says will handle this type of failure more gracefully, by preventing a PSOD. We have put this driver high on the priority list for certification testing and the VSAN HCL will be updated as soon as this is complete.

justinbennett · ‎11-04-2015

Another node in the VSAN just crashed with another PSOD referencing PF Exception 14 ... vmhba0.

On the phone with Dell Support and pulling logs.

justinbennett · ‎11-04-2015

Pulled this out of the vmkernel-zdump fragments

2015-11-05T00:11:49.003Z cpu1:5097074)lsi_mr3: getSpanDumpProgress: Faulting world regs Faulting world regs (01/13)

DumpProgress: Vmm code/data Vmm code/data (02/13)

DumpProgress: Vmk code/rodata/stack Vmk code/rodata/stack (03/13)

DumpProgress: Vmk data/heap Vmk data/heap (04/13)

DumpProgress: PCPU PCPU (05/13)

2015-11-05T00:14:04.079Z cpu19:33418)Dump: 3571: Dumped 2 pages of recentMappings

DumpProgress: World-specific data World-specific data (06/13)

DumpProgress: Xmap Xmap (07/13)

2015-11-05T00:14:41.782Z cpu19:33418)XMap: 1565: Dumped 104979 pages

DumpProgress: VASpace VASpace (08/13)

2015-11-05T00:14:41.824Z cpu19:33418)HeapMgr: 902: Dumping HeapMgr region with 49042 PDEs.

2015-11-05T00:22:31.398Z cpu19:33418)VAArray: 600: Dumping VAArray region

2015-11-05T00:22:31.400Z cpu19:33418)Timer: 1594: Dumping Timer region with 79 PDEs.

2015-11-05T00:22:31.610Z cpu19:33418)FastSlab: 1169: Dumping FastSlab region with 32768 PDEs.

2015-11-05T01:08:28.689Z cpu19:33418)MPage: 734: Dumping MPage region

2015-11-05T01:18:19.213Z cpu19:33418)VAArray: 600: Dumping VAArray region

2015-11-05T01:18:19.455Z cpu19:33418)PShare: 3133: Dumping pshareChains region with 16 PDEs.

2015-11-05T01:18:51.690Z cpu19:33418)VASpace: 1101: VASpace "WorldStore" [439100000 - 439500001] had no registered dump handler.

2015-11-05T01:18:51.690Z cpu19:33418)VASpace: 1101: VASpace "memNodeLookup" [439540000 - 439540001] had no registered dump handler.

2015-11-05T01:18:51.690Z cpu19:33418)VASpace: 1101: VASpace "vmkStats" [439580000 - 439d80000] had no registered dump handler.

2015-11-05T01:18:51.690Z cpu19:33418)VASpace: 1101: VASpace "pageRetireBitmap" [439d80000 - 439d80c10] had no registered dump handler.

2015-11-05T01:18:51.690Z cpu19:33418)VASpace: 1101: VASpace "pageRetireBitmapIdx" [439dc0000 - 439dc0001] had no registered dump handler.

2015-11-05T01:18:51.690Z cpu19:33418)VASpace: 1101: VASpace "llswap" [43ae00000 - 43af21000] had no registered dump handler.

2015-11-05T01:18:51.690Z cpu19:33418)VASpace: 1101: VASpace "LPageStatus" [43af40000 - 43af40182] had no registered dump handler.

2015-11-05T01:18:51.690Z cpu19:33418)VASpace: 1101: VASpace "LSOMVaSpace" [43af80000 - 43ba20000] had no registered dump handler.

2015-11-05T01:18:51.690Z cpu19:33418)Migrate: 359: Dumping Migrate region with 49152 PDEs

2015-11-05T01:19:11.552Z cpu19:33418)VASpace: 1101: VASpace "XVMotion" [43d240000 - 43d260000] had no registered dump handler.

DumpProgress: PFrame PFrame (09/13)

2015-11-05T01:19:11.582Z cpu19:33418)PFrame: 3861: Dumping PFrame region with 197632 PDEs

2015-11-05T01:20:26.785Z cpu19:33418)WARNING: Compress: 167: CompressCheckOutput failed: Limit exceeded [avail_in: 4096, avail_out: 0]

2015-11-05T01:20:26.785Z cpu19:33418)WARNING: Dump: 2676: Compression failure while dumping range 'PDirFrame' 4096 bytes: 0xbad0006: Limit exceeded

2015-11-05T01:20:26.785Z cpu19:33418)WARNING: Dump: 3087: Failed to flush compressed data with Limit exceeded
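For anyone who wants to sift through a long excerpt like the one above, here is a minimal Python sketch that splits each timestamped line into fields and measures how long the dump ran. It assumes the layout seen in this excerpt (UTC timestamp, `cpuN:`, a number that is presumably the world ID, then the message); the names `parse_line` and `elapsed_seconds` are made up for illustration and are not part of any VMware tool.

```python
import re
from datetime import datetime

# Layout assumed from the excerpt: "<timestamp>Z cpu<N>:<id>)<message>"
LINE_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z) "
    r"cpu(?P<cpu>\d+):(?P<world>\d+)\)(?P<msg>.*)"
)

def parse_line(line):
    """Return the fields of a timestamped line, or None for other lines."""
    m = LINE_RE.search(line)  # search() skips any leading terminal escape residue
    if m is None:
        return None
    return {
        "time": datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%fZ"),
        "cpu": int(m.group("cpu")),
        "world": int(m.group("world")),  # assumed to be the world ID
        "msg": m.group("msg").strip(),
    }

def elapsed_seconds(lines):
    """Seconds between the earliest and latest timestamped lines."""
    times = [p["time"] for p in map(parse_line, lines) if p]
    return (max(times) - min(times)).total_seconds() if times else 0.0

# Three lines copied from the excerpt above.
sample = [
    "2015-11-05T00:11:49.003Z cpu1:5097074)lsi_mr3: getSpanDumpProgress: "
    "Faulting world regs Faulting world regs (01/13)",
    "DumpProgress: PCPU PCPU (05/13)",
    "2015-11-05T01:20:26.785Z cpu19:33418)WARNING: Dump: 3087: "
    "Failed to flush compressed data with Limit exceeded",
]

if __name__ == "__main__":
    print(f"dump ran for about {elapsed_seconds(sample) / 60:.1f} minutes")
```

Lines without a timestamp, such as the bare `DumpProgress:` markers, are skipped rather than guessed at.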

RS_1 · ‎11-11-2015

Hi Justin, I got the same issue on the same hardware (unluckily, it occurred on the first day of VMworld 2015 EU...)

VMware has NOT been able to solve this, beyond telling me to ask Dell for the 6.607.08.00-2vmw.600.0.0.2817019 version of the lsi_mr3 driver, and Dell is still trying to figure it out...

Since the PSOD did not occur again, VMware closed the SR unresolved... Before that I had a lot of various issues involving the storage stack.

Same as you, I got a lot of vob.scsi.scsipath.por events before the crash.

Of course I was fully up to date at the time of the PSOD (backplane, driver, firmware). Since then a new firmware for the SSDs has come out, but I noticed we have the exact same HDDs with the same firmware. That could be the link.

Anyway, I tried the settings from the following post and have had no more problems since then: Dell R730/H730 P Raid Controller Firmware issues with All Flash VSAN – itsintehcloud

cdekter · ‎11-11-2015

RS_1, thanks for contributing your experiences to the thread. The engineering team is very much aware of this issue and the fact that a fix exists in a later version of the lsi_mr3 driver than what is currently on the VSAN HCL. We are working towards updating the HCL with a newer driver that will resolve the PSOD. In the meantime, you can reduce the risk of this issue (and others) occurring by applying the remedies from the following KB to your R730 servers:

VMware KB: Using a Dell Perc H730 controller in an ESXi 5.5 or ESXi 6.0 host displays IO failures or...

The PSOD is occurring in the task management code of the driver, so by updating all relevant firmware and drivers which improve the responsiveness and stability of the controller and the attached drives, you can reduce the risk of the PSOD occurring. This is of course not guaranteed, but a permanent fix is on the way.

RS_1 · ‎11-11-2015

Thanks cdekter, but before the KB was published I was on the latest version of all the components (this was validated by GSS), and nobody told me that the engineering team was aware of this issue; far from it, actually. I was only told about this famous version of the driver that nobody can find. Now there is a newer driver (Avago_bootbank_lsi-mr3_6.609.08.00-1OEM.600.0.0.2768847), but again nobody could tell me whether it is safe to use on a VSAN cluster or not. You can check SR#15778351510 if you'd like.
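To keep the two driver builds mentioned in this thread straight, here is a small Python sketch that pulls the dotted numeric part out of each version string and compares them. It assumes that this leading part (for example 6.609.08.00) orders releases; `driver_version` is a hypothetical helper, not a VMware or Dell tool, and a higher number says nothing about whether a build is on the VSAN HCL.

```python
import re

def driver_version(vib_string):
    """Return the first dotted numeric run as a tuple of ints.

    Assumption: that run is the driver release, and whatever follows
    the dash is build/packaging information.
    """
    m = re.search(r"\d+(?:\.\d+)+", vib_string)
    if m is None:
        raise ValueError(f"no dotted version found in {vib_string!r}")
    return tuple(int(part) for part in m.group(0).split("."))

# Version strings exactly as quoted in the posts above.
suggested_by_support = "6.607.08.00-2vmw.600.0.0.2817019"
newer_avago_build = "Avago_bootbank_lsi-mr3_6.609.08.00-1OEM.600.0.0.2768847"

if __name__ == "__main__":
    a = driver_version(suggested_by_support)
    b = driver_version(newer_avago_build)
    print(a, b, "newer build is later:", b > a)
```

Tuples compare element by element, so 6.609 sorts after 6.607 without any string tricks.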

cdekter · ‎11-13-2015

Hi RS_1, I reviewed your support case - unfortunately you were given inadequately vetted advice regarding the issue. The support person was referring to an old bug report for a similar, but not identical, crash. The investigation into the crash you encountered is still ongoing with the help of the hardware vendor, as this crash is occurring in the controller driver code. In the meantime, the standard advice applies, which is to use the driver and firmware versions that are published on the VSAN HCL.

RS_1 · ‎11-21-2015

Thanks cdekter, and what is the non-standard advice when you have the drivers and firmware listed on the HCL but still get a PSOD?

joergriether · ‎11-22-2015

I don't think this is important for YOUR case, but just to be 100% sure you got this: it was reported to us by Dell on the 5th of November via mail. I quote:

Issue, Root Cause:

Toshiba M2, M2+ and M2R SAS Solid State Drives (SSDs) will experience significant performance throttling after 4320 power on hours (~180 days).

A defect in the SSD endurance management firmware activates performance throttling at 4320 power on hours. Following enablement, performance is reduced 5% every additional 24 power on hours.

Issue occurrence is 100% after 4320 power on hours

Affected ship date range: RSL (~mid 2014) – June 3, 2015

KCS SLN297281 published Oct 17, 2014 / Delta tag messaging in place

Technical Fix:

update SSD FW to A(3/4/5)AE or later
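As a rough worked example of the numbers in the quoted advisory, the Python sketch below counts how many 24-hour throttling steps a drive has accumulated past 4320 power-on hours. The advisory does not say whether each 5% reduction is taken from the original performance or compounds on the previous step; the estimate here assumes the simpler linear reading, and the function names are made up for illustration.

```python
THRESHOLD_HOURS = 4320   # throttling activates here (4320 / 24 = 180 days)
STEP_HOURS = 24          # one further reduction per additional 24 power-on hours
STEP_LOSS = 0.05         # 5% per step, per the advisory

def throttle_steps(power_on_hours):
    """Number of completed 24-hour steps past the threshold (whole hours in)."""
    if power_on_hours <= THRESHOLD_HOURS:
        return 0
    return (power_on_hours - THRESHOLD_HOURS) // STEP_HOURS

def estimated_performance(power_on_hours):
    """Fraction of original performance remaining.

    Assumption: each step removes 5% of the ORIGINAL performance
    (linear), floored at zero. The advisory does not confirm this.
    """
    return max(0.0, 1.0 - STEP_LOSS * throttle_steps(power_on_hours))

if __name__ == "__main__":
    for hours in (4000, 4320, 4344, 4560):
        steps = throttle_steps(hours)
        print(f"{hours} h: {steps} steps, ~{estimated_performance(hours):.0%} remaining")
```

Under that linear assumption, a drive ten days past the threshold would be down to about half its original performance, which is why the firmware update matters even though it is unrelated to the PSOD.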

RS_1 · ‎11-23-2015

Thanks a lot joe, I already encountered the issue related to this firmware problem; all the SSDs have been on A3E since then. BTW, it does not cause a PSOD.


VSAN Node Crashed - R730xd - PF Exception 14 in world 33571: Cmpl-vmhba0- IP 0x41802c3abd44 addr 0x50