Anyone had a similar issue?
Host has a PERC H730P controller. Looks like the disks were resetting prior to the crash, according to the system's Lifecycle Controller.
On call with Dell Support. Planning on a VMware support call too.
Thanks in advance!
In case anyone wants some more details, here are the node's firmware versions.
I've also taken care of the LSI firmware and driver issue mentioned in the KB below on 9/10, after the host had crashed on 9/9.
Please file an SR with VMware as well. If you have all the latest firmware/drivers, then this should not be happening. I have already pointed the VMware people to your thread; please post your SR number here too, as that makes it easier for them if they want to help.
Dell Support sent out a backplane and PERC cable. The replacement backplane was faulty. Waiting on another backplane.
Also opened the VMware case for good measure. SR 15793487011
On the second backplane Dell sent, I continued to get voltage errors. The system wouldn't boot at all with the newer backplane cards.
They believe the power cable may have been the issue. Just in case, I got the kitchen sink full of parts, including the motherboard.
Well, it ended up being the motherboard causing the voltage error. It probably also caused the drives to reset. I suspect the earlier-revision backplane had a higher tolerance for the incorrect voltage.
I'll let you know if this resurfaces.
I haven't heard back on the VMware SR yet. I have uploaded the logs.
Thank you all!
I know another VSAN R730 customer for whom Dell did the same set of replacements. Curious whether there was a bad batch or something.
I've also heard anecdotal reports from large Dell resellers about power issues when using the mid-plane with more power-hungry drives.
Thanks for reporting this issue. I have discussed this issue with the VSAN hardware certification team. There is an updated version of the driver that the vendor says will handle this type of failure more gracefully, by preventing a PSOD. We have put this driver high on the priority list for certification testing and the VSAN HCL will be updated as soon as this is complete.
Pulled this out of the vmkernel-zdump fragments:
2015-11-05T00:11:49.003Z cpu1:5097074)lsi_mr3: getSpanDumpProgress: Faulting world regs (01/13)
DumpProgress: Vmm code/data (02/13)
DumpProgress: Vmk code/rodata/stack (03/13)
DumpProgress: Vmk data/heap (04/13)
DumpProgress: PCPU (05/13)
2015-11-05T00:14:04.079Z cpu19:33418)Dump: 3571: Dumped 2 pages of recentMappings
DumpProgress: World-specific data (06/13)
DumpProgress: Xmap (07/13)
2015-11-05T00:14:41.782Z cpu19:33418)XMap: 1565: Dumped 104979 pages
DumpProgress: VASpace (08/13)
2015-11-05T00:14:41.824Z cpu19:33418)HeapMgr: 902: Dumping HeapMgr region with 49042 PDEs.
2015-11-05T00:22:31.398Z cpu19:33418)VAArray: 600: Dumping VAArray region
2015-11-05T00:22:31.400Z cpu19:33418)Timer: 1594: Dumping Timer region with 79 PDEs.
2015-11-05T00:22:31.610Z cpu19:33418)FastSlab: 1169: Dumping FastSlab region with 32768 PDEs.
2015-11-05T01:08:28.689Z cpu19:33418)MPage: 734: Dumping MPage region
2015-11-05T01:18:19.213Z cpu19:33418)VAArray: 600: Dumping VAArray region
2015-11-05T01:18:19.455Z cpu19:33418)PShare: 3133: Dumping pshareChains region with 16 PDEs.
2015-11-05T01:18:51.690Z cpu19:33418)VASpace: 1101: VASpace "WorldStore" [439100000 - 439500001] had no registered dump handler.
2015-11-05T01:18:51.690Z cpu19:33418)VASpace: 1101: VASpace "memNodeLookup" [439540000 - 439540001] had no registered dump handler.
2015-11-05T01:18:51.690Z cpu19:33418)VASpace: 1101: VASpace "vmkStats" [439580000 - 439d80000] had no registered dump handler.
2015-11-05T01:18:51.690Z cpu19:33418)VASpace: 1101: VASpace "pageRetireBitmap" [439d80000 - 439d80c10] had no registered dump handler.
2015-11-05T01:18:51.690Z cpu19:33418)VASpace: 1101: VASpace "pageRetireBitmapIdx" [439dc0000 - 439dc0001] had no registered dump handler.
2015-11-05T01:18:51.690Z cpu19:33418)VASpace: 1101: VASpace "llswap" [43ae00000 - 43af21000] had no registered dump handler.
2015-11-05T01:18:51.690Z cpu19:33418)VASpace: 1101: VASpace "LPageStatus" [43af40000 - 43af40182] had no registered dump handler.
2015-11-05T01:18:51.690Z cpu19:33418)VASpace: 1101: VASpace "LSOMVaSpace" [43af80000 - 43ba20000] had no registered dump handler.
2015-11-05T01:18:51.690Z cpu19:33418)Migrate: 359: Dumping Migrate region with 49152 PDEs
2015-11-05T01:19:11.552Z cpu19:33418)VASpace: 1101: VASpace "XVMotion" [43d240000 - 43d260000] had no registered dump handler.
DumpProgress: PFrame (09/13)
2015-11-05T01:19:11.582Z cpu19:33418)PFrame: 3861: Dumping PFrame region with 197632 PDEs
2015-11-05T01:20:26.785Z cpu19:33418)WARNING: Compress: 167: CompressCheckOutput failed: Limit exceeded [avail_in: 4096, avail_out: 0]
2015-11-05T01:20:26.785Z cpu19:33418)WARNING: Dump: 2676: Compression failure while dumping range 'PDirFrame' 4096 bytes: 0xbad0006: Limit exceeded
2015-11-05T01:20:26.785Z cpu19:33418)WARNING: Dump: 3087: Failed to flush compressed data with Limit exceeded
Hi Justin, I got the same issue on the same hardware (unluckily, it occurred on the first day of VMworld 2015 EU...)
VMware has NOT been able to solve this - besides telling me to ask Dell for the 6.607.08.00-2vmw.600.0.0.2817019 version of the lsi_mr3 driver - and Dell is still trying to figure it out...
Since the PSOD did not occur again, VMware closed the SR unsolved... Before that, I had a lot of various issues involving the storage stack.
Same as you, I got a lot of vob.scsi.scsipath.por events before the crash.
Of course, I was fully up to date at the time of the PSOD (backplane, driver, firmware). Since then, a new firmware for the SSDs has come out, but I noticed we have the exact same HDDs with the same firmware. That could be the link.
Anyway, I tried those settings and have had no more problems since: Dell R730/H730 P Raid Controller Firmware issues with All Flash VSAN – itsintehcloud
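For anyone comparing their own hosts against this pattern: a quick way to gauge whether you are seeing the same storm of path resets before a crash is to count the vob.scsi.scsipath.por observations per device in your logs. A minimal sketch in Python (the sample log lines below are hypothetical illustrations; the exact message format in your vobd.log/vmkernel.log may differ):

```python
import re
from collections import Counter

# Hypothetical sample entries; on a real host you would read these from
# the ESXi vobd or vmkernel logs. The line format here is an assumption.
sample_log = """\
2015-11-04T23:58:01.120Z: ... vob.scsi.scsipath.por: Power-on Reset occurred on naa.5000c50083a1b2c3
2015-11-04T23:58:03.455Z: ... vob.scsi.scsipath.por: Power-on Reset occurred on naa.5000c50083a1b2c3
2015-11-05T00:01:17.892Z: ... vob.scsi.scsipath.por: Power-on Reset occurred on naa.5000c50083d4e5f6
"""

def count_path_resets(log_text):
    """Count vob.scsi.scsipath.por events per device (naa.* identifier)."""
    resets = Counter()
    for line in log_text.splitlines():
        if "vob.scsi.scsipath.por" in line:
            m = re.search(r"(naa\.[0-9a-f]+)", line)
            resets[m.group(1) if m else "unknown"] += 1
    return resets

print(count_path_resets(sample_log))
```

If one or two devices dominate the count in the minutes before the PSOD, that points at the same resetting-drives symptom described in this thread.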
RS_1, thanks for contributing your experiences to the thread. The engineering team is very much aware of this issue and of the fact that a fix exists in a later version of the lsi_mr3 driver than what is currently on the VSAN HCL. We are working towards updating the HCL with a newer driver that will resolve the PSOD. In the meantime, you can reduce the risk of this issue (and others) occurring by applying the remedies from the following KB to your R730 servers:
The PSOD is occurring in the task management code of the driver, so by updating all relevant firmware and drivers which improve the responsiveness and stability of the controller and the attached drives, you can reduce the risk of the PSOD occurring. This is of course not guaranteed, but a permanent fix is on the way.
Thanks cdekter, but before the KB was published I was already on the latest version of all the components (this was validated by GSS), and nobody told me that the engineering team was aware of this issue - far from it, actually. I was only told about this famous version of the driver that nobody can find. Now there is a newer driver (Avago_bootbank_lsi-mr3_6.609.08.00-1OEM.600.0.0.2768847), but again nobody could tell me whether it is safe to use on a VSAN cluster or not. You can check SR#15778351510 if you'd like.
Hi RS_1, I reviewed your support case - unfortunately, you were given inadequately vetted advice regarding the issue. The support person was referring to an old bug report for a similar, but not identical, crash. The investigation into the crash you encountered is still ongoing with the help of the hardware vendor, as this crash is occurring in the controller driver code. In the meantime, the standard advice applies: use the driver and firmware versions that are published on the VSAN HCL.
I don't think this is important for YOUR case, but just to be 100% sure you've seen it: this was reported to us by Dell on the 5th of November via mail. I quote:
Issue, Root Cause:
Toshiba M2, M2+ and M2R SAS Solid State Drives (SSDs) will experience significant performance throttling after 4320 power on hours (~180 days).
A defect in the SSD endurance management firmware activates performance throttling at 4320 power on hours. Following enablement, performance is reduced 5% every additional 24 power on hours.
Issue occurrence is 100% after 4320 power on hours
Affected ship date range: RSL (~mid 2014) – June 3, 2015
KCS SLN297281 published Oct 17, 2014 / Delta tag messaging in place
update SSD FW to A(3/4/5)AE or later
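Purely to illustrate the scale of the defect quoted above: the advisory says throttling starts at 4320 power-on hours and performance drops 5% per additional 24 power-on hours. Assuming the reduction is additive against full performance (the advisory does not say whether it is additive or compounding), the remaining performance can be sketched as:

```python
def throttle_factor(power_on_hours, onset=4320, step=24, cut=0.05):
    """Estimated fraction of full performance for an affected Toshiba SSD.

    Assumes an additive 5% reduction per 24 power-on hours past the
    4320-hour onset described in the Dell advisory; the real firmware
    behaviour may differ.
    """
    if power_on_hours <= onset:
        return 1.0
    steps = (power_on_hours - onset) // step
    return max(0.0, 1.0 - cut * steps)

print(throttle_factor(4320))  # -> 1.0  (~180 days: throttling not yet active)
print(throttle_factor(4560))  # -> 0.5  (10 days past onset: half performance)
```

Under that assumption, an affected drive would be fully throttled only about 20 days after the 4320-hour mark, which is why updating the SSD firmware promptly matters so much.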