In case anyone wants some more details on the node's firmware versions.
I've also taken care of the LSI Firmware and Driver issue, mentioned in the KB below, on 9/10 - after it had crashed on 9/9.
Please file an SR with VMware as well. If you have all the latest firmware and drivers, this should not be happening. I have already pointed VMware people to your thread; please post your SR number here too, as that makes it easier for VMware people if they want to help.
Dell Support sent out a backplane and PERC cable. The replacement backplane was faulty. Waiting on another backplane.
Also opened the VMware case for good measure. SR 15793487011
I know Dell has been advising specifically on their Intel re-branded drives (3610, 3710) to update the firmware on them. Has this happened yet?
I continued to get voltage errors on the second backplane Dell sent. The system wouldn't boot at all with the newer backplane cards.
They believe the power cable may have been the issue. Just in case, I got the kitchen sink full of parts - including the motherboard.
Well, it ended up being the motherboard causing the voltage error. It probably also caused the drives to reset. I suspect that the earlier revision backplane allowed for a higher tolerance of incorrect voltage.
I'll let you know if this resurfaces.
I haven't heard back on the VMware SR yet. I have uploaded the logs.
Thank you all!
I know another VSAN R730 customer for whom Dell did the same set of replacements. Curious if there was a bad batch or something.
I've also heard anecdotal references from large Dell resellers about power when using the mid-plane with more power-hungry drives.
Thanks for reporting this issue. I have discussed this issue with the VSAN hardware certification team. There is an updated version of the driver that the vendor says will handle this type of failure more gracefully, by preventing a PSOD. We have put this driver high on the priority list for certification testing and the VSAN HCL will be updated as soon as this is complete.
Another node in the VSAN just crashed with another PSOD referencing PF Exception 14 ... vmhba0.
On the phone with Dell Support and pulling logs.
Pulled this out of the vmkernel-zdump fragments:
2015-11-05T00:11:49.003Z cpu1:5097074)lsi_mr3: getSpanDumpProgress: Faulting world regs Faulting world regs (01/13)
DumpProgress: Vmm code/data Vmm code/data (02/13)
DumpProgress: Vmk code/rodata/stack Vmk code/rodata/stack (03/13)
DumpProgress: Vmk data/heap Vmk data/heap (04/13)
DumpProgress: PCPU PCPU (05/13)
2015-11-05T00:14:04.079Z cpu19:33418)Dump: 3571: Dumped 2 pages of recentMappings
DumpProgress: World-specific data World-specific data (06/13)
DumpProgress: Xmap Xmap (07/13)
2015-11-05T00:14:41.782Z cpu19:33418)XMap: 1565: Dumped 104979 pages
DumpProgress: VASpace VASpace (08/13)
2015-11-05T00:14:41.824Z cpu19:33418)HeapMgr: 902: Dumping HeapMgr region with 49042 PDEs.
2015-11-05T00:22:31.398Z cpu19:33418)VAArray: 600: Dumping VAArray region
2015-11-05T00:22:31.400Z cpu19:33418)Timer: 1594: Dumping Timer region with 79 PDEs.
2015-11-05T00:22:31.610Z cpu19:33418)FastSlab: 1169: Dumping FastSlab region with 32768 PDEs.
2015-11-05T01:08:28.689Z cpu19:33418)MPage: 734: Dumping MPage region
2015-11-05T01:18:19.213Z cpu19:33418)VAArray: 600: Dumping VAArray region
2015-11-05T01:18:19.455Z cpu19:33418)PShare: 3133: Dumping pshareChains region with 16 PDEs.
2015-11-05T01:18:51.690Z cpu19:33418)VASpace: 1101: VASpace "WorldStore" [439100000 - 439500001] had no registered dump handler.
2015-11-05T01:18:51.690Z cpu19:33418)VASpace: 1101: VASpace "memNodeLookup" [439540000 - 439540001] had no registered dump handler.
2015-11-05T01:18:51.690Z cpu19:33418)VASpace: 1101: VASpace "vmkStats" [439580000 - 439d80000] had no registered dump handler.
2015-11-05T01:18:51.690Z cpu19:33418)VASpace: 1101: VASpace "pageRetireBitmap" [439d80000 - 439d80c10] had no registered dump handler.
2015-11-05T01:18:51.690Z cpu19:33418)VASpace: 1101: VASpace "pageRetireBitmapIdx" [439dc0000 - 439dc0001] had no registered dump handler.
2015-11-05T01:18:51.690Z cpu19:33418)VASpace: 1101: VASpace "llswap" [43ae00000 - 43af21000] had no registered dump handler.
2015-11-05T01:18:51.690Z cpu19:33418)VASpace: 1101: VASpace "LPageStatus" [43af40000 - 43af40182] had no registered dump handler.
2015-11-05T01:18:51.690Z cpu19:33418)VASpace: 1101: VASpace "LSOMVaSpace" [43af80000 - 43ba20000] had no registered dump handler.
2015-11-05T01:18:51.690Z cpu19:33418)Migrate: 359: Dumping Migrate region with 49152 PDEs
2015-11-05T01:19:11.552Z cpu19:33418)VASpace: 1101: VASpace "XVMotion" [43d240000 - 43d260000] had no registered dump handler.
DumpProgress: PFrame PFrame (09/13)
2015-11-05T01:19:11.582Z cpu19:33418)PFrame: 3861: Dumping PFrame region with 197632 PDEs
2015-11-05T01:20:26.785Z cpu19:33418)WARNING: Compress: 167: CompressCheckOutput failed: Limit exceeded [avail_in: 4096, avail_out: 0]
2015-11-05T01:20:26.785Z cpu19:33418)WARNING: Dump: 2676: Compression failure while dumping range 'PDirFrame' 4096 bytes: 0xbad0006: Limit exceeded
2015-11-05T01:20:26.785Z cpu19:33418)WARNING: Dump: 3087: Failed to flush compressed data with Limit exceeded
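For anyone sifting through a similar dump, here is a minimal Python sketch for pulling the WARNING entries out of a pasted vmkernel excerpt like the one above. It is only an illustration: the line pattern is inferred from the lines in this post, not from any VMware specification, and the function names are made up for the example. It also strips terminal highlight codes such as `[7m` and `[0m`, which tend to ride along when the text is captured from a console.

```python
import re

# Terminal highlight codes such as ESC[7m / ESC[0m; the ESC byte is optional
# because it is often lost when console output is pasted into a forum.
ANSI_SGR = re.compile(r"\x1b?\[[0-9;]*m")

# Pattern inferred from the excerpt in this thread (an assumption, not a spec):
#   <timestamp>Z cpu<N>:<world>)[WARNING: ]<Module>: <line>: <message>
LINE = re.compile(
    r"^(?P<ts>\S+Z) cpu(?P<cpu>\d+):(?P<world>\d+)\)"
    r"(?P<warn>WARNING: )?(?P<module>\w+): (?P<line>\d+): (?P<msg>.*)$"
)


def parse_vmkernel(text):
    """Return one dict per line that matches the inferred pattern."""
    entries = []
    for raw in text.splitlines():
        clean = ANSI_SGR.sub("", raw).strip()
        match = LINE.match(clean)
        if match is None:
            # "DumpProgress: ..." lines and anything else unrecognised are skipped.
            continue
        entries.append({
            "timestamp": match.group("ts"),
            "cpu": int(match.group("cpu")),
            "world": int(match.group("world")),
            "warning": match.group("warn") is not None,
            "module": match.group("module"),
            "line": int(match.group("line")),
            "message": match.group("msg"),
        })
    return entries


def warnings_only(text):
    """Keep only the entries flagged as WARNING."""
    return [entry for entry in parse_vmkernel(text) if entry["warning"]]
```

Calling `warnings_only(open("fragment.txt").read())` on a saved excerpt (the file name is just a placeholder) gives a list of the warning entries with their module and message, which is easier to paste into a support case than the full dump.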
Hi Justin, I got the same issue on the same hardware (unluckily, it occurred on the first day of VMworld15 EU...)
VMware has NOT been able to solve this, besides telling me to go ask Dell for the 6.607.08.00-2vmw.600.0.0.2817019 version of the lsi_mr3 driver, and Dell is still trying to figure it out...
Since the PSOD did not occur again, VMware closed the SR unsolved... Before that, I had a lot of various issues involving the storage stack.
Same as you, I got a lot of vob.scsi.scsipath.por events before the crash.
Of course I was fully up to date at the time of the PSOD (backplane, driver, firmware). Since then a new firmware for the SSDs has come out, but I noticed we have the exact same HDDs with the same firmware. That could be the link.
Anyway, I tried the settings from this post and have had no more problems since then: Dell R730/H730 P Raid Controller Firmware issues with All Flash VSAN – itsintehcloud
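Since vob.scsi.scsipath.por events showed up ahead of the crash on more than one system here, it can help to see how often they fire over time. Below is a rough Python sketch that counts them per hour. It assumes each log line starts with an ISO-8601 timestamp like the vmkernel lines earlier in this thread; the function name is hypothetical and the exact layout of the event lines on your host may differ.

```python
from collections import Counter

# Event name as quoted in this thread.
EVENT = "vob.scsi.scsipath.por"


def por_events_per_hour(lines, event=EVENT):
    """Count lines mentioning `event`, bucketed by the hour of their timestamp.

    Assumption: every line begins with a timestamp such as
    2015-11-05T00:11:49.003Z, so the first 13 characters identify the hour.
    """
    counts = Counter()
    for line in lines:
        if event not in line:
            continue
        stamp = line.split(" ", 1)[0]
        counts[stamp[:13]] += 1  # e.g. "2015-11-05T00"
    return counts
```

Passing an open log file to `por_events_per_hour` returns a `Counter` keyed by hour; a sudden jump in the count for a given hour is worth mentioning in the SR alongside the PSOD time.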
RS_1, thanks for contributing your experiences to the thread. The engineering team is very much aware of this issue and the fact that a fix exists in a later version of the lsi_mr3 driver than what is currently on the VSAN HCL. We are working towards updating the HCL with a newer driver that will resolve the PSOD. In the meantime, you can reduce the risk of this issue (and others) occurring by applying the remedies from the following KB to your R730 servers:
The PSOD is occurring in the task management code of the driver, so by updating all relevant firmware and drivers which improve the responsiveness and stability of the controller and the attached drives, you can reduce the risk of the PSOD occurring. This is of course not guaranteed, but a permanent fix is on the way.