Anyone had a similar issue?
Host has a PERC H730p controller. Looks like the disks were resetting prior to the crash according to the system's lifecycle controller.
On call with Dell Support. Planning on a VMware support call too.
Thanks in advance!
Justin
I have had the same issue occur, with a PF Exception 14 PSOD on the same hardware. Case open with both Dell and VMware. Dell closed the case saying they won't support me because I'm using 3rd party drives, even though the drives are on the VSAN HCL. This crashing doesn't occur often (I probably see one node with the issue every 35 days or so), but it still sucks.
Relevant errors from vmkernel:
2015-11-25T06:50:02.707Z cpu32:33230)lsi_mr3: mfi_TaskMgmt:259: Processing taskMgmt abort for device: vmhba0:C0:T12:L0
2015-11-25T06:50:02.707Z cpu32:33230)lsi_mr3: mfi_TaskMgmt:267: ABORT
2015-11-25T06:50:03.705Z cpu20:32875)lsi_mr3: mfi_TaskMgmt:259: Processing taskMgmt virt reset for device: vmhba0:C0:T12:L0
2015-11-25T06:50:03.705Z cpu20:32875)lsi_mr3: mfi_TaskMgmt:263: VIRT_RESET cmd # -1561398661
2015-11-25T06:50:03.705Z cpu20:32875)lsi_mr3: mfi_TaskMgmt:267: ABORT
2015-11-25T06:50:03.709Z cpu32:33230)lsi_mr3: fusionWaitForOutstanding:2573: megasas: [ 0]waiting for 55 commands to complete
2015-11-25T06:50:05.568Z cpu13:33507)NMP: nmp_ThrottleLogForDevice:3178: Cmd 0x28 (0x439fa0228800, 0) to dev "naa.50000c0f023f6cd8" on path "vmhba0:C0:T11:L0" Failed: H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:EVAL
2015-11-25T06:50:05.568Z cpu13:33507)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.50000c0f023f6cd8" state in doubt; requested fast path state update...
2015-11-25T06:50:05.568Z cpu13:33507)ScsiDeviceIO: 2607: Cmd(0x439fa0228800) 0x28, CmdSN 0xc096 from world 0 to dev "naa.50000c0f023f6cd8" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
The vmhba0:C0:T12:L0 device happens to be a Dell-branded Seagate 2.5" 7200 RPM SAS drive, and it seems to pop up first across all my crashes. Dell just points the finger at 3rd party non-Dell drives being the issue though, which I think is just BS.
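For anyone comparing crashes, a rough way to check which target shows up first is to tally the lsi_mr3 task-management lines per device. This is only a sketch: the regex is written against the line format pasted above, and the function name `tally_task_mgmt` and the sample list are illustrative assumptions, not a VMware or Dell tool.

```python
import re
from collections import Counter

# Matches lines like:
#   ...)lsi_mr3: mfi_TaskMgmt:259: Processing taskMgmt abort for device: vmhba0:C0:T12:L0
# The pattern is an assumption based on the log excerpt in this thread.
TASK_MGMT = re.compile(
    r"lsi_mr3: mfi_TaskMgmt:\d+: Processing taskMgmt (?P<kind>.+?) "
    r"for device: (?P<dev>vmhba\d+:C\d+:T\d+:L\d+)"
)

def tally_task_mgmt(lines):
    """Count (device, kind) pairs for lsi_mr3 task-management events."""
    counts = Counter()
    for line in lines:
        m = TASK_MGMT.search(line)
        if m:
            counts[(m.group("dev"), m.group("kind"))] += 1
    return counts

# Three lines copied from the vmkernel excerpt above.
sample = [
    "2015-11-25T06:50:02.707Z cpu32:33230)lsi_mr3: mfi_TaskMgmt:259: "
    "Processing taskMgmt abort for device: vmhba0:C0:T12:L0",
    "2015-11-25T06:50:03.705Z cpu20:32875)lsi_mr3: mfi_TaskMgmt:259: "
    "Processing taskMgmt virt reset for device: vmhba0:C0:T12:L0",
    "2015-11-25T06:50:03.709Z cpu32:33230)lsi_mr3: fusionWaitForOutstanding:2573: "
    "megasas: [ 0]waiting for 55 commands to complete",
]

counts = tally_task_mgmt(sample)
for (dev, kind), n in sorted(counts.items()):
    print(dev, kind, n)
```

Feeding it a whole vmkernel.log (for example `tally_task_mgmt(open(path))`, with `path` pointing at wherever your logs land) should show whether one target dominates the aborts and resets.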
Hi elerium,
Let me get in touch with our Dell engineering contacts regarding this. I believe I'm familiar with the case but just to validate, could you please post the VMware case number?
The logs are not typical for what I have seen in other cases encountering this PSOD. Do you have any VMFS volumes on any drives attached to the H730 controller? If so please refer to the following KB: VMware KB: Deployment guidelines for running VMware Virtual SAN and VMware vSphere VMFS datastores o...
Here is another KB regarding the known drive failure issue with the H730 controller: VMware KB: Avoiding a known drive failure issue when Dell PERC H730 controller is used with VMware V...
If you implement the recommendations from these two KB articles it may help mitigate the PSOD for the time being.
Case 15815551311. I have followed KB 2135494 for drivers/firmware already.
I haven't seen KB 2136374 regarding VMFS + VSAN on the same controller but it does describe the setup we are using. Our setup is HBA using 1 drive for VMFS for coredump/scratch partition and remaining 14 drives for VSAN.
Edit: removed the VMFS coredump/scratch drive and am using network syslog and network coredump. Given the error patterns where the VMFS coredump/scratch drive would reset followed by raid crashes, I'm pretty hopeful that this fixes my PSOD and drive dropout problems on the H730 controller.
Thank you again cdekter.
Hi elerium,
Just saw your edit - your assessment based on the log messages that the VMFS volume was the trigger seems likely to be accurate to me. We are actively working with Dell on resolving the issues that manifest when using VMFS alongside VSAN on this controller, and hope to have a fix in the near future.
Hi RS_1, same for me. In the end VMware Support told me my PSOD occurs because of a problem with lsi_mr3 version 6.606.12.00-1OEM.
The problem should be fixed with version 6.608.11.00-1OEM, but that driver is still not on the HCL. Hopefully the vSAN HCL will be updated soon.
Experienced my first crash that looks like this issue today - adding here for reference.
That can't be right, can it? From what I can see, that newer fixed driver has been available for several months.
We're on our second PSOD and have had three all-disk-group failures within the last month. We also have the lsi_mr3 (6.606.12.00-1OEM.600.0.0.2159203) driver, along with the recommended firmware updates and drivers. I'm blasting ESXi600-201601001 and BIOS update 1.5.4 to all my nodes.
The current VSAN recommended driver is actually pretty old. There are quite a few newer drivers since then, but none have been okayed for the VSAN HCL. In addition, from what I can see, some of these newer drivers are designed for use with firmware that isn't available from Dell but has been published by LSI (24.9.0-0022).
Current VSAN HCL H730 driver:
6.606.12.00-1OEM (05-19-2015) https://my.vmware.com/web/vmware/details?downloadGroup=DT-ESXI60-LSI-LSI-MR3-66061200-1OEM&productId...
Other LSI drivers (same productId), but not qualified for VSAN or may also require firmwares that aren't available from Dell:
6.608.09.00-1OEM (06-24-2015) https://my.vmware.com/group/vmware/details?productId=491&downloadGroup=DT-ESXI60-LSI-LSI-MR3-6608090...
6.609.08.00-1OEM (09-30-2015) https://my.vmware.com/group/vmware/details?downloadGroup=DT-ESXI60-LSI-LSI-MR3-66090800-1OEM&product...
6.608.11.00-1OEM (10-23-2015) https://my.vmware.com/group/vmware/details?downloadGroup=DT-ESXI60-LSI-LSI-MR3-66081100-1OEM&product...
6.608.12.00-1OEM (11-27-2015) https://my.vmware.com/group/vmware/details?productId=491&downloadGroup=DT-ESX60-LSI-LSI-MR3-66081200...
6.610.17.00-1OEM (01-01-2016) https://my.vmware.com/web/vmware/details?downloadGroup=DT-ESXI60-LSI-LSI-MR3-66101700-1OEM&productId...
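One thing worth noticing in the list above is that release date and version number don't sort the same way, so "newer" depends on which one you mean. A small sketch that shows this, assuming the dates are MM-DD-YYYY and that only the dotted part before the `-1OEM` suffix matters for ordering (both are my assumptions, not anything VMware documents here):

```python
from datetime import datetime

# (version, release date) pairs copied from the list above.
drivers = [
    ("6.606.12.00-1OEM", "05-19-2015"),
    ("6.608.09.00-1OEM", "06-24-2015"),
    ("6.609.08.00-1OEM", "09-30-2015"),
    ("6.608.11.00-1OEM", "10-23-2015"),
    ("6.608.12.00-1OEM", "11-27-2015"),
    ("6.610.17.00-1OEM", "01-01-2016"),
]

def version_key(version):
    """Turn '6.608.11.00-1OEM' into (6, 608, 11, 0) for numeric comparison."""
    numeric = version.split("-")[0]
    return tuple(int(part) for part in numeric.split("."))

by_version = sorted(drivers, key=lambda d: version_key(d[0]))
by_date = sorted(drivers, key=lambda d: datetime.strptime(d[1], "%m-%d-%Y"))

for version, released in by_version:
    print(version, released)

# True when the two orderings disagree, as they do for this list.
out_of_order = [v for v, _ in by_version] != [v for v, _ in by_date]
print("version order differs from release order:", out_of_order)
```

Here 6.609.08.00 was released before 6.608.11.00 and 6.608.12.00, which suggests the 6.608.x line kept getting updates in parallel.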
I do hope we get an update from VMware or Dell soon regarding these. When the problem manifests, I see hundreds of disk resets in a short timeframe from the iDRAC/lifecycle controller, followed by the RAID card becoming non-responsive, which ends up dropping the disk group. I haven't seen the PSOD since my last post, after following cdekter's advice about not having a VMFS on the same machine; however, I still experience dropping of disk groups. This is infrequent, but it still happens maybe every 35 days for me. For the last crash, a week ago, I wasn't able to capture logging info (I didn't have remote syslog configured correctly), so I don't think it will be too helpful to open a case without logs. The next crash I get, I'm reopening a case with VMware/Dell (I have 7 servers with 38 days of uptime now, so probably any day now).
Experienced one more PSoD after 24 days on one host. My VMware Ticket was closed with the suggestion to create an RSS Feed for the HCL Site http://www.vmware.com/resources/compatibility/detail.php?deviceCategory=vsanio&productid=34857&vcl=t...
But there is still no new driver 😉
I am using driver version 6.608.11.00-1OEM on one of my hosts. So far this host is stable, but its uptime is only 25 days so far.
Another crash - will open another case with Dell..
Had 2 nodes crash again (no PSOD), but the nodes eventually became non-responsive and the H730 RAID controller dropped all disk groups on the nodes. Reopened the case with VMware; will see what comes of that.
starts with lots of these (lines are repeated hundreds of times for each disk):
2016-02-04T13:32:58.958Z vsan-c02-n02 vmkernel: cpu8:33507)ScsiCore: 1609: Power-on Reset occurred on naa.50000c0f029bce50
a few hours after, eventually becomes:
2016-02-04T14:06:09.755Z vsan-c02-n02 vmkernel: cpu8:33507)WARNING: ScsiDeviceIO: 1243: Device naa.50000c0f029b1e58 performance has deteriorated. I/O latency increased from average value of 11894 microseconds to 304153 microseconds.
2016-02-04T14:06:32Z vsan-c02-n02 epd[35550]: 1ed1b156-13c3-9549-292a-ecf4bbe5d688 is stale: LSOM_OBJECT absent w/ fully published disk
2016-02-04T14:06:32Z vsan-c02-n02 epd[35550]: 1ed1b156-9262-9349-1eb7-ecf4bbe5d688 is stale: LSOM_OBJECT absent w/ fully published disk
2016-02-04T14:06:32Z vsan-c02-n02 epd[35550]: 1ed1b156-d5ec-8e49-bbd9-ecf4bbe5d688 is stale: LSOM_OBJECT absent w/ fully published disk
2016-02-04T14:06:32Z vsan-c02-n02 epd[35550]: 0cd1b156-2eb3-514d-21aa-ecf4bbe5d688 is stale: LSOM_OBJECT absent w/ fully published disk
2016-02-04T14:06:32Z vsan-c02-n02 epd[35550]: 0cd1b156-f9af-4f4d-04a8-ecf4bbe5d688 is stale: LSOM_OBJECT absent w/ fully published disk
2016-02-04T14:06:32Z vsan-c02-n02 epd[35550]: 0cd1b156-5392-4b4d-4498-ecf4bbe5d688 is stale: LSOM_OBJECT absent w/ fully published disk
2016-02-04T14:06:32Z vsan-c02-n02 epd[35550]: 01d1b156-3705-636b-7477-ecf4bbe5d688 is stale: LSOM_OBJECT absent w/ fully published disk
2016-02-04T14:06:32Z vsan-c02-n02 epd[35550]: 01d1b156-b364-5f6b-bad6-ecf4bbe5d688 is stale: LSOM_OBJECT absent w/ fully published disk
then this (errors repeat indefinitely until host is nonresponsive):
2016-02-04T14:08:20.258Z vsan-c02-n02 vmkernel: cpu5:33229)lsi_mr3: fusionWaitForOutstanding:2573: megasas: [ 5]waiting for 104 commands to complete
2016-02-04T14:11:15.446Z vsan-c02-n02 vmkernel: cpu18:33229)lsi_mr3: fusionWaitForOutstanding:2588: megaraid_sas: pending commands remain after waiting, will reset adapter.
2016-02-04T14:11:15.446Z vsan-c02-n02 vmkernel: cpu18:33229)WARNING: lsi_mr3: fusionReset:2641: megaraid_sas: resetting fusion adapter.
2016-02-04T14:11:25.398Z vsan-c02-n02 vmkernel: cpu17:33229)WARNING: lsi_mr3: megasas_transition_to_ready:1735: megasas: Waiting for FW to come to ready state
2016-02-04T14:11:32.052Z vsan-c02-n02 vmkernel: cpu5:33229)WARNING: lsi_mr3: megasas_transition_to_ready:1848: megasas: FW now in Ready state
2016-02-04T14:11:32.102Z vsan-c02-n02 vmkernel: cpu5:33229)WARNING: lsi_mr3: mfiDMAAlloc:205: mfi: failed to allocate 124 DMA buffer. Out of memory.
2016-02-04T14:11:32.102Z vsan-c02-n02 vmkernel: cpu5:33229)WARNING: lsi_mr3: mfiGetAdapterInfo:1207: Failed to alloc mem for controller info
2016-02-04T14:11:32.102Z vsan-c02-n02 vmkernel: cpu5:33229)WARNING: lsi_mr3: fusionReset:2800: Failed to get adapter info
2016-02-04T14:12:23.636Z vsan-c02-n02 vmkwarning: cpu4:4610841)WARNING: lsi_mr3: fusionReset:2759: megaraid_sas: fusionIocInit() failed!
2016-02-04T14:12:57.701Z vsan-c02-n02 vmkernel: cpu27:4610841)WARNING: lsi_mr3: fusionReset:2825: megaraid_sas: Reset failed, killing adapter.
2016-02-04T14:12:57.701Z vsan-c02-n02 vmkernel: cpu27:4610841)lsi_mr3: mfi_TaskMgmt:259: Processing taskMgmt abort for device: vmhba0:C0:T4:L0
2016-02-04T14:12:57.701Z vsan-c02-n02 vmkernel: cpu27:4610841)lsi_mr3: mfi_TaskMgmt:267: ABORT
2016-02-04T14:12:58.701Z vsan-c02-n02 vmkwarning: cpu2:4610842)WARNING: lsi_mr3: fusionWaitForOutstanding:2562: megasas: Found FW in FAULT state, will reset adapter.
2016-02-04T14:12:58.701Z vsan-c02-n02 vmkernel: cpu2:4610842)WARNING: lsi_mr3: fusionReset:2641: megaraid_sas: resetting fusion adapter.
2016-02-04T14:12:58.701Z vsan-c02-n02 vmkernel: cpu2:4610842)WARNING: lsi_mr3: fusionReset:2668: megaraid_sas: Reset not supported, killing adapter.
vSphere 6.0u1b, ESXi on 6.0u1a, H730 driver is 6.606.12.00-1OEM. Already following KBs 2136374, 2135494, 2109665, and all hardware and firmware is on the HCL.
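If you have remote syslog working, the progression above (power-on resets per disk, then adapter resets, then the adapter being killed) can be pulled out of the log with a short script. The following is a sketch only: the patterns are taken from the lines pasted in this post, the `summarize` name is hypothetical, and other driver versions may word these messages differently.

```python
import re
from collections import Counter

# Patterns are assumptions based on the syslog lines quoted in this post.
POWER_ON_RESET = re.compile(r"Power-on Reset occurred on (naa\.[0-9a-f]+)")
ADAPTER_MARKERS = ("resetting fusion adapter", "killing adapter")

def summarize(lines):
    """Return (resets per naa device, list of (timestamp, adapter event))."""
    resets = Counter()
    adapter_events = []
    for line in lines:
        m = POWER_ON_RESET.search(line)
        if m:
            resets[m.group(1)] += 1
            continue
        for marker in ADAPTER_MARKERS:
            if marker in line:
                # The timestamp is the first space-separated field of each line.
                timestamp = line.split(" ", 1)[0]
                adapter_events.append((timestamp, marker))
    return resets, adapter_events

# Four lines copied from the excerpts above.
sample = [
    "2016-02-04T13:32:58.958Z vsan-c02-n02 vmkernel: cpu8:33507)ScsiCore: 1609: "
    "Power-on Reset occurred on naa.50000c0f029bce50",
    "2016-02-04T14:11:15.446Z vsan-c02-n02 vmkernel: cpu18:33229)WARNING: lsi_mr3: "
    "fusionReset:2641: megaraid_sas: resetting fusion adapter.",
    "2016-02-04T14:12:57.701Z vsan-c02-n02 vmkernel: cpu27:4610841)WARNING: lsi_mr3: "
    "fusionReset:2825: megaraid_sas: Reset failed, killing adapter.",
    "2016-02-04T14:12:58.701Z vsan-c02-n02 vmkernel: cpu2:4610842)WARNING: lsi_mr3: "
    "fusionReset:2668: megaraid_sas: Reset not supported, killing adapter.",
]

resets, adapter_events = summarize(sample)
for device, count in resets.most_common():
    print(device, count)
for timestamp, event in adapter_events:
    print(timestamp, event)
```

A high reset count on every disk followed by a "killing adapter" entry would match the pattern described here; a single noisy disk would point more toward a drive problem.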
We had another all-disk-group failure on Saturday ourselves. One disk didn't come back online and Dell replaced it. I haven't heard back on possible reason(s) why; I uploaded logs to my case earlier in the week. I'll let you know what I hear.
Quick question: are you using the PERC 730, or the 730 Mini (the weird mezzanine card one)?
Also, you said Dell replaced it. What did they replace (the controller, the mid-plane, a full hero kit)?
We're using the Mini:
PERC H730 Mini (Embedded)
Dell has a beta driver available that resolves this particular PSOD/crash. I would suggest requesting that your Dell support case be escalated to Dell IPS as their escalated/high level support should be well aware of this fix.
Thanks, have requested on my support case..
You're welcome, Alain. Further, if you have a VMware support case and know the number, I can try to speed up the process of obtaining that driver for you. If you'd like to keep it private you can send it to me with a direct message on this board.
Do you have any SNMP/OpenManage logs showing what the temperature was like on the PERC Mini when things went south? I'm curious whether it spiked. Also, what are you seeing as the normal operating temperature range?
Thanks for the offer. Our VMware support is provided by Dell as part of our Pro Support, so not sure it'll help.