VMware Cloud Community
justinbennett
Enthusiast

VSAN Node Crashed - R730xd - PF Exception 14 in world 33571: Cmpl-vmhba0- IP 0x41802c3abd44 addr 0x50

Anyone had a similar issue?

The host has a PERC H730P controller. It looks like the disks were resetting prior to the crash, according to the system's Lifecycle Controller.

On call with Dell Support. Planning on a VMware support call too.

Thanks in advance!

Justin

2015-11-02 22_00_41-- Remote Desktop Connection.png

vsan.png

2015-11-02 22_29_53-- Remote Desktop Connection.png

102 Replies
elerium
Hot Shot

I have had the same issue occur with a PF Exception 14 PSOD on the same hardware. Case open with both Dell and VMware. Dell closed the case saying they won't support me because I'm using 3rd-party drives, even though the drives are on the VSAN HCL. The crashing doesn't occur often, probably one node with the issue every 35 days or so, but it still sucks.

Relevant errors from vmkernel:

2015-11-25T06:50:02.707Z cpu32:33230)lsi_mr3: mfi_TaskMgmt:259: Processing taskMgmt abort for device: vmhba0:C0:T12:L0

2015-11-25T06:50:02.707Z cpu32:33230)lsi_mr3: mfi_TaskMgmt:267: ABORT

2015-11-25T06:50:03.705Z cpu20:32875)lsi_mr3: mfi_TaskMgmt:259: Processing taskMgmt virt reset for device: vmhba0:C0:T12:L0

2015-11-25T06:50:03.705Z cpu20:32875)lsi_mr3: mfi_TaskMgmt:263: VIRT_RESET cmd # -1561398661

2015-11-25T06:50:03.705Z cpu20:32875)lsi_mr3: mfi_TaskMgmt:267: ABORT

2015-11-25T06:50:03.709Z cpu32:33230)lsi_mr3: fusionWaitForOutstanding:2573: megasas: [ 0]waiting for 55 commands to complete

2015-11-25T06:50:05.568Z cpu13:33507)NMP: nmp_ThrottleLogForDevice:3178: Cmd 0x28 (0x439fa0228800, 0) to dev "naa.50000c0f023f6cd8" on path "vmhba0:C0:T11:L0" Failed: H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:EVAL

2015-11-25T06:50:05.568Z cpu13:33507)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.50000c0f023f6cd8" state in doubt; requested fast path state update...

2015-11-25T06:50:05.568Z cpu13:33507)ScsiDeviceIO: 2607: Cmd(0x439fa0228800) 0x28, CmdSN 0xc096 from world 0 to dev "naa.50000c0f023f6cd8" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

The vmhba0:C0:T12:L0 device happens to be a Dell-branded Seagate 2.5" 7200 RPM SAS drive, and it seems to pop up first across all my crashes. Dell just points the finger at 3rd-party non-Dell drives being the issue, which I think is just BS.
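
For anyone wanting to map a runtime path like vmhba0:C0:T12:L0 back to the physical disk the same way, something like this works from the ESXi shell (just a sketch - the NAA ID below is the one from my log, substitute your own):

# find the path entry for the target and note the backing device
esxcli storage core path list | grep -B 2 -A 10 "vmhba0:C0:T12:L0"

# then pull the vendor/model/revision for that device
esxcli storage core device list -d naa.50000c0f023f6cd8 | grep -iE "Display Name|Vendor|Model|Revision"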

cdekter
VMware Employee

Hi elerium,

Let me get in touch with our Dell engineering contacts regarding this. I believe I'm familiar with the case but just to validate, could you please post the VMware case number?

The logs are not typical for what I have seen in other cases encountering this PSOD. Do you have any VMFS volumes on any drives attached to the H730 controller? If so please refer to the following KB: VMware KB: Deployment guidelines for running VMware Virtual SAN and VMware vSphere VMFS datastores o...

Here is another KB regarding the known drive failure issue with the H730 controller: VMware KB: Avoiding a known drive failure issue when Dell PERC H730 controller is used with VMware V...

If you implement the recommendations from these two KB articles it may help mitigate the PSOD for the time being.
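
As a quick sanity check for the first KB, you can confirm from the ESXi shell whether any VMFS extents live on devices behind the same controller as your VSAN disks (a rough sketch - vmhba0 and the naa ID are placeholders):

# list every VMFS extent and the device backing it
esxcli storage vmfs extent list

# for each backing device from the list above, check which adapter its paths use
esxcli storage core path list -d naa.xxxxxxxxxxxxxxxx | grep "Adapter:"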

elerium
Hot Shot

Case 15815551311. I have already followed KB 2135494 for drivers/firmware.

I hadn't seen KB 2136374 regarding VMFS + VSAN on the same controller, but it does describe the setup we are using: the controller in HBA mode with 1 drive carrying a VMFS coredump/scratch partition and the remaining 14 drives used for VSAN.

Edit: removed the VMFS coredump/scratch drive and am using network syslog and network coredump. Given the error patterns where the VMFS coredump/scratch drive would reset followed by raid crashes, I'm pretty hopeful that this fixes my PSOD and drive dropout problems on the H730 controller.
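
For anyone wanting to do the same, the network coredump and remote syslog setup is only a handful of esxcli calls (a sketch of what I mean - the vmk0 interface, collector IP and syslog host below are placeholders for your own environment):

# send kernel dumps to a netdump collector instead of a local VMFS diagnostic partition
esxcli system coredump network set --interface-name vmk0 --server-ipv4 192.168.1.50 --server-port 6500
esxcli system coredump network set --enable true
esxcli system coredump network check

# ship logs to a remote syslog host and open the firewall for it
esxcli system syslog config set --loghost=udp://192.168.1.51:514
esxcli system syslog reload
esxcli network firewall ruleset set --ruleset-id=syslog --enabled=true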

Thank you again cdekter.

cdekter
VMware Employee

Hi elerium,

Just saw your edit - your assessment based on the log messages that the VMFS volume was the trigger seems likely to be accurate to me. We are actively working with Dell on resolving the issues that manifest when using VMFS alongside VSAN on this controller, and hope to have a fix in the near future.

haftic27
Contributor

Hi RS_1, same here. In the end VMware Support told me my PSOD occurs because of a problem with lsi_mr3 version 6.606.12.00-1OEM.

The problem should be fixed with version 6.608.11.00-1OEM, but that driver is still not on the HCL. Hopefully the vSAN HCL will be updated soon.

alainrussell
Enthusiast

Experienced my first crash that looks like this issue today - adding here for reference.

PSOD.png

Bleeder
Hot Shot

That can't be right, can it?  From what I can see, that newer fixed driver has been available for several months.

justinbennett
Enthusiast

We're on our second PSOD and have had three all-disk-group failures within the last month. We also have the lsi_mr3 (6.606.12.00-1OEM.600.0.0.2159203) driver, along with the recommended firmware updates and drivers. I'm blasting ESXi600-201601001 and BIOS update 1.5.4 out to all my nodes.
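
In case it's useful to anyone doing the same rollout, per host the patch apply is roughly (a sketch - the datastore path to the bundle is whatever you uploaded it to, and the VSAN maintenance mode option is the one I'd expect to use):

# evacuate the host first; ensureObjectAccessibility keeps VSAN objects accessible during the reboot
esxcli system maintenanceMode set --enable true --vsanmode ensureObjectAccessibility

# apply the offline patch bundle, then reboot
esxcli software vib update -d /vmfs/volumes/datastore1/ESXi600-201601001.zip
reboot

# once the host is back and VSAN resync is done
esxcli system maintenanceMode set --enable false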

2016-01-25 07_56_03-.png

elerium
Hot Shot

The current VSAN-recommended driver is actually pretty old; there are quite a few newer drivers since then, but none have been okayed for the VSAN HCL. In addition, from what I can see, some of these newer drivers are designed for use with firmware that isn't available from Dell (but has been published by LSI), e.g. 24.9.0-0022. There's a quick version-check sketch after the list below.

Current VSAN HCL H730 driver:

6.606.12.00-1OEM (05-19-2015) https://my.vmware.com/web/vmware/details?downloadGroup=DT-ESXI60-LSI-LSI-MR3-66061200-1OEM&productId...

Other LSI drivers (same productId) that are not qualified for VSAN, or that may require firmware that isn't available from Dell:

6.608.09.00-1OEM (06-24-2015) https://my.vmware.com/group/vmware/details?productId=491&downloadGroup=DT-ESXI60-LSI-LSI-MR3-6608090...

6.609.08.00-1OEM (09-30-2015) https://my.vmware.com/group/vmware/details?downloadGroup=DT-ESXI60-LSI-LSI-MR3-66090800-1OEM&product...

6.608.11.00-1OEM (10-23-2015) https://my.vmware.com/group/vmware/details?downloadGroup=DT-ESXI60-LSI-LSI-MR3-66081100-1OEM&product...

6.608.12.00-1OEM (11-27-2015) https://my.vmware.com/group/vmware/details?productId=491&downloadGroup=DT-ESX60-LSI-LSI-MR3-66081200...

6.610.17.00-1OEM (01-01-2016) https://my.vmware.com/web/vmware/details?downloadGroup=DT-ESXI60-LSI-LSI-MR3-66101700-1OEM&productId...
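
To confirm which of these a host is actually running before and after an update (a sketch - the offline bundle filename is just a placeholder):

# installed VIB and loaded module version
esxcli software vib list | grep -i lsi-mr3
esxcli system module get -m lsi_mr3 | grep -i version

# installing one of the async drivers above from its offline bundle zip
esxcli software vib install -d /vmfs/volumes/datastore1/lsi-mr3-offline-bundle.zip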

I do hope we get an update from VMware or Dell soon regarding these. When the problem manifests, I see hundreds of disk resets in a short timeframe from the iDRAC/Lifecycle Controller, followed by the RAID card becoming non-responsive, which ends up dropping the disk group.

I haven't seen the PSOD since my last post, after following cdekter's advice about not having VMFS on the same controller, but I still experience disk groups dropping. It happens infrequently, maybe every 35 days or so. For the last crash a week ago I wasn't able to capture logging info (didn't have remote syslog configured correctly), so I don't think it would be too helpful to open a case without logs. On the next crash I'm reopening a case with VMware/Dell (I have 7 servers with 38 days of uptime now, so probably any day now).

haftic27
Contributor

Experienced one more PSOD after 24 days on one host. My VMware ticket was closed with the suggestion to create an RSS feed for the HCL site http://www.vmware.com/resources/compatibility/detail.php?deviceCategory=vsanio&productid=34857&vcl=t...

But there is still no new driver 😉

I am using driver version 6.608.11.00-1OEM on one of my hosts. So far this host is stable, but the uptime is only 25 days so far.

alainrussell
Enthusiast

Another crash - will open another case with Dell.

crash.png

elerium
Hot Shot

Had 2 nodes crash again (no PSOD), but the nodes eventually became non-responsive, with the H730 RAID controller dropping all disk groups on them. Reopened the case with VMware; will see what comes of that.

It starts with lots of these (the lines are repeated hundreds of times for each disk):

2016-02-04T13:32:58.958Z vsan-c02-n02 vmkernel: cpu8:33507)ScsiCore: 1609: Power-on Reset occurred on naa.50000c0f029bce50"


A few hours later, it eventually becomes:

2016-02-04T14:06:09.755Z vsan-c02-n02 vmkernel: cpu8:33507)WARNING: ScsiDeviceIO: 1243: Device naa.50000c0f029b1e58 performance has deteriorated. I/O latency increased from average value of 11894 microseconds to 304153 microseconds.

2016-02-04T14:06:32Z vsan-c02-n02 epd[35550]: 1ed1b156-13c3-9549-292a-ecf4bbe5d688 is stale: LSOM_OBJECT absent w/ fully published disk

2016-02-04T14:06:32Z vsan-c02-n02 epd[35550]: 1ed1b156-9262-9349-1eb7-ecf4bbe5d688 is stale: LSOM_OBJECT absent w/ fully published disk

2016-02-04T14:06:32Z vsan-c02-n02 epd[35550]: 1ed1b156-d5ec-8e49-bbd9-ecf4bbe5d688 is stale: LSOM_OBJECT absent w/ fully published disk

2016-02-04T14:06:32Z vsan-c02-n02 epd[35550]: 0cd1b156-2eb3-514d-21aa-ecf4bbe5d688 is stale: LSOM_OBJECT absent w/ fully published disk

2016-02-04T14:06:32Z vsan-c02-n02 epd[35550]: 0cd1b156-f9af-4f4d-04a8-ecf4bbe5d688 is stale: LSOM_OBJECT absent w/ fully published disk

2016-02-04T14:06:32Z vsan-c02-n02 epd[35550]: 0cd1b156-5392-4b4d-4498-ecf4bbe5d688 is stale: LSOM_OBJECT absent w/ fully published disk

2016-02-04T14:06:32Z vsan-c02-n02 epd[35550]: 01d1b156-3705-636b-7477-ecf4bbe5d688 is stale: LSOM_OBJECT absent w/ fully published disk

2016-02-04T14:06:32Z vsan-c02-n02 epd[35550]: 01d1b156-b364-5f6b-bad6-ecf4bbe5d688 is stale: LSOM_OBJECT absent w/ fully published disk


Then this (the errors repeat indefinitely until the host is non-responsive):

2016-02-04T14:08:20.258Z vsan-c02-n02 vmkernel: cpu5:33229)lsi_mr3: fusionWaitForOutstanding:2573: megasas: [ 5]waiting for 104 commands to complete

2016-02-04T14:11:15.446Z vsan-c02-n02 vmkernel: cpu18:33229)lsi_mr3: fusionWaitForOutstanding:2588: megaraid_sas: pending commands remain after waiting, will reset adapter.

2016-02-04T14:11:15.446Z vsan-c02-n02 vmkernel: cpu18:33229)WARNING: lsi_mr3: fusionReset:2641: megaraid_sas: resetting fusion adapter.

2016-02-04T14:11:25.398Z vsan-c02-n02 vmkernel: cpu17:33229)WARNING: lsi_mr3: megasas_transition_to_ready:1735: megasas: Waiting for FW to come to ready state

2016-02-04T14:11:32.052Z vsan-c02-n02 vmkernel: cpu5:33229)WARNING: lsi_mr3: megasas_transition_to_ready:1848: megasas: FW now in Ready state

2016-02-04T14:11:32.102Z vsan-c02-n02 vmkernel: cpu5:33229)WARNING: lsi_mr3: mfiDMAAlloc:205: mfi: failed to allocate 124 DMA buffer. Out of memory.

2016-02-04T14:11:32.102Z vsan-c02-n02 vmkernel: cpu5:33229)WARNING: lsi_mr3: mfiGetAdapterInfo:1207: Failed to alloc mem for controller info

2016-02-04T14:11:32.102Z vsan-c02-n02 vmkernel: cpu5:33229)WARNING: lsi_mr3: fusionReset:2800: Failed to get adapter info

2016-02-04T14:12:23.636Z vsan-c02-n02 vmkwarning: cpu4:4610841)WARNING: lsi_mr3: fusionReset:2759: megaraid_sas: fusionIocInit() failed!

2016-02-04T14:12:57.701Z vsan-c02-n02 vmkernel: cpu27:4610841)WARNING: lsi_mr3: fusionReset:2825: megaraid_sas: Reset failed, killing adapter.

2016-02-04T14:12:57.701Z vsan-c02-n02 vmkernel: cpu27:4610841)lsi_mr3: mfi_TaskMgmt:259: Processing taskMgmt abort for device: vmhba0:C0:T4:L0

2016-02-04T14:12:57.701Z vsan-c02-n02 vmkernel: cpu27:4610841)lsi_mr3: mfi_TaskMgmt:267: ABORT

2016-02-04T14:12:58.701Z vsan-c02-n02 vmkwarning: cpu2:4610842)WARNING: lsi_mr3: fusionWaitForOutstanding:2562: megasas: Found FW in FAULT state, will reset adapter.

2016-02-04T14:12:58.701Z vsan-c02-n02 vmkernel: cpu2:4610842)WARNING: lsi_mr3: fusionReset:2641: megaraid_sas: resetting fusion adapter.

2016-02-04T14:12:58.701Z vsan-c02-n02 vmkernel: cpu2:4610842)WARNING: lsi_mr3: fusionReset:2668: megaraid_sas: Reset not supported, killing adapter.


vSphere 6.0u1b, ESXi on 6.0u1a, H730 driver is 6.606.12.00-1OEM; already following KBs 2136374, 2135494, and 2109665, and all hardware + firmware is on the HCL.
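
Since the failure seems to follow the same sequence each time (power-on resets, then latency warnings, then the fusionReset/killing adapter messages), I'm now watching for the early signatures with something like this (a sketch - standard ESXi log locations assumed):

# count power-on resets and look for adapter reset/fault messages in the live vmkernel log
grep -c "Power-on Reset occurred" /var/log/vmkernel.log
grep -E "fusionReset|Found FW in FAULT state|killing adapter" /var/log/vmkernel.log

# check whether any VSAN disks have dropped out of CMMDS after an event
esxcli vsan storage list | grep -E "Device:|In CMMDS"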

justinbennett
Enthusiast

We had another all-disk-group failure on Saturday ourselves. One disk didn't come back online and Dell replaced it. I haven't heard back on the possible reason(s) why - I uploaded logs to my case earlier in the week. I'll let you know what I hear.

2016-02-04 15_27_35-Clipboard.png

JohnNicholsonVM
Enthusiast

Quick question: are you using the PERC H730, or the H730 Mini (the weird mezzanine card one)?

Also, you said Dell replaced it - what did they replace (the controller, the mid-plane, a full hero kit)?

alainrussell
Enthusiast

We're using the Mini:

PERC H730 Mini (Embedded)

controller.png

cdekter
VMware Employee

Dell has a beta driver available that resolves this particular PSOD/crash. I would suggest requesting that your Dell support case be escalated to Dell IPS as their escalated/high level support should be well aware of this fix.

alainrussell
Enthusiast

Thanks, have requested it on my support case.

cdekter
VMware Employee

You're welcome, Alain. Further, if you have a VMware support case and know the number, I can try to speed up the process of obtaining that driver for you. If you'd like to keep it private, you can send it to me via direct message on this board.

JohnNicholsonVM
Enthusiast

Do you have any SNMP/OpenManage logs for what the temperature was like on the PERC Mini when things went south? I'm curious if it spiked. Also, what are you seeing as the normal operating temperature range?
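
If it helps, the probes can be pulled from the iDRAC or OMSA after the fact (a sketch - whether the PERC itself exposes a temperature probe depends on the platform):

# list temperature sensors and dump the system event log via iDRAC
racadm getsensorinfo
racadm getsel

# or, if OMSA is installed, the chassis temperature probes
omreport chassis temps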

alainrussell
Enthusiast

Thanks for the offer - our VMware support is provided by Dell as part of our ProSupport, so I'm not sure it'll help.
