VMware Cloud Community
justinbennett
Enthusiast

VSAN Node Crashed - R730xd - PF Exception 14 in world 33571: Cmpl-vmhba0- IP 0x41802c3abd44 addr 0x50

Anyone had a similar issue?

The host has a PERC H730p controller. Looks like the disks were resetting prior to the crash, according to the system's Lifecycle Controller.

On call with Dell Support. Planning on a VMware support call too.

Thanks in advance!

Justin

[Attachments: 2015-11-02 22_00_41-- Remote Desktop Connection.png, vsan.png, 2015-11-02 22_29_53-- Remote Desktop Connection.png]

102 Replies
alainrussell
Enthusiast

We got an update on our Dell support case today; details below. Based on this we've updated the problem host we had. It's also running the D414 firmware update (SanDisk SSDs).

We have now received confirmation from the PERC engineering team and VMware engineering that the latest driver and firmware are tested to work on VSAN. VMware expects the new PERC firmware and driver to be posted on the vSAN HCL for ESXi 6.0 and 6.0 U1 next week or the week after next. In the meantime, VMware engineering is happy for customers to start deploying the new firmware/driver.

Driver Download link: https://my.vmware.com/web/vmware/details?downloadGroup=DT-ESXI60-LSI-LSI-MR3-69038200-1OEM&productId... 

Firmware Download link: http://downloads.dell.com/FOLDER03512329M/2/SAS-RAID_Firmware_TMKHJ_WN32_25.4.0.0015_A05.EXE 
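
For anyone applying this when it shows up on the HCL: the driver link above is a normal offline bundle, so (assuming the usual VIB workflow; the path below is just a placeholder) the install is roughly:

esxcli software vib install -d /vmfs/volumes/datastore1/lsi-mr3-offline-bundle.zip

followed by a host reboot, with the host in maintenance mode first. The PERC firmware itself is a Dell update package and gets applied through iDRAC/Lifecycle Controller rather than through esxcli.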

elerium
Hot Shot

That's great news! I'll probably still wait for the official HCL to be updated before applying, but I'm guessing 6.2 support is around the corner.

kelwood
Enthusiast

It looks as though a resolution is on the horizon, but there's still nothing official in the HCL. Is there any new information out there pertaining to the timing of an updated HCL?

alainrussell
Enthusiast

Looks like there is a new article on the H730 (as well as an update to the original) which has a different setting for /LSOM/diskIoRetryFactor.

Original Article Updated: VMware KB: Using a Dell Perc H730 controller in an ESXi 5.5 or ESXi 6.0 host displays IO failures or...

New Article: VMware KB: Required VSAN and ESXi configuration for controllers based on the LSI 3108 chipset

The updated recommended settings are:

esxcfg-advcfg -s 100000 /LSOM/diskIoTimeout

esxcfg-advcfg -s 4 /LSOM/diskIoRetryFactor
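
If you want to double-check what a host is currently set to before and after making the change, the same tool reads the values back with -g:

esxcfg-advcfg -g /LSOM/diskIoTimeout

esxcfg-advcfg -g /LSOM/diskIoRetryFactor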

discombob93
Contributor

After almost a year of problems, we have now had 18 R730xd VSAN hosts running stable for 45 days without a PSOD on the PERC H730 Mini with firmware 25.3.0.0016 and the LSI_MR3 ESXi driver at 6.606.12.00-1OEM.600.0.0.2159203. Drives are Toshiba PX02SMF040 rev A3AF and Seagate ST1200MM0007 rev IS06.
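
For anyone comparing notes, the installed lsi_mr3 driver VIB can be read straight off the host; the controller firmware level has to come from iDRAC/Lifecycle Controller or the controller BIOS instead:

esxcli software vib list | grep -i lsi-mr3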

LSOM timeout values are at the original recommendation of:

esxcfg-advcfg -s 110000 /LSOM/diskIoTimeout

esxcfg-advcfg -s 1 /LSOM/diskIoRetryFactor

Interesting that they have now decided to bring the diskIoRetryFactor up to 4 (the default ESXi image setting is only 3)... does anyone know what this variable is actually doing? That seems like a substantial change.

Stability is a step forward, but we really need the ability to patch and update without worrying that an upgrade is going to take down our business, so we're still waiting on a new driver certified for 6.2 and beyond, and we're reluctant to move any critical workloads to VSAN from our legacy SANs until it comes down the VUM channel.

zdickinson
Expert

Good afternoon, we did vSAN for DR.  It taught me that vSAN was not prod ready for us, to be managed by us.  It also taught me that I like converged infrastructure.  It also, also taught me that I like vSAN managed by someone else.  I'm interested to see where EMC VxRAIL goes.  That might be our next iteration of infrastructure.  Thank you, Zach.

elerium
Hot Shot

From my troubleshooting and research, the timeout values are a workaround to a problem/bug in the LSI 3108 chipset (used in H730 controllers). In periods of high IO, or when the controller can't communicate with the underlying disks, the controller issues a power-on disk reset to resolve the problem; this happens infrequently and in normal operation it shouldn't cause issues. For this chipset, however, the disk reset seems to be problematic, and some combination of driver/firmware issues causes the disk reset to be issued continuously. I've observed that after a large number of power-on reset attempts, the controller itself attempts to reset. It's ugly when this happens: the controller crashes and disk groups drop.

The esxcfg-advcfg timeouts appear to be a workaround so the RAID card does not issue the power-on reset anywhere near as frequently, basically telling the RAID controller to wait much longer before a reset occurs. In turn this prevents the controller crash and the dropping of disk groups. A disk that isn't responding because of high I/O or some other reason would usually respond within 100000 ms, so this seems like a safe workaround. The downside is that if you have a disk that's about to die (dying disks have a symptom of stalling randomly for long periods), this setting would probably let that slide instead of failing the disk outright. The end result on dying disk hardware would be inconsistent I/O performance until the disk actually dies.

VMware support shared that this issue occurs on all RAID cards utilizing the LSI 3108, which Dell, IBM and Cisco all use, and that the timeout settings are supposed to be applied to any VSAN installation utilizing LSI 3108-based RAID controllers. I do believe the underlying problem is the LSI firmware/driver for this particular chipset. Supposedly a permanent fix may be ready by the end of May.

kreitzer
Contributor

Thank you for your analysis, elerium! Going back through our logs, things definitely match your understanding. Thanks to you we now have an early-warning system for these crashes: when we start seeing mass disk resets we can evacuate and reboot hosts. According to our logs, these start hours before a crash.
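
If anyone wants a rough version of that early-warning check, something like the following from the ESXi shell will surface the reset storms - the exact message text varies by driver version, so treat the pattern as a starting point and tune it against your own logs:

grep -ic reset /var/log/vmkernel.log

tail -f /var/log/vmkernel.log | grep -i reset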

I share your concern over the changes VMware has requested. In our last support case regarding this issue it was recommended we maintain the timeout settings even after applying the updated driver and firmware. I'm hesitant to, for exactly the reasons you brought up. Have you or anyone else had a similar recommendation from VMware?

As far as you know, do other users of the LSI 3108 RoC implement the same pass-through feature? I get the impression the fairly green code in this feature is what we're all suffering from. I wonder whether a non-pass-through implementation would suffer similar problems.

JohnNicholsonVM
Enthusiast

I understand there are firmware updates coming for other product families. Dell was unique in that they didn't offer a certified 12Gbps SAS HBA (the HBA 330 will hopefully be certified soon), so it had the largest concentration of 3108 users.

elerium
Hot Shot

So far I've been issue-free on VSAN 6.1 since applying the timeouts; prior to applying them, these hosts would crash every 30-45 days. I would strongly recommend applying the timeout settings, since potential performance problems that may or may not materialize still beat guaranteed crashes every 30-45 days. I have hosts running issue-free for 136 days now.

I believe the new timeout settings just wait 100s with 4 retries. I have seen varying timeout values on other storage systems between 20s and 180s, so I don't think 100s is anything to be worried about.
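
Spelled out, diskIoTimeout at 100000 ms is 100 s per attempt; if the retry factor simply multiplies attempts (my reading, not something I've confirmed), the worst case before LSOM gives up on an I/O is on the order of 4 x 100 s = 400 s, versus roughly 110 s under the original 1-retry recommendation.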

Regarding the LSI 3108 RoC, from the documents on the LSI website I don't see that cards released with this chipset natively do passthrough. I believe the OEMs, perhaps working in combination with LSI, are providing the passthrough feature on the cards they are releasing.

JohnNicholsonVM
Enthusiast

Technically it was an undocumented feature in the 2208 chipsets (the 6Gbps MegaRAID product). It caused disk failures after ~45 days if you used it and put heavy load on the controllers.

This is why all of the 6Gbps MegaRAID products have a RAID-0 certification only.

As far as OEM/LSI relationships go, I can't comment on that.

elerium
Hot Shot

I did initially implement VSAN 6.0 using RAID0/non-pass-through when the H730 passthrough mode was a super buggy mess and slow. The RAID0 setup worked and was stable. Dell did fix the majority of passthrough-mode issues in a firmware release last year, and the VMware HCL dropped support for RAID0 on the H730. While it does work, it's not supported on the HCL or by VMware, so I wouldn't recommend it.

JohnNicholsonVM
Enthusiast

The H730 was never submitted by Dell for RAID-0 certification as far as I am aware (and I've spent a lot of time looking at the VCG).

elerium
Hot Shot

You are right; it looks like the H730 was never certified for RAID0. Some of us used it to get around all the crashes/problems when it was first added to the HCL last year. I really don't know how the first certified firmware from a year ago made it onto the HCL (since updated/fixed); it was so incredibly buggy and slow. I do not miss those times... Dell PERC H730p / LSI 3108 / Invader implementations

JohnNicholsonVM
Enthusiast

I know of a few customers that moved to RAID-0 to work around it. I heard of one still having some issues, and Dell never chose to submit it for certification with that configuration (remember, certifications are a "push" system, not a "pull"). That said, I've had private conversations with someone not using VSAN but using RAID mode on this controller, and they were experiencing reproducible, even more severe problems that were likely the result of this same issue (they will be testing following this firmware).

As far as what has been done to make sure something doesn't slip through again, here are a few of my thoughts.

New tests have been added to certification, and a lot has been learned about what to look for since then. Learning how to recreate this bug added some interesting new things to look for.

Staff with specialties in hardware (from the semiconductor/flash vendors) have been added in engineering, and staff in general have been added to the certification team since its beginning.

VMware engineering is working even more closely with the OEMs and with the back-end controller manufacturers. A product with 3500 customers, to be frank, gets a lot more attention than one with a few hundred. Talking with others in the software-defined storage industry, VSAN was not the only one to run into this issue (just being #1 makes it more visible to our support and engineering).

Now, there will always be hardware bugs (NVMe isn't nearly as battle-tested as SCSI), but there are a few bright spots to look forward to going forward.

1. This is likely the last serious generation of SATA/SAS. While it will stick around (for capacity devices) for quite a while, I don't expect anything new to be done, so things should only get better.

2. NVMe eliminates the controller middleman. The controller is baked into the drive, and Intel/Micron/Samsung/SanDisk etc. will "own" the driver/firmware end to end.

3. The relationships built over the past 2 years by VSAN engineering with the OEMs and controller vendors are only going to get stronger.

4. As the customer count goes up, so does "herd immunity".

5. I suspect some of the driver/firmware issues from 5.5-6.x were the result of IO devices largely shifting to native drivers (from the legacy Linux shim driver system). There were some changes (and some loss of herd immunity). As more 5.5 and 6.x customers use these drivers, a lot has been "shaken out" in the ecosystem, and hopefully this transition is complete.

There are also some things I can't talk about (well, for a bit) that will help reinforce VMware's commitment to maintaining a diverse supply chain across all of the major OEMs and storage device vendors, while also helping customers maintain consistency of results.

I personally was a customer/partner who had to deal with the impact of this bug (I joined the team 6 months ago). I understand what you've been dealing with, and I can honestly say there was a strong commitment to get this fixed and make sure it doesn't happen again.

elerium
Hot Shot

I initially adopted VSAN knowing there would be some issues with "bleeding edge" new implementations. I've run into a few software bugs, mostly minor. The major pain has been with RAID controller hardware issues, with Dell being the least helpful. For the most part I've been very happy with VMware's commitment/response to all issues I've raised.


Thank you for the response/feedback and comments on the future outlook. I really like the VSAN product despite all the issues I've seen; in terms of price/performance it's been terrific for us.

RS_1
Enthusiast

Hi guys, I just applied the new lsi_mr3 driver version 6.903.85.00-1OEM.600.0.0.2768847 with the new SAS-RAID_Firmware_VH28K_LN_25.4.0.0017_A06.BIN firmware on one cluster node and ended up with a lot of errors like these:

vmkernel: cpu2:1384458)DOM: DOMOwner_SetLivenessState:3889: Object 0c342357-6b46-3acc-7c4d-44a842148f4f lost liveness [0x439e27ac3640

vmkwarning: cpu29:35176 opID=d989d7ea)WARNING: VSAN: VsanIoctlCtrlNode:1909: a6302357-114a-30f4-5b6d-44a842148f4f: Failed to initialize the object : Not found

vmkernel: cpu34:35801 opID=3510649f)Vol3: 950: Unable to register file system 5d0a0557-ef34-7ebe-4af8-44a842148f4f for APD timeout notifications: Inappropriate ioctl for device

vmkernel: cpu45:33517)NMP: nmp_ResetDeviceLogThrottling:3345: last error status from device naa.644a84201ead82001e89676d2b69d0dc repeated 14 times

vmkernel: cpu35:33595)ScsiDeviceIO: 2645: Cmd(0x43a7000d1a00) 0x1a, CmdSN 0x91ad5 from world 0 to dev "naa.644a84201ead82001e8957d83dab073a" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.

vmkernel: cpu28:33595)NMP: nmp_ThrottleLogForDevice:3178: Cmd 0x1a (0x43a601964480, 34503) to dev "naa.644a84201ead82001e895bf77c948e57" on path "vmhba0:C2:T5:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0. Act:NONE

The server never actually finishes booting and loops on VSAN and disk errors and hostd alerts:

vmkernel: cpu20:36121)ALERT: hostd detected to be non-responsive

I'm trying to roll back to the previous versions and will let you know.

FYI, I use RAID0 mode on this cluster.

JohnNicholsonVM
Enthusiast

Considering RAID-0 wasn't tested for this firmware (only pass-through, which is the only certified option), it may be worth switching hosts back to pass-through.

RS_1
Enthusiast

Actually it was related to the network firmware update I did alongside the H730 upgrade; now that the other node has the same firmware, it's OK...

[Attachment: chrome_2016-04-30_21-09-33.png]
