VMware Cloud Community
justinbennett
Enthusiast

VSAN Node Crashed - R730xd - PF Exception 14 in world 33571: Cmpl-vmhba0- IP 0x41802c3abd44 addr 0x50

Anyone had a similar issue?

Host has a PERC H730p controller. Looks like the disks were resetting prior to the crash according to the system's lifecycle controller.

On call with Dell Support. Planning on a VMware support call too.

Thanks in advance!

Justin

2015-11-02 22_00_41-- Remote Desktop Connection.png

vsan.png

2015-11-02 22_29_53-- Remote Desktop Connection.png

102 Replies
alainrussell
Enthusiast

No details on the card temp, but our operating temperatures look pretty low.

temp.png

0 Kudos
RS_1
Enthusiast

Hi folks, yesterday I got a message from VMware telling me that Dell now suspects the SSD/HDD/backplane firmware. After many attempts through several channels, I was not able to get my hands on the famous beta driver that was supposed to fix this issue. Anyway, no PSOD since I applied these advanced settings in October: http://itsintehcloud.com/2015/07/dell-r730h730-p-raid-controller-firmware-issues-with-all-flash-vsan...

0 Kudos
alainrussell
Enthusiast

cdekter, can you confirm that the beta driver version you mentioned is PERC9.2-6.606.14.00? I've been provided one by Dell but want to confirm it's the correct version.

0 Kudos
cdekter
VMware Employee

alainrussell That would indeed appear to be the correct version.

0 Kudos
elerium
Hot Shot

RS_1 wrote: Hi folks, yesterday I got a message from VMware telling me that Dell now suspects the SSD/HDD/backplane firmware. After many attempts through several channels, I was not able to get my hands on the famous beta driver that was supposed to fix this issue. Anyway, no PSOD since I applied these advanced settings in October: http://itsintehcloud.com/2015/07/dell-r730h730-p-raid-controller-firmware-issues-with-all-flash-vsan...

I got the same reply from Dell and no beta driver. VMware had me apply advanced settings as well, although slightly different from the ones in your link:

esxcfg-advcfg -s 110000 /LSOM/diskIoTimeout

esxcfg-advcfg -s 1 /LSOM/diskIoRetryFactor

VSAN is a pretty amazing product. If I could just stop disk groups from being dropped by the H730 controller every month, it would be golden.

0 Kudos
alainrussell
Enthusiast

Looks like the VSAN HCL was updated recently (possibly today), as I can see some tweets about changes people have seen with HP hardware. Any ETA on the Dell driver updates?

0 Kudos
cdekter
VMware Employee

A new version of the lsi_mr3 driver for Dell is currently undergoing certification testing and should be placed on the HCL soon if there are no issues found.

0 Kudos
RS_1
Enthusiast

I'm sorry to tell you this, Chris, but I've been hearing that since VMworld...

alainrussell
Enthusiast

We had another issue today on our problem node. It was not a crash like we had with the previous driver, but VSAN reported some issues with disks and the host became unresponsive. Looking at the logs, it was looping through issues with the driver (same as this KB): VMware KB: Using a Dell Perc H730 controller in an ESXi 5.5 or ESXi 6.0 host displays IO failures or...

Our only option to fix it was a power reset on the host. No VMs would migrate, the host would not enter maintenance mode, and the VSAN health checks etc. would not load; they all timed out.

I'll revert to the HCL driver now. I'd prefer a crash to the issue we had today.

0 Kudos
discombob93
Contributor

Unfortunately, we are experiencing the same issue: a PSOD related to vmhba0 every few weeks on (so far) 5 of our 18 R730xd servers. After 6+ months of troubleshooting, we still aren't able to put this investment into production. I'm currently waiting to hear back from Dell.

Controller:

R730xd with PERC H730 Mini (Embedded), firmware: 25.3.0.0016

LSI_MR3 ESXi driver at version 6.606.12.00-1OEM.600.0.0.2159203

Enclosure:

BP13G+EXP with firmware 3.03

Disks used:

SSD cache
  Part Number: PH0HKK8C2640252O098IA00
  Manufacturer: TOSHIBA
  Product ID: PX02SMF040
  Revision: A3AF
  Serial Number: 25O0A07GT0QB
  Manufactured Day: 0
  Manufactured Week: 9
  Manufactured Year: 2015

Spindles
  Part Number: CN0RMCP3726224CP07XQA01
  Manufacturer: SEAGATE
  Product ID: ST1200MM0007
  Revision: IS06
  Serial Number: S3L180TC
  Manufactured Day: 7
  Manufactured Week: 52
  Manufactured Year: 2014

VMware SR#: 16898741702

Dell support SR#: 926404432

0 Kudos
cdekter
VMware Employee

Hi again folks,

Here is a configuration change we have been working on refining. It will be added to the published KB articles relating to the H730 controller today. This configuration change could potentially mitigate several of the symptoms, specifically drives going offline as well as the periodic PSOD where you see completeCmdFusion referenced in the PSOD stack.

esxcfg-advcfg -s 110000 /LSOM/diskIoTimeout

esxcfg-advcfg -s 1 /LSOM/diskIoRetryFactor

These changes become effective immediately and will persist between reboots. At any time you can revert back to the default values as follows:

esxcfg-advcfg -s 20000 /LSOM/diskIoTimeout

esxcfg-advcfg -s 3 /LSOM/diskIoRetryFactor

There is no risk in applying the updated configuration values in terms of performance or data redundancy. This change only adjusts the VSAN IO retry behavior to be more lenient.
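To confirm the values on a given host, something like the following could be run from the ESXi shell. This is only a minimal sketch: it assumes `esxcfg-advcfg -g <option>` prints a single line whose last field is the value (for example "Value of diskIoTimeout is 110000") and that `awk` is available in the shell. The function names and the `ADVCFG` variable are made up for illustration and are not part of any VMware tooling.

```shell
#!/bin/sh
# Sketch only: compare the current LSOM values on this host with the
# values suggested in this thread. Nothing is changed; this only reads.

# ADVCFG is a hypothetical override so the functions can be exercised
# off-host; on ESXi it defaults to the real esxcfg-advcfg command.
ADVCFG=${ADVCFG:-esxcfg-advcfg}

get_value() {
    # Assumption: "-g" prints one line whose last field is the value,
    # e.g. "Value of diskIoTimeout is 110000".
    "$ADVCFG" -g "$1" | awk '{ print $NF }'
}

check_setting() {
    # $1 = option path, $2 = expected value
    current=$(get_value "$1")
    if [ "$current" = "$2" ]; then
        echo "OK $1 = $current"
    else
        echo "MISMATCH $1 = $current (expected $2)"
        return 1
    fi
}

check_lsom() {
    # Check both options and report failure if either one differs.
    rc=0
    check_setting /LSOM/diskIoTimeout 110000 || rc=1
    check_setting /LSOM/diskIoRetryFactor 1 || rc=1
    return $rc
}

# Run the check only where the command actually exists (an ESXi host).
if command -v "$ADVCFG" >/dev/null 2>&1; then
    check_lsom
fi
```

A host still on the defaults should report MISMATCH for both options, since the defaults are 20000 and 3.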

alainrussell
Enthusiast

Thanks, I'll apply this to our problem node today.

Any word on the updated drivers as well? I'm guessing they will come with Update 2.

0 Kudos
cdekter
VMware Employee

I would advise applying the configuration change to all nodes for best results, as you may find the problem moves around - it doesn't appear to be a specific hardware defect that would be isolated to one machine.

There should be an update to the VSAN HCL separate from the upcoming release to add a new driver version for the H730 controller. Unfortunately there is currently no ETA for this but it is naturally at the top of the priority list.

0 Kudos
elerium
Hot Shot

Thanks for the update on drivers & HCL.

So far, with the LSOM timeout changes since last month, I have two nodes with uptime of 60+ days (one at 61 and one at 85 days!). The previous record on my cluster was 45 days of uptime before I would see the RAID card/drives drop out.

0 Kudos
alainrussell
Enthusiast

Ok, thanks - have done (fingers crossed).

0 Kudos
cdekter
VMware Employee

Thank you elerium, that's an excellent data point - much appreciated.

0 Kudos
JohnNicholsonVM
Enthusiast

Anyone interested in testing a script to check (and set) this across the cluster?

NickBowie
Enthusiast

Probably not the most elegant, but this should do it:

# Replace <your_cluster_name> with the name of your cluster (keep the quotes).

Get-Cluster -Name "<your_cluster_name>" | Get-VMHost | Get-AdvancedSetting -Name LSOM.diskIoTimeout | Set-AdvancedSetting -Value 110000 -Confirm:$false

Get-Cluster -Name "<your_cluster_name>" | Get-VMHost | Get-AdvancedSetting -Name LSOM.diskIoRetryFactor | Set-AdvancedSetting -Value 1 -Confirm:$false

0 Kudos
DrewDeM
Enthusiast

I saw that the HCL does not show any of the H730 controllers as supported for version 6.0 U2. Is this an oversight? Hopefully the controllers will be added shortly and VMware hasn't decided to drop support for this whole family of I/O controllers. Does anyone know the answer to this?

0 Kudos