Anyone had a similar issue?
Host has a PERC H730p controller. Looks like the disks were resetting prior to the crash according to the system's lifecycle controller.
On call with Dell Support. Planning on a VMware support call too.
Thanks in advance!
Justin
No details on the card temp, but our operating temperatures look pretty low.
Hi folks, yesterday I got a message from VMware telling me that Dell now suspects the SSD/HDD/backplane firmware. After many attempts from several sides, I was not able to get my hands on the famous beta driver that was supposed to fix this issue. Anyway, no PSOD since I applied those advanced settings in October: http://itsintehcloud.com/2015/07/dell-r730h730-p-raid-controller-firmware-issues-with-all-flash-vsan...
cdekter, can you confirm the beta driver version you mentioned is PERC9.2-6.606.14.00? (I've been provided one by Dell but want to confirm it's the correct version.)
alainrussell That would indeed appear to be the correct version.
I got the same reply from Dell and no beta driver. VMware had me apply advanced settings as well, although slightly different from the ones in your link:
esxcfg-advcfg -s 110000 /LSOM/diskIoTimeout
esxcfg-advcfg -s 1 /LSOM/diskIoRetryFactor
VSAN is a pretty amazing product; if I could just stop disk groups being dropped by the H730 controller every month, it would be golden.
Looks like the VSAN HCL was updated today (or at least recently), as I can see some tweets about changes people have seen with HP hardware. Any ETA on the Dell driver updates?
A new version of the lsi_mr3 driver for Dell is currently undergoing certification testing and should be placed on the HCL soon if there are no issues found.
I'm sorry to say this, Chris, but I have been hearing that since VMworld...
We had another issue today on our problem node. Not a crash as we had with the previous driver, but VSAN reported some issues with disks, and the host became unresponsive. Looking at the logs, it was looping through issues with the driver (same as this KB): VMware KB: Using a Dell Perc H730 controller in an ESXi 5.5 or ESXi 6.0 host displays IO failures or...
Our only option to fix it was a power reset on the host: no VMs would migrate, the host would not enter maintenance mode, and the VSAN health checks etc. would not load; they all timed out.
I'll revert to the HCL driver now; I'd prefer a crash to the issue we had today.
Unfortunately... we are experiencing the same issue, with a PSOD related to vmhba0 every few weeks on (so far) 5 of our 18 R730xd servers. We aren't able to put all this expenditure into production after 6+ months of troubleshooting. I'm currently waiting to hear back from Dell.
Controller:
R730xd with PERC H730 Mini (Embedded), firmware: 25.3.0.0016
LSI_MR3 ESXi driver at version 6.606.12.00-1OEM.600.0.0.2159203
Enclosure:
BP13G+EXP with firmware 3.03
Disks used:
SSD cache
Part Number PH0HKK8C2640252O098IA00
Manufacturer TOSHIBA
Product ID PX02SMF040
Revision A3AF
Serial Number 25O0A07GT0QB
Manufactured Day 0
Manufactured Week 9
Manufactured Year 2015
spindles
Part Number CN0RMCP3726224CP07XQA01
Manufacturer SEAGATE
Product ID ST1200MM0007
Revision IS06
Serial Number S3L180TC
Manufactured Day 7
Manufactured Week 52
Manufactured Year 2014
VMware SR#: 16898741702
Dell support SR#: 926404432
Hi again folks,
Here is a configuration change we have been working on refining. It will be added to the published KB articles relating to the H730 controller today. This configuration change could potentially mitigate several of the symptoms, specifically drives going offline as well as the periodic PSOD where you see completeCmdFusion referenced in the PSOD stack.
esxcfg-advcfg -s 110000 /LSOM/diskIoTimeout
esxcfg-advcfg -s 1 /LSOM/diskIoRetryFactor
These changes become effective immediately and will persist between reboots. At any time you can revert back to the default values as follows:
esxcfg-advcfg -s 20000 /LSOM/diskIoTimeout
esxcfg-advcfg -s 3 /LSOM/diskIoRetryFactor
There is no risk in applying the updated configuration values in terms of performance or data redundancy. This change only adjusts the VSAN IO retry behavior to be more lenient.
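For anyone who wants to keep the two value sets side by side, here is a minimal sketch that only prints the commands quoted above, so they can be reviewed before anything runs. The function name `lsom_commands` is hypothetical. The `-g` lines read each value back and assume the get option of `esxcfg-advcfg` behaves on your build as it does elsewhere.

```shell
# Hypothetical helper: print (do not run) the esxcfg-advcfg commands
# for the requested mode.
#   apply  -> the more lenient values quoted in this post
#   revert -> the default values quoted in this post
lsom_commands() {
    case "$1" in
        apply)  io_timeout=110000; retry_factor=1 ;;
        revert) io_timeout=20000;  retry_factor=3 ;;
        *) echo "usage: lsom_commands apply|revert" >&2; return 1 ;;
    esac
    echo "esxcfg-advcfg -s $io_timeout /LSOM/diskIoTimeout"
    echo "esxcfg-advcfg -s $retry_factor /LSOM/diskIoRetryFactor"
    # Read both values back so the change can be confirmed.
    echo "esxcfg-advcfg -g /LSOM/diskIoTimeout"
    echo "esxcfg-advcfg -g /LSOM/diskIoRetryFactor"
}
```

Running `lsom_commands apply` shows the four commands; piping the output to `sh` on the host would execute them.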
Thanks, I'll apply this to our problem node today.
Any word on the updated drivers as well? (Guessing they will come with Update 2.)
I would advise applying the configuration change to all nodes for best results, as you may find the problem moves around - it doesn't appear to be a specific hardware defect that would be isolated to one machine.
There should be an update to the VSAN HCL separate from the upcoming release to add a new driver version for the H730 controller. Unfortunately there is currently no ETA for this but it is naturally at the top of the priority list.
Thanks for the update on drivers & HCL.
So far, with the LSOM timeout changes applied since last month, I have two nodes with uptime of 60+ days (one at 61 and one at 85 days!). The previous record on my cluster was 45 days max uptime before I would see the RAID card/drives drop out.
Ok, thanks - have done (fingers crossed).
Thank you elerium, that's an excellent data point - much appreciated.
Anyone interested in testing a script to check (and set) this across the cluster?
Probably not the most elegant, but this should do it:
Get-Cluster -name <your_cluster_name> | Get-VMhost | Get-AdvancedSetting -name LSOM.diskIoTimeout | Set-AdvancedSetting -Value 110000 -Confirm:$false
Get-Cluster -name <your_cluster_name> | Get-VMhost | Get-AdvancedSetting -name LSOM.diskIoRetryFactor | Set-AdvancedSetting -Value 1 -Confirm:$false
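The lines above set the values. For the "check" half, a rough sketch from a shell might look like the following. It assumes SSH is enabled on each host and that root login is permitted, which may not be true in your environment. The function name `lsom_check_hosts`, the `SSH_CMD` override, and the host names in the usage note are all hypothetical. It only reads the settings and changes nothing.

```shell
# Hypothetical helper: read both LSOM settings from every host passed as
# an argument. Assumes SSH access as root to each ESXi host.
# SSH_CMD can be overridden (for example SSH_CMD=echo) for a dry run
# that prints what would be executed instead of connecting.
lsom_check_hosts() {
    for host in "$@"; do
        for opt in /LSOM/diskIoTimeout /LSOM/diskIoRetryFactor; do
            # Prefix each result with the host it came from.
            printf '%s: ' "$host"
            ${SSH_CMD:-ssh} "root@$host" esxcfg-advcfg -g "$opt"
        done
    done
}
```

Example: `lsom_check_hosts esx01 esx02` prints two lines per host, one for each setting. With `SSH_CMD=echo` set first, it prints the remote commands without connecting.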
I saw that the HCL does not show any of the H730 controllers as supported for version 6.0 U2. Is this an oversight? Hopefully the controllers will be added shortly and VMware hasn't decided to drop support for this whole family of I/O controllers. Does anyone know the answer to this?