Dell PERC H730p / LSI 3108 /Invader implementation... - Page 4

motorad · ‎12-03-2014

Hello, would anybody happen to have any guidance or a proven config utilizing the PERC H730p/LSI 3108/Invader controller (FW 25.2.1.0037) in pass-thru with VSAN (ESXi 5.5 build 2143827)? We are having stability issues, exhibited as PSODs and intermittent permanent disk failures, on a VSAN platform built on the above in Dell R730 chassis with Fusion-io ioScale fronted Seagate 10k v7 ST1200MM0007 disk groups.

Common log events include “firmware in fault state” for the HBA and resets and aborts for the individual disks. Errors increment in the individual drive counters correlating with these events.

We have tried different HBA drivers, from the inbox mr3 (0.255.03.01-2) to the latest known PERC9 driver (6.901.55.00.1 - currently evaluating), including some of the mr3/megaraid drivers in between (6.605.10.00-1, 06.803.52.00, 06.803.73.00). The fallback of RAID0 has passed tests so far, but we all know what that means.

We know this configuration is not currently listed on the HCL. We do have cases currently open with VMware and Dell, and are in communication with LSI.

Any guidance would be greatly appreciated.

cdekter · ‎07-31-2015

Hi everyone,

I understand the frustration that has been felt given the seriousness of the problem. Please be assured that engineering at the highest level has been engaged in resolving this issue: we have been working closely with Dell these past several days to come up with a resolution plan for the issues that have been reported in this thread with using the H730 series controllers with VSAN in pass-through mode. I have also personally tried to ensure that anyone with a support ticket open having these symptoms had the very latest information straight from engineering.

I am pleased to report that as of today the VSAN VCG has been updated to reflect new recommended driver and firmware versions for these controllers, which should resolve the symptoms reported in this thread: VMware Compatibility Guide: vsan

Here are the new recommended versions:

H730 controller series with ESXi 5.5u2
New recommended firmware version: 25.3.0.00016
New recommended driver: megaraid_perc9 version 6.902.73.00-1OEM

H730 controller series with ESXi 6.0
New recommended firmware version: 25.3.0.00016
New recommended driver: lsi_mr3 version 6.606.12.00-1OEM
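
For anyone tracking several hosts, the recommendations above can be written down as a small lookup and compared against what a host reports. This is only a sketch: the dictionary layout and the `check_host` helper are made up for illustration and are not part of any VMware or Dell tool, and you would still need to collect the installed versions from each host yourself.

```python
# Recommended versions exactly as quoted in the post above.
# The structure and function name are illustrative assumptions.
RECOMMENDED = {
    "ESXi 5.5u2": {
        "firmware": "25.3.0.00016",
        "driver": "megaraid_perc9",
        "driver_version": "6.902.73.00-1OEM",
    },
    "ESXi 6.0": {
        "firmware": "25.3.0.00016",
        "driver": "lsi_mr3",
        "driver_version": "6.606.12.00-1OEM",
    },
}


def check_host(esxi_release, firmware, driver, driver_version):
    """Return a list of mismatches; an empty list means the host matches."""
    want = RECOMMENDED.get(esxi_release)
    if want is None:
        return [f"no recommendation recorded for {esxi_release}"]
    have = {
        "firmware": firmware,
        "driver": driver,
        "driver_version": driver_version,
    }
    return [
        f"{key}: have {have[key]}, want {want[key]}"
        for key in want
        if have[key] != want[key]
    ]


# Firmware and driver version from the first post in this thread; the
# module name for that driver version is assumed to be megaraid_perc9.
for problem in check_host("ESXi 5.5u2", "25.2.1.0037",
                          "megaraid_perc9", "6.901.55.00.1"):
    print(problem)
```

An exact string comparison is used on purpose, since the recommendation names specific builds rather than a minimum version.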

madnote · ‎07-31-2015

Well it was a late night but finished moving the new drives onto the hosts. This would have been a lot less risky with 4 hosts. Certainly gonna request another one for next year.

Here is a before and after 10min Proactive Virtual SAN Storage Performance test sample (Going from SSD SATA to higher write IO SAS Drives) running the Performance characterization - 70/30 read/write mix, realistic, optimal flash cache usage:

Before:

VMDK Disk Number	Duration (sec)	IOPS	Throughput MB/s	Average Latency (ms)	Maximum Latency (ms)
0	600	1051	4.11	1.71	154.51
1	600	1082	4.23	1.59	157.63
2	600	1069	4.18	1.67	161.6
3	600	1234	4.82	1.52	67.94
4	600	1110	4.34	1.57	166.43
5	600	1231	4.81	1.46	79.99
6	600	1055	4.12	1.68	157.41
7	600	1057	4.13	1.69	158.72
8	600	1303	5.09	1.43	152.87
9	0	1075	4.2	1.64	160.87

After:

VMDK Disk Number	Duration (sec)	IOPS	Throughput MB/s	Average Latency (ms)	Maximum Latency (ms)
0	600	1668	6.52	0.9	40.82
1	600	1653	6.46	0.91	40.64
2	600	1814	7.09	0.8	52.16
3	600	1913	7.47	0.76	37.66
4	600	1921	7.5	0.76	43.25
5	600	1821	7.11	0.8	51.66
6	600	1677	6.55	0.9	39.64
7	600	1668	6.52	0.9	45.17
8	600	1831	7.15	0.79	51.99
9	600	1649	6.44	0.91	43.26
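
To put a rough number on the difference, here is a minimal sketch that averages the two tables above. The figures are copied from the tables as posted; the `summarize` helper and the choice of summary statistics are assumptions made for illustration.

```python
from statistics import mean

# Rows copied from the tables above:
# (IOPS, average latency ms, maximum latency ms)
BEFORE = [
    (1051, 1.71, 154.51), (1082, 1.59, 157.63), (1069, 1.67, 161.6),
    (1234, 1.52, 67.94), (1110, 1.57, 166.43), (1231, 1.46, 79.99),
    (1055, 1.68, 157.41), (1057, 1.69, 158.72), (1303, 1.43, 152.87),
    (1075, 1.64, 160.87),
]
AFTER = [
    (1668, 0.9, 40.82), (1653, 0.91, 40.64), (1814, 0.8, 52.16),
    (1913, 0.76, 37.66), (1921, 0.76, 43.25), (1821, 0.8, 51.66),
    (1677, 0.9, 39.64), (1668, 0.9, 45.17), (1831, 0.79, 51.99),
    (1649, 0.91, 43.26),
]


def summarize(rows):
    """Mean IOPS, mean average latency, and worst maximum latency."""
    return {
        "mean_iops": mean(r[0] for r in rows),
        "mean_latency_ms": mean(r[1] for r in rows),
        "worst_latency_ms": max(r[2] for r in rows),
    }


before = summarize(BEFORE)
after = summarize(AFTER)
iops_gain = after["mean_iops"] / before["mean_iops"] - 1

print(f"mean IOPS: {before['mean_iops']:.1f} -> {after['mean_iops']:.1f} "
      f"({iops_gain:.0%} higher)")
print(f"mean latency: {before['mean_latency_ms']:.2f} ms -> "
      f"{after['mean_latency_ms']:.2f} ms")
print(f"worst max latency: {before['worst_latency_ms']} ms -> "
      f"{after['worst_latency_ms']} ms")
```

This is a single 10-minute sample per configuration, so treat the percentages as indicative rather than a controlled comparison.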

madnote · ‎07-31-2015

FYI. It is still being flagged in the VSAN Health Check.

cdekter · ‎07-31-2015

Thank you for letting me know. The Health Check's internal database should update automatically very soon. I'll keep an eye on it and check with the Health Check team to ensure everything updates correctly.

jonretting · ‎07-31-2015

Which firmware/driver combination are you using now, and which did you use for the benchmarks? That is a serious latency improvement, sub-ms it looks like... and your max latency is much lower. Yay SAS. Was anything else going on in the VSAN while you were benchmarking? Try a benchmark while objects are being synced or policies are being changed. Keep up the good work. Thanks, -Jon

elerium · ‎07-31-2015

That is very good news, can you share which issues were resolved specifically?

I had RAID controller resets and stalling occurring from the use of a single SATA drive (used only for ESXi scratch and ISO storage) in HBA mode. This stalling would hang or crash hosts, in addition to the poorer performance observed in HBA mode. Are these issues resolved by the driver combo above? I currently run this combo in RAID0, but I would of course be interested in running in HBA mode for future ease of disk replacement/maintenance.

jonretting · ‎07-31-2015

Sorry if I don't remember, but you are talking about SATA for your storage tier, correct? Thanks, -Jon

elerium · ‎07-31-2015

In my case, I've never used SATA for the storage tier; I am using SAS. However, there is a SATA drive (Dell 500GB magnetic) connected for use as the log/scratch disk, since ESXi doesn't want that on my boot SD card. I was encountering hangs, PSODs and instability from having this single SATA drive in the HBA config. A few weeks ago I rebuilt it all as RAID0, since I was also noticing that HBA mode was noticeably slower in benchmarking than RAID0, although this may have been fixed, since firmware 25.3.0.0016 wasn't released yet when I tested.

Just a pain since I've rebuilt the VSAN twice going from RAID0 to HBA and back and shuffling so many combos of HBA/RAID/driver/firmware settings. Still a great product, the only thing nagging me is not running in HBA which is why I'm interested if this is all fixed now. If so I'd rebuild again to HBA.

jonretting · ‎07-31-2015

That's interesting... Obviously you don't get the PSOD/hangs when running just the SD and no scratch? Is there an onboard SATA controller you could use for both ESXi/scratch? My reasoning here is to rule an assortment of things out. Also, I have occasionally run into issues when booting certain machines into ESXi via UEFI. Personally I ditched the USB/SD card method a while ago in favor of onboard high-temp SLC SATA DOMs. Thanks, -Jon

elerium · ‎07-31-2015

For the Dell R730xd, the H730 controller is the onboard controller, and any disk that is plugged in has to go through it.

I didn't have time to test running just SD and no scratch. I also didn't have a spare SAS drive to swap in. Based on all the logs and data I've collected, the older firmware 25.2.2.0004 probably would not have hung/crashed with all SAS disks connected in HBA mode; however, HBA mode was still noticeably slower. VSAN Observer in RAID0 would show drives maxing out between 350-400 IOPS, whereas in HBA mode 275-300 IOPS was the max, in addition to controller hangs and all the other bad stuff. Other benchmarking that I did between RAID0 and HBA also showed that HBA was 20-25% slower on this older firmware.
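
As a quick sanity check, the per-drive IOPS ranges quoted above line up with the 20-25% figure. A minimal sketch of the arithmetic, assuming the low ends and the high ends of the two ranges are compared with each other (the `slowdown` helper is just for illustration):

```python
def slowdown(raid0_iops, hba_iops):
    """Fractional drop in IOPS when going from RAID0 to HBA mode."""
    return (raid0_iops - hba_iops) / raid0_iops


# Per-drive maximums reported above (VSAN Observer, firmware 25.2.2.0004).
raid0_range = (350, 400)
hba_range = (275, 300)

low = slowdown(raid0_range[0], hba_range[0])   # 350 -> 275
high = slowdown(raid0_range[1], hba_range[1])  # 400 -> 300

print(f"HBA mode slower by {low:.1%} to {high:.1%}")
```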

The original firmware 25.2.1.0037 wouldn't even detect my SAS storage drives in HBA so I did very little testing on this.

jonretting · ‎07-31-2015

You wouldn't happen to have a spare AHCI disk controller you could plop in for testing? Tall order, I know. That way your ESXi/scratch SATA disk would be on that controller, hopefully eliminating something from the equation. Thanks, -Jon

elerium · ‎08-03-2015

Unfortunately don't have a spare disk controller and the cluster is already in use. Am building out another cluster in a month or so with identical hardware but with all SAS drives, will probably test HBA on that buildout.

RS_1 · ‎08-03-2015

As per release_note_lsi-mr3_6.606.12.00-1OEM.600.0.0.2159203.txt:

Bugs fixed (compared to earlier release of driver): None
Known Issues and Workarounds: None
Additional configuration options supported by the driver: None
...

jonretting · ‎08-04-2015

Hmm... Just another idea, what about attaching an iSCSI disk or PXE into ESXi? Thanks, -Jon

cdekter · ‎08-04-2015

Hi Elerium,

The updates should resolve issues that manifest as the controller firmware entering a 'fault state', various IO command aborts, and disks being marked as permanently lost. I'm not aware of a fix for the specific issue you mentioned. However, I should stress that VMware strongly recommends running the controller in the configuration specified on the VSAN HCL - whether that be RAID-0 or HBA mode. For the H730 series, we require that these be configured in HBA mode across the board as this was the mode used to do certification testing for these controllers.

J1mbo · ‎06-01-2016

cdekter, with 6u2, what is the intended behaviour of a guest-issued SCSI device reset command?

It seems that at least with H730 controller, these commands are making it to the physical controller. I wondered if this might be an unintended side effect of something that has been changed because of VSAN.

cdekter · ‎06-01-2016

VSAN will not pass through any device reset commands originating from VMs. What you are observing is most likely originating from other VMFS volumes (e.g. ones used for storing ESX logs) on disks attached to the H730 controller. At this date it is not supported to run any virtual machines on VMFS volumes alongside VSAN on the same controller.

J1mbo · ‎06-01-2016

Thanks. Correct, it is in connection with VMFS volumes (on their own; no VSAN running), but the resets do seem to be passed from guest to hardware.

E.g. running sg_reset -d /dev/sda within a Linux guest on an H730-provided internal datastore, or just rebooting a Linux guest, whilst another VM is running a competing workload on that same datastore, will cause the array IOPS to drop to zero for 5-20 seconds on that host. This is 100% repeatable with this controller.
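
If anyone wants to quantify stalls like this from their own monitoring, here is a rough sketch that finds runs of zero IOPS in a per-second sample series. It assumes one sample per second; the `find_stalls` helper and the sample series are invented for illustration and are not measured data from this host.

```python
def find_stalls(iops_samples, min_seconds=5):
    """Return (start_index, length) for each run of zero-IOPS samples
    lasting at least min_seconds, assuming one sample per second."""
    stalls = []
    run_start = None
    for i, value in enumerate(iops_samples):
        if value == 0:
            if run_start is None:
                run_start = i  # a new zero run begins here
        elif run_start is not None:
            if i - run_start >= min_seconds:
                stalls.append((run_start, i - run_start))
            run_start = None
    # Handle a stall that is still in progress at the end of the series.
    if run_start is not None and len(iops_samples) - run_start >= min_seconds:
        stalls.append((run_start, len(iops_samples) - run_start))
    return stalls


# Synthetic series, NOT measured data: 3 s of load, a 7 s stall, recovery.
samples = [900, 950, 920] + [0] * 7 + [880, 910]
print(find_stalls(samples))
```

The default threshold of 5 seconds mirrors the low end of the 5-20 second range described above; shorter dips are ignored.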

I just wondered if this might be related to the work that has been going on with this controller in connection with VSAN - quite a flurry of firmware and driver updates for it recently.
