VMware Cloud Community
J1mbo
Virtuoso

Dell PERC H730 mini, ESXi 6u2 - Latency, Timeouts, Resets

We have ongoing problems with this combination under VSAN, and I'm now finding (likely related) issues with the Dell PERC H730 mini controller when it is used to provide simple internal datastores: the controller appears to stall periodically, with commands taking as much as 30 seconds to complete. Needless to say, the machines are crippled performance-wise.

Some relevant entries from vmkernel.log:

2016-05-30T17:07:10.033Z cpu0:33230)ScsiDeviceIO: 2613: Cmd(0x439dc0c328c0) 0x2a, CmdSN 0x2b8b from world 32780 to dev "naa.6141877059d6b1001e6327a60b2ff1b7" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

2016-05-30T17:07:10.257Z cpu5:36702)WARNING: VSCSI: 3697: handle 8201(vscsi0:0):WaitForCIF: Issuing reset;  number of CIF:1

2016-05-30T17:07:10.257Z cpu5:36702)WARNING: VSCSI: 2627: handle 8201(vscsi0:0):Ignoring double reset

2016-05-30T17:07:11.309Z cpu5:32794)NMP: nmp_ThrottleLogForDevice:3231: last error status from device naa.6141877059d6b1001e6327a60b2ff1b7 repeated 2560 times

...

2016-05-30T17:07:15.526Z cpu3:36496)lsi_mr3: fusionWaitForOutstanding:2655: megasas: [ 5]sec waiting for 1 commands to complete

2016-05-30T17:07:15.529Z cpu3:36496)VSCSI: 2661: handle 8201(vscsi0:0):Completing reset (0 outstanding commands)

2016-05-30T17:07:17.169Z cpu0:32807)lsi_mr3: mfi_TaskMgmt:313: Processing taskMgmt virt reset for device: vmhba0:C2:T0:L0

2016-05-30T17:07:17.169Z cpu0:32807)lsi_mr3: mfi_TaskMgmt:317: VIRT_RESET cmd # 321473

2016-05-30T17:07:17.169Z cpu0:32807)lsi_mr3: mfi_TaskMgmt:321: ABORT

2016-05-30T17:07:17.169Z cpu0:32807)lsi_mr3: fusionWaitForOutstanding:2655: megasas: [ 0]sec waiting for 2 commands to complete

2016-05-30T17:07:20.035Z cpu4:32891)HBX: 2802: 'datastore1': HB at offset 3227648 - Waiting for timed out HB:

2016-05-30T17:07:20.035Z cpu4:32891)  [HB state abcdef02 offset 3227648 gen 11 stampUS 808213540 uuid 574c7033-8f63506c-f92b-44a8424a559e jrnl <FB 309806> drv 14.61 lockImpl 3]

There are "state in doubt" entries as well. We've observed this on various Rx30 machines with RAID-10 and RAID-50 configurations, all using the read-ahead/write-back cache policy.

This is with vanilla ESXi 6.0 Update 2 (build 3620759), lsi-mr3 driver version 6.903.85.00-1OEM, and controller firmware 25.4.0.0017. System BIOS and other firmware are all current, and all firmware has been re-flashed.
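To gauge how often a host hits these stalls, the messages quoted above can be counted in the log. The following is only a rough sketch: the function name is made up for illustration, the default path /var/log/vmkernel.log is assumed to be where the host keeps its current log, and the search strings are taken directly from the excerpt in this post.

```shell
# Hypothetical helper: count the stall indicators quoted in this post.
# Pass a log path as $1; the default location is an assumption.
summarize_storage_stalls() {
  log="${1:-/var/log/vmkernel.log}"
  # Commands that failed with host status H:0x2
  printf 'failed_cmds=%s\n' "$(grep -c 'failed H:0x2' "$log")"
  # lsi_mr3 waiting for outstanding commands to complete
  printf 'driver_waits=%s\n' "$(grep -c 'fusionWaitForOutstanding' "$log")"
  # Resets issued on behalf of a guest's virtual SCSI device
  printf 'vscsi_resets=%s\n' "$(grep -c 'Issuing reset' "$log")"
  # VMFS heartbeat timeouts on the datastore
  printf 'hb_timeouts=%s\n' "$(grep -c 'Waiting for timed out HB' "$log")"
  # "state in doubt" entries mentioned in the post
  printf 'state_in_doubt=%s\n' "$(grep -c 'state in doubt' "$log")"
}
```

Running `summarize_storage_stalls` before and after a driver change gives a crude comparison; non-zero counts on an otherwise idle host would suggest the problem is still present.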

4 Replies
YanSi
Enthusiast

I also think it might be a driver or firmware problem.

I have also hit an lsi_mr3 error in a VSAN environment:

Dell R730xd on VSAN 6.2 Boot Hung at "vmkfbft loaded successfully"

J1mbo
Virtuoso

It turns out that the issue is within the lsi_mr3 driver, and a solution is to roll back to the ESXi 5.5 megaraid_sas driver, even with controller firmware 25.4.0.0017 on 6.0u2. The same applies when using the Dell customised ISO.

With the lsi_mr3 driver, any guest can cause the above log entries and ultimately DoS an ESXi 6 host running with this controller simply by continually issuing SCSI reset commands, for example using the Linux sg_reset utility. Rebooting a Windows or Linux guest will also cause competing workloads to experience delays of up to 30 seconds in their IO stacks.

This is a serious issue and I've also seen first-hand unexplained array corruption (read: destruction) with the lsi driver on 6.0u2.
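For anyone wanting to check whether a host is affected, the reset behaviour described above could be exercised from inside a Linux test guest roughly as follows. This is a sketch only, not the exact procedure used here: the function name, the device path, the count and the interval are all illustrative, sg_reset comes from the sg3_utils package, and it should only be run on a host with no production workloads.

```shell
# Hypothetical reproduction helper for a Linux *test* guest.
# $1 = SCSI device (e.g. /dev/sda), $2 = number of resets, $3 = pause in seconds.
issue_scsi_resets() {
  dev="$1"
  count="${2:-5}"
  interval="${3:-1}"
  i=0
  while [ "$i" -lt "$count" ]; do
    sg_reset -d "$dev"    # -d requests a device (LUN) reset
    i=$((i + 1))
    sleep "$interval"
  done
}
```

For example, `issue_scsi_resets /dev/sda 10` would send ten resets one second apart; watching vmkernel.log on the host at the same time should show whether the reset and wait messages appear.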

gadvdi
Contributor

Hi, I am having this problem with ESXi 6.0u2, PERC H730 mini firmware 25.5.0.0018 and the lsi_mr3 driver 6.904.43.000. Can you provide more details about how to roll back to the ESXi 5.5 megaraid_sas driver?

J1mbo
Virtuoso

Sorry, only just seen this. First, install the scsi-megaraid-perc9 driver. I'm using 6.901.57.00-1OEM.550.0.0.1331820, but there is a slightly newer version; both are available from the HCL downloads page. To do this, SSH onto the host, copy the downloaded ZIP onto the box, e.g. to /tmp/megaraid.zip, then use esxcli to install it:

esxcli software vib install -d /tmp/megaraid.zip

After that, disable the lsi_mr3 driver:

esxcli system module set --enabled=false --module=lsi_mr3

Then reboot the host and re-test. Hope that helps.
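After the reboot it may be worth confirming that the change took effect, and knowing how to undo it. The sketch below wraps standard esxcli commands in two helper functions whose names are invented for illustration; the grep patterns assume the VIB name contains "megaraid", as scsi-megaraid-perc9 does.

```shell
# Hypothetical helper: check which driver is in use after the reboot.
check_perc_driver() {
  # The perc9 VIB should appear in the installed list
  esxcli software vib list | grep -i 'megaraid'
  # lsi_mr3 should now be shown as not enabled
  esxcli system module list | grep -i 'lsi_mr3'
  # The Driver column shows which driver has claimed the controller
  esxcli storage core adapter list
}

# Hypothetical helper: re-enable lsi_mr3 if you need to back the change out.
# A reboot is needed afterwards, as with the original change.
restore_lsi_mr3() {
  esxcli system module set --enabled=true --module=lsi_mr3
}
```

Run `check_perc_driver` over SSH on the host. If the adapter list still shows lsi_mr3 against the PERC, the module was probably not disabled before the reboot.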
