Repeated Datastore Disconnection Every 23 minutes

lansol · ‎11-13-2019

We are troubleshooting an issue where our VMs are hanging for 5-10 seconds every time our datastores disconnect and reconnect.

The load on the server is minimal with a little bit of I/O. The disconnections happen every 23 minutes whether there is a lot of IO or not.

There are three datastores. If all guest VMs are powered off except for one VM that has VMDK files on only ONE datastore, the controller still shows large I/O spikes, although it may not fully reset all datastores. During business hours, or at night during backups when there are constant read/write operations, the datastores always reset and come back. One of our databases is throwing application errors because of this.

Dell PowerEdge R540 (no cluster)

PERC H730P Adapter (Embedded)
Firmware: 25.5.6.0009 (Latest)
4 x 2TB 7.2k RAID 6
2 x 200GB SSD RAID 1
2 x 600GB 15k RAID 1
All volumes have Read-Ahead and Write-Back enabled

ESXi 6.7.0 Update 3 Build-14320388 (A00)

No snapshots in place
Driver version 7.708.07.00-3vmw (original driver)
Driver version 7.710.07.00-1OEM.670.0.0.8169922 (*)
VMFS3.UseATSForHBOnVMFS5 is set to default (1). We tried setting value to (0) with no improvement.

* This driver is supported only on Dell PowerEdge Servers R6525, C6525, R6515 and R7515

* After we contacted Dell, they recommended installing the above driver as a test, since its changelog indicated it addressed my symptoms. No improvement, however.

* SCGCQ02033302 Resolves issue in which non-RAID drives may not be listed during OS installation or in vSphere.

* SCGCQ02189085 Fixed an issue that could cause an IO timeout and controller reset under certain workloads.

As mentioned above, this happens EVERY 23 minutes.
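
For anyone wanting to confirm a fixed interval like this from their own logs, here is a minimal sketch in Python. It assumes you have already pulled the timestamps of the disconnect events out of your logs into a list of ISO 8601 strings; the sample timestamps, the function names, and the one-minute tolerance are all made up for illustration and are not taken from the logs in this thread.

```python
from datetime import datetime
from statistics import median


def intervals_minutes(timestamps):
    """Return the gaps, in minutes, between consecutive ISO 8601 timestamps."""
    times = sorted(datetime.fromisoformat(t) for t in timestamps)
    return [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]


def looks_periodic(timestamps, tolerance_minutes=1.0):
    """Return (is_periodic, median_gap_in_minutes).

    The events count as periodic when every gap lies within
    tolerance_minutes of the median gap. At least three events
    (two gaps) are needed to say anything.
    """
    gaps = intervals_minutes(timestamps)
    if len(gaps) < 2:
        return False, None
    mid = median(gaps)
    return all(abs(g - mid) <= tolerance_minutes for g in gaps), mid


# Hypothetical event times, invented for this example.
events = [
    "2019-11-13T09:00:10",
    "2019-11-13T09:23:12",
    "2019-11-13T09:46:09",
    "2019-11-13T10:09:11",
]

periodic, period = looks_periodic(events)
print(periodic, round(period))
```

A steady gap that does not change with I/O load points toward something running on a timer rather than a workload-triggered fault, which is why the interval is worth measuring.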

If anyone is able to offer assistance I would be forever grateful.

daphnissov · ‎11-14-2019

As a test, I would disable write-back cache and see if it makes any difference. Your overall latency may be higher, but maybe there's something in that controller's microcode that is generating this.

------------------
How to Ask for Help on Tech Forums
https://neonmirrors.net

lansol · ‎11-15-2019

Thanks for the reply. We didn't try changing the writeback policy.

Changing firmware and driver versions in the 2019 time frame didn't seem to make any difference.

Astoundingly, the solution was to downgrade the firmware of the iDRAC on our server to a version from December 2018. Apparently, the iDRAC controller polls the storage controller on a schedule and that was causing the datastore disconnections.

Dell supposedly will have an updated iDRAC firmware out in December 2019 that should fix this issue.

jezh · ‎04-17-2021

Hi lansol,

We have the same issue. Did you ever upgrade the iDRAC firmware, and if so, did the issue stay away? Or are you still on the 2018 version?

Thanks

e_espinel · ‎04-17-2021

Hello.
Your case is very unusual; the iDRAC is for management only.
Who updated the firmware, you or Dell support?

Integrated Remote Access Controller (iDRAC) Service Module is an optional lightweight software application that can be installed on Dell servers of the 12th generation or higher with iDRAC7. It complements the iDRAC interfaces: Graphical User Interface (GUI), Remote Access Controller Administration (RACADM), CLI and Web Services Management (WSMAN) with additional monitoring data.

Enrique Espinel
Senior Technical Support on IBM, Lenovo, Veeam Backup and VMware vSphere.
VSP-SV, VTSP-SV, VTSP-HCI, VTSP
Please mark my comment as Correct Answer or assign Kudos if my answer was helpful to you, Thank you.

jezh · ‎04-18-2021

Hi,

Our iDRAC firmware was the same version as the OP's. We had exactly the same issue, with the VMware logs showing datastore disconnects every 23 minutes. Yesterday we updated the firmware to the latest version, and now there are no disconnects. The same applies to the Windows logs: before, we had delayed writes and virtual machines freezing, which have now gone.

I'm thankful the OP mentioned 23 minutes in their post, otherwise I would never have found this thread.

Thanks

jezh · ‎04-19-2021

We still seem to have an issue, but it is a different error now, and it is still causing the same problem.

Device naa.Xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx performance has deteriorated. I/O latency increased from average value of 7354 microseconds to 235438 microseconds

Device naa.Xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx performance has improved. I/O latency reduced from 46157 microseconds to 14558 microseconds
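
To put those numbers in perspective, the first message is a jump from roughly 7.4 ms to 235 ms, about a 32-fold increase. Below is a small Python sketch that pulls the figures out of messages worded like the two above and converts them to milliseconds. The regular expression is written only against the two messages quoted in this post, so treat it as an assumption that other variants of the message will match; the function name is hypothetical.

```python
import re

# Pattern built from the two messages quoted above; other wordings may not match.
LATENCY_RE = re.compile(
    r"Device (?P<device>\S+) performance has (?P<state>deteriorated|improved)\. "
    r"I/O latency (?:increased|reduced) from (?:average value of )?"
    r"(?P<before>\d+) microseconds to (?P<after>\d+) microseconds"
)


def parse_latency_event(line):
    """Return a dict describing a latency message, or None if it does not match."""
    m = LATENCY_RE.search(line)
    if m is None:
        return None
    before_ms = int(m["before"]) / 1000  # microseconds -> milliseconds
    after_ms = int(m["after"]) / 1000
    return {
        "device": m["device"],
        "state": m["state"],
        "before_ms": before_ms,
        "after_ms": after_ms,
        "ratio": after_ms / before_ms,
    }


messages = [
    "Device naa.Xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx performance has deteriorated. "
    "I/O latency increased from average value of 7354 microseconds to 235438 microseconds",
    "Device naa.Xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx performance has improved. "
    "I/O latency reduced from 46157 microseconds to 14558 microseconds",
]

events = [parse_latency_event(msg) for msg in messages]
for event in events:
    print(event["state"], event["before_ms"], event["after_ms"], round(event["ratio"], 2))
```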

This appears to have been occurring since OpenManage was installed, but the error has been different since the iDRAC update 2 days ago.

Before, the dropouts lasted 3-5 seconds and caused bigger issues, but we still had to reboot VMs today.
