VMware Cloud Community
dessert_first
Contributor

Frequent Disk Latency Spikes - Dell R730xd - PERC H730p Mini - ESXi 6.7u3

Hello!  Long time listener, first time caller... I've been looking at one disk issue for a couple weeks now and I am still stumped.

 

(Edit: I figured choosing "Storage Discussion" would be a good place to start with this storage issue, but I wasn't aware it would place this under the "Storage Appliance" section)  

 

We're seeing frequent, almost timed/measured spikes in local disk latency on an R730xd running 6.7u3 standalone (no vCenter). I have included screenshots of the graphs showing the spikes, along with most of the disk- and controller-related info from the iDRAC, in the screenshot attachments.

 

I can’t say when it started.  The box was built in August 2022 with a fresh install of the Dell-customized ESXi image.  We only noticed these spikes about three weeks ago, in February 2023.  They may have been occurring since install, or they may be a more recent development.

 

Config: Dell R730xd - PERC H730p Mini - ESXi 6.7u3 - RAID 10 - 12x 600 GB SAS 15k RPM - Disks are all identical

 

PERC Firmware/Driver: 25.5.9.0001 / 7.719.02.00, cache policy Write-Back

 

Every firmware, driver, and BIOS version has been checked against the compatibility matrix by three of us here, and it all checks out as far as any of us can tell.  Everything is at the most recent supported version across the board.

 

Spikes are always either around 90ms or around 180ms, and seem to appear once or twice per minute in the host's vSphere web client (Monitor > Performance > Disk). Note that this is the only place where we are observing spikes.
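
To put a number on how "timed" the spikes are, something like the following could be run over exported latency samples. This is only a rough sketch: the sample values, the 50ms threshold, and the function name are all made up for illustration and are not taken from this host.

```python
def spike_intervals(samples, threshold_ms=50.0):
    """Find spike start times and the gaps between them.

    samples: list of (seconds, latency_ms) pairs in time order.
    threshold_ms: hypothetical cutoff separating a spike from normal latency.
    """
    spike_times = []
    in_spike = False
    for t, latency in samples:
        is_high = latency >= threshold_ms
        if is_high and not in_spike:
            spike_times.append(t)  # record only the first sample of each spike
        in_spike = is_high
    # Gap between each spike start and the next one
    gaps = [b - a for a, b in zip(spike_times, spike_times[1:])]
    return spike_times, gaps


# Hypothetical samples, 20 seconds apart: (seconds, latency in ms)
samples = [(0, 1.0), (20, 92.0), (40, 1.5), (60, 181.0),
           (80, 2.0), (100, 1.0), (120, 89.0)]
times, gaps = spike_intervals(samples)
print(times, gaps)
```

If the gaps come out nearly constant, that would point toward something scheduled (a polling or background task) rather than real workload.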

 

Spikes occur regardless of VM status - powered up, powered down, and even with ESX in maintenance mode.

 

With VMs powered on, when generating a ton of disk activity, we see spikes of about 8ms in the individual VMs' performance graphs. This seems accurate and acceptable.  We see the same numbers in Windows perfmon when generating a bunch of disk activity in that VM. The spikes in the host performance graphs do not seem to be affected by this load: the host graphs keep spiking to 90ms or 180ms at the same frequency.

 

I found an old thread from 2011 that did appear to show the exact same symptom in ESX 5, however no solution was provided in that thread: https://communities.vmware.com/t5/ESXi-Discussions/High-Latencies-on-idle-ESX5-Host/m-p/2647250

 

We don’t see this anywhere else.  We run a dozen other ESXi hosts, all R710s on 6.7u3, and all have been upgraded to this version over the years (though there might be one other 6.7u3 clean install in the mix). All have acceptable and accurate-appearing disk latency values in every graph across the board.

 

I am hoping that this is an easy diagnosis, or that maybe this is just a cosmetic bug of some sort, but I’m just not yet finding solid solutions or explanations that fit for this specific condition. 

 

Yesterday, chat support directed us to a form on vmware.com to reach out to sales and find out whether there are any pay-per-incident technical support options remaining for 6.7u3 (we foolishly didn’t extend support; not my decision, fwiw). Anyway, that was over 24 hours ago and we haven’t heard from them yet, so I figured I'd reach out here as well.

 

If you made it this far, thanks for reading, we welcome any thoughts or ideas, and we can provide any additional information requested. 

 

Edit: Added two files: latency-r710.txt and latency-r730.txt

These are about fifteen minutes of vscsiStats latency numbers from the R730xd in question, and from one of our R710s as a sort of "control" sample.  I have not needed to dive quite this deep before, so now I'm off to learn what these mean and whether I can learn anything more from them.
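
For comparing the two histograms, a small sketch like this might help. It assumes each histogram has already been reduced by hand to (bucket upper bound in microseconds, I/O count) pairs; the bucket values below are invented for illustration and are NOT the contents of the attached files, and the function names are hypothetical.

```python
def fraction_slower_than(histogram, limit_us):
    """Share of I/Os that fall in buckets whose upper bound exceeds limit_us."""
    total = sum(count for _, count in histogram)
    if total == 0:
        return 0.0
    slow = sum(count for upper, count in histogram if upper > limit_us)
    return slow / total


def percentile_bucket(histogram, pct):
    """Upper bound of the bucket that contains the pct-th percentile I/O."""
    total = sum(count for _, count in histogram)
    target = total * pct / 100.0
    running = 0
    for upper, count in sorted(histogram):
        running += count
        if running >= target:
            return upper
    return None  # empty histogram


# Hypothetical data: (bucket upper bound in microseconds, number of I/Os)
r710 = [(500, 900), (1000, 80), (5000, 15), (15000, 5), (100000, 0)]
r730 = [(500, 850), (1000, 90), (5000, 40), (15000, 10), (100000, 10)]

for name, hist in (("R710", r710), ("R730xd", r730)):
    print(name,
          "share over 15 ms:", fraction_slower_than(hist, 15000),
          "99th percentile bucket (us):", percentile_bucket(hist, 99))
```

A noticeably heavier tail on one host than on the control would suggest the latency is reaching real I/O rather than being a graphing artifact.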
