This is my first post.
We are seeing a storage issue with SQL Server. It is not a problem as such, but our SQL DBAs are showing me latency in the range of 4000 ms to 8000 ms.
On the storage side (we have all-flash storage), KAVG and GAVG show no latency. They are completely normal.
I then noticed we were not using PVSCSI, so I changed to PVSCSI. The latency has not dropped much: it dropped on day 1, but later returned to its previous level.
I also increased the PVSCSI queue depth inside the OS.
What else would you recommend?
Did I miss anything else?
Strange as it may sound, there is no latency at either the storage or the ESXTOP level.
Your settings will be dependent on your storage vendor, and even firmware levels. Your ESX version may also require some different settings.
We primarily use HPE (five 3PARs and one Primera) and are running vSphere 7u1 and ESX 7u1. For those we were told the following:
Set IOPS to 1 and Round Robin using a custom SATP ALUA rule.
Set Disk.QFullSampleSize = 32
Set Disk.QFullThreshold = 4
Set VMFS3.UseATSForHBOnVMFS5 = 0 (ATS HeartBeat)
Check your HBA firmware and drivers, not just ensuring they are up to date, but also the same amongst different hosts.
These are the main settings we needed to look at. They are unique to our environment, but they may give you some ideas.
Thank you so much for taking the time to reply to my message.
We are already following our storage vendor's best practices.
We have also set the queue depth at the ESXi and PVSCSI level to 254, following this article
If you have no latency at the vSCSI level (GAVG) but do see it in the guest / app, the most likely explanation is filter drivers. Did you exclude the DB files from whatever antivirus you have configured? You can check which filters are running with fltmc (output from my laptop, not a VM):
C:\Windows\system32>fltmc

Filter Name                     Num Instances    Altitude    Frame
------------------------------  -------------  ------------  -----
WdFilter                                9         328010        0
storqosflt                              0         244000        0
wcifs                                   0         189900        0
CldFlt                                  6         180451        0
FileCrypt                               0         141100        0
luafv                                   1         135000        0
npsvctrig                               1          46000        0
Wof                                     7          40700        0
FileInfo                                9          40500        0
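To go through that listing on several VMs without reading it by eye, something like the following Python sketch could help. It is only an illustration, not part of any Microsoft tooling: the function names `parse_fltmc` and `antivirus_filters` are made up here, and the altitude range used for the "FSFilter Anti-Virus" load order group (320000 to 329998) is taken from my reading of Microsoft's published altitude allocations, so check it against the current documentation. The sample text is the output shown above.

```python
import re

# Assumption: altitude range of the "FSFilter Anti-Virus" load order group,
# per Microsoft's published minifilter altitude allocations.
AV_ALTITUDE_RANGE = (320000, 329998)

# One data row of fltmc output: name, instance count, altitude, frame.
ROW = re.compile(r"^(\S+)\s+(\d+)\s+(\d+(?:\.\d+)?)\s+(\d+)$")

SAMPLE = """\
Filter Name                     Num Instances    Altitude    Frame
------------------------------  -------------  ------------  -----
WdFilter                                9         328010        0
storqosflt                              0         244000        0
wcifs                                   0         189900        0
CldFlt                                  6         180451        0
FileCrypt                               0         141100        0
luafv                                   1         135000        0
npsvctrig                               1          46000        0
Wof                                     7          40700        0
FileInfo                                9          40500        0
"""


def parse_fltmc(text):
    """Return one dict per filter row; header and separator lines are skipped."""
    filters = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            name, instances, altitude, frame = match.groups()
            filters.append({
                "name": name,
                "instances": int(instances),
                "altitude": float(altitude),
                "frame": int(frame),
            })
    return filters


def antivirus_filters(filters):
    """Names of filters in the anti-virus altitude range that are attached to a volume."""
    low, high = AV_ALTITUDE_RANGE
    return [f["name"] for f in filters
            if low <= f["altitude"] <= high and f["instances"] > 0]


if __name__ == "__main__":
    print(antivirus_filters(parse_fltmc(SAMPLE)))
```

On the laptop output above, the only attached filter in that range is WdFilter (Windows Defender). On a SQL Server VM, whatever shows up there is the first candidate for DB file exclusions.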
If you want to assess the actual impact, you need to record an ETL trace that looks at how much time is spent in each filter.
You need the WPT. The easiest way to install it is to download etwpackage.zip from https://github.com/google/UIforETW/releases on any client and run bin/UIforETW.exe. This will download and install the most up-to-date WPT on the local machine (to c:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\ , even on a 64-bit OS). You can then copy the WPT folder to the target VM.
While you are observing the latency, run:
xperf -on PROC_THREAD+LOADER+HARD_FAULTS+DPC+INTERRUPT+CSWITCH+PROFILE+FLT_IO_INIT+FLT_IO+FLT_FASTIO+FLT_IO_FAILURE+FILENAME -stackwalk profile+MiniFilterPreOpInit+MiniFilterPostOpInit -buffersize 1024 -minbuffers 1024
After ~30 seconds, stop and save:
xperf -stop -d descriptive_filename.etl
The analysis is well described elsewhere. The file itself collects far too much information about your machine to be shared in an online community.
TL;DR if there is no observed latency at the vSphere level, the issue is most likely in the guest OS
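The arithmetic behind that TL;DR can be sketched in a few lines of Python. In esxtop, GAVG is the sum of DAVG (device) and KAVG (kernel), so whatever the guest reports on top of GAVG has to come from inside the guest. The function name `split_latency` and the DAVG/KAVG numbers below are hypothetical and only there for illustration; the 4000 ms figure is the low end of what the DBAs reported.

```python
def split_latency(guest_ms, davg_ms, kavg_ms):
    """Split guest-observed latency into the part vSphere sees and the rest.

    GAVG = DAVG + KAVG is what esxtop reports at the vSCSI level; the
    remainder is latency added inside the guest (filter drivers, Storport
    queuing, the application itself).
    """
    gavg_ms = davg_ms + kavg_ms
    in_guest_ms = max(guest_ms - gavg_ms, 0.0)
    return {"gavg_ms": gavg_ms, "in_guest_ms": in_guest_ms}


if __name__ == "__main__":
    # Hypothetical esxtop values for an all-flash array; 4000 ms from the DBAs.
    print(split_latency(guest_ms=4000.0, davg_ms=0.5, kavg_ms=0.25))
```

With numbers like these, practically all of the latency is unaccounted for at the vSphere level, which is why the guest is the place to look.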
All I'm saying is that if you don't see queuing in the guest (PVSCSI controller / disk) and no latency at the vSCSI (~GAVG) level, you should concentrate your troubleshooting on the guest. Eliminating filter drivers is a good first step. You could of course also look at the latency observed by Storport directly first. There are many different approaches, but the guest OS / app is the most likely place where the additional latency is "hiding". I'm a big fan of the WPT, and the best guides are IMO written by Bruce Dawson. Start e.g. here: https://randomascii.wordpress.com/2012/11/21/the-lost-xperf-documentationdisk-usage/
Old but a good primer regardless: https://web.archive.org/web/20120603054022/http://blogs.technet.com/b/robertsmith/archive/2012/02/07...
There is also a Storport trace "display" tool if you prefer that: https://www.sqlservercentral.com/blogs/storport-reading-an-etl-trace