Cryptz
Enthusiast

ESXi 6.5 Slow vms, High "average response time"

I am running ESXi 6.5 with the latest patches and VMware Tools 10.1.5.

I am having very inconsistent performance issues with both of my hosts. The Windows 2016/Windows 10 guests are sluggish at times: nothing loads, and the OS is basically unresponsive when interacting with the GUI. The issue seems to stem from disk performance, but I am not 100% certain that this is the cause; it may be a side effect.

What I have noticed is that some VMs show an average disk response time of about 2000 ms, yet if I check the performance monitor at the host level, the disks and datastores all show sub-1 ms response times. I am not able to explain the inconsistency.

I have a local SSD datastore on each host as well as a rather fast NVMe iSCSI SAN that is connected via 100Gb Mellanox ConnectX-4 cards. I see the issue with both hosts and both datastores. The issue seems to be worse now with the most recent patches and VMware Tools drivers. I am using VMXNET3 network cards and paravirtual SCSI controllers on all VMs.

I have run disk benchmarks on the VMs and the results vary. I have already seen a case where I run a disk benchmark on a guest and get horrible results, vMotion it to the other host where benchmarks against the SAN are fine, then vMotion the guest back to the original host and the results are fine the second time I run it.

Here is an example of a bad test; the reads are terrible:

-----------------------------------------------------------------------

CrystalDiskMark 5.2.0 x64 (C) 2007-2016 hiyohiyo

                           Crystal Dew World : http://crystalmark.info/

-----------------------------------------------------------------------

* MB/s = 1,000,000 bytes/s [SATA/600 = 600,000,000 bytes/s]

* KB = 1000 bytes, KiB = 1024 bytes

   Sequential Read (Q= 32,T= 2) :     0.655 MB/s

  Sequential Write (Q= 32,T= 2) :  5384.173 MB/s

  Random Read 4KiB (Q= 32,T= 2) :     0.026 MB/s [     6.3 IOPS]

Random Write 4KiB (Q= 32,T= 2) :   617.822 MB/s [150835.4 IOPS]

         Sequential Read (T= 1) :     2.306 MB/s

        Sequential Write (T= 1) :  1907.004 MB/s

   Random Read 4KiB (Q= 1,T= 1) :    53.942 MB/s [ 13169.4 IOPS]

  Random Write 4KiB (Q= 1,T= 1) :    52.104 MB/s [ 12720.7 IOPS]

  Test : 50 MiB [C: 5.2% (15.6/299.5 GiB)] (x1)  [Interval=5 sec]

  Date : 2017/03/25 20:29:18

    OS : Windows 10 Enterprise [10.0 Build 14393] (x64)

 

A few seconds later on the same setup I get perfectly fine results:

-----------------------------------------------------------------------

CrystalDiskMark 5.2.0 x64 (C) 2007-2016 hiyohiyo

                           Crystal Dew World : http://crystalmark.info/

-----------------------------------------------------------------------

* MB/s = 1,000,000 bytes/s [SATA/600 = 600,000,000 bytes/s]

* KB = 1000 bytes, KiB = 1024 bytes

   Sequential Read (Q= 32,T= 2) :  6655.386 MB/s

  Sequential Write (Q= 32,T= 2) :  5654.851 MB/s

  Random Read 4KiB (Q= 32,T= 2) :   695.193 MB/s [169724.9 IOPS]

Random Write 4KiB (Q= 32,T= 2) :   609.216 MB/s [148734.4 IOPS]

         Sequential Read (T= 1) :  1810.393 MB/s

        Sequential Write (T= 1) :  1626.112 MB/s

   Random Read 4KiB (Q= 1,T= 1) :    53.266 MB/s [ 13004.4 IOPS]

  Random Write 4KiB (Q= 1,T= 1) :    54.289 MB/s [ 13254.2 IOPS]

  Test : 50 MiB [C: 5.2% (15.7/299.5 GiB)] (x1)  [Interval=5 sec]

  Date : 2017/03/25 20:32:21

    OS : Windows 10 Enterprise [10.0 Build 14393] (x64)
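
For anyone comparing runs like the two above, here is a minimal Python sketch (not from the original poster) that parses CrystalDiskMark 5 result lines and lists the tests where one run is far slower than the other. The function names `parse_results` and `slowdowns` and the 10x threshold are assumptions made for illustration; the numbers are copied from the two reports above.

```python
import re

# Matches CrystalDiskMark 5 result lines such as
# "Sequential Read (Q= 32,T= 2) :  6655.386 MB/s"
LINE_RE = re.compile(r"^\s*(?P<name>.+?\))\s*:\s*(?P<mbps>[\d.]+)\s*MB/s")

BAD_RUN = """
   Sequential Read (Q= 32,T= 2) :     0.655 MB/s
  Sequential Write (Q= 32,T= 2) :  5384.173 MB/s
  Random Read 4KiB (Q= 32,T= 2) :     0.026 MB/s [     6.3 IOPS]
 Random Write 4KiB (Q= 32,T= 2) :   617.822 MB/s [150835.4 IOPS]
         Sequential Read (T= 1) :     2.306 MB/s
        Sequential Write (T= 1) :  1907.004 MB/s
   Random Read 4KiB (Q= 1,T= 1) :    53.942 MB/s [ 13169.4 IOPS]
  Random Write 4KiB (Q= 1,T= 1) :    52.104 MB/s [ 12720.7 IOPS]
"""

GOOD_RUN = """
   Sequential Read (Q= 32,T= 2) :  6655.386 MB/s
  Sequential Write (Q= 32,T= 2) :  5654.851 MB/s
  Random Read 4KiB (Q= 32,T= 2) :   695.193 MB/s [169724.9 IOPS]
 Random Write 4KiB (Q= 32,T= 2) :   609.216 MB/s [148734.4 IOPS]
         Sequential Read (T= 1) :  1810.393 MB/s
        Sequential Write (T= 1) :  1626.112 MB/s
   Random Read 4KiB (Q= 1,T= 1) :    53.266 MB/s [ 13004.4 IOPS]
  Random Write 4KiB (Q= 1,T= 1) :    54.289 MB/s [ 13254.2 IOPS]
"""


def parse_results(text):
    """Return {test name: MB/s} for every result line in a report."""
    results = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            # Collapse whitespace so test names compare equal across runs.
            name = " ".join(match.group("name").split())
            results[name] = float(match.group("mbps"))
    return results


def slowdowns(bad, good, factor=10.0):
    """Return {test name: ratio} where bad is at least `factor` times slower."""
    return {
        name: good[name] / bad[name]
        for name in bad
        if name in good and bad[name] > 0 and good[name] / bad[name] >= factor
    }


if __name__ == "__main__":
    flagged = slowdowns(parse_results(BAD_RUN), parse_results(GOOD_RUN))
    for name, ratio in flagged.items():
        print(f"{name}: about {ratio:.0f}x slower in the bad run")
```

On these two runs, both sequential read tests and the Q=32 random read are flagged, while the writes and the Q=1 random tests are roughly the same in both runs. The 4KiB figures are also internally consistent: 150835.4 IOPS x 4096 bytes is about 617.8 MB/s, which matches the reported random write throughput.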

115 Replies
psmith
Contributor

We've tested the change @galbitz_cv suggested on two of our clusters (one connected via iSCSI, the other via NFS), and it appears to work around the problems for us. Hosts are now patched to current levels, and guests keep a normal transfer speed and latency.

Thanks!

galbitz_cv
Contributor

Cool, can anyone else confirm? Just curious, did you ever have any Linux machines affected by this?

jaylik
Contributor

Linux machines are not affected by this bug, but (I don't know if you'd use discards in mtab)...

191201 – Randomly freezes due to VMXNET3

sessionx
Enthusiast

Interesting fix; the registry change looks like it disables TRIM support.

How will this affect Optimize and Defrag to reclaim space when it comes to Thin Provisioned disks?  Is there some compatibility issue between thin provisioning and the OS TRIM features?

I also found a blog that recommends setting the same registry key to 1 to resolve performance issues with Windows Storage Spaces:

https://infratechy.co.uk/2014/03/30/windows-server-2012-configure-local-storage/

Has VMware indicated if this is a bug that will be fixed?  Is the registry tweak simply a workaround for now?

galbitz_cv
Contributor

My take is that VMware asked me to set the registry key as a way to narrow the problem down. They suspected the issue was a result of the unmap changes made in the patches, but they were not sure. The key is really just a way to confirm that this feature is what is causing the issue. I am not sure they are close to a fix; every conversation I have had indicates they cannot reproduce this issue, despite the rather large number of people reporting it in this thread.

sessionx
Enthusiast

psmith - have you noticed an IOPS decrease on your back end storage?  Our biggest issue right now is we have a lot of IOPS showing up on our SAN but nowhere else in any other metrics to match it, so the response time is bad and we're getting timeouts.  We're going to try this registry fix to see if it helps and post our results.

llacas
Contributor

We also experienced the performance problems on the VMs that are on our all-flash Compellent, and the fix seems to have resolved the issue. I applied the registry fix on 10 VMs, all Windows 2012 R2, and so far so good.

sessionx
Enthusiast

I have the same results; making the registry change brings latency from 4000-7000 ms in Performance Monitor back down to normal.

What is unusual is that the VMware metrics and vRealize don't show this latency; everything looks fine there.

psmith
Contributor

@sessionx - No, no impact on IOPS on the backend.

M_Wingenfeld
Contributor

Same issue in a completely new ESXi 6.5 environment with two Dell R630 servers running the latest Dell image (build 6.5 5310538), Dell PowerVault storage, and Windows Server 2016 VMs. VMware storage performance showed latency ranging from 0-10 ms, but the server VMs had weird hangs and wait times while CPU and memory utilization was low. Resource Monitor showed 300-500 ms latency on storage. A P2V Server 2008 R2 VM does not have this issue. After setting the registry value "HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem\DisableDeleteNotification" to 1, the "issues" are gone.

I will now open a support ticket with VMware.
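
For readers who want to apply the same value to several guests, here is a minimal Python sketch that generates a .reg file for the registry value quoted above. It is an illustration only, not something VMware or the posters supplied: the function name `build_reg_file` and the output file name are hypothetical, and the file still has to be imported on each Windows guest (for example with regedit) and, going by the replies in this thread, followed by a reboot.

```python
# Registry location and value name as quoted in the thread.
KEY_PATH = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem"
VALUE_NAME = "DisableDeleteNotification"


def build_reg_file(value):
    """Return .reg file text that sets DisableDeleteNotification to 0 or 1."""
    if value not in (0, 1):
        raise ValueError("value must be 0 or 1")
    return (
        "Windows Registry Editor Version 5.00\r\n"
        "\r\n"
        f"[{KEY_PATH}]\r\n"
        # REG_DWORD values are written as eight hexadecimal digits.
        f'"{VALUE_NAME}"=dword:{value:08x}\r\n'
    )


if __name__ == "__main__":
    # Hypothetical file name; 1 is the workaround value used in this thread.
    with open("disable_delete_notification.reg", "w", newline="") as handle:
        handle.write(build_reg_file(1))
```

Calling `build_reg_file(0)` produces the file that puts the value back to 0, which is what one poster in this thread found already set on their server before making the change.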

GraintecKEC
Contributor

I am having similar issues.

HP 380 Gen9

32GB RAM

5 x 600GB 10k SAS discs

I have installed the latest version of VMware ESXi 6.5 on the server and created a VM with Windows 2016.

Everything is running very slowly. The server is working, but the response to everything is slow: just opening Group Policy Management takes 5-10 seconds, and I am only able to copy 10 MB/sec to the server.

I tried changing the registry setting mentioned here, with no effect at all.

Has anyone found a solution that does not require me to reinstall 6.5 completely?

Kind regards Kenn

galbitz_cv
Contributor

I would double check your scenario. The registry setting has fixed all occurrences of this so far. I would suspect you are running into something else, assuming you rebooted after making the registry change.

GraintecKEC
Contributor

I did not add the registry key, as it already existed. It was, however, set to 0, so I changed it to 1 and rebooted.

I don't feel that anything changed; everything on the server is still running like it is stuck in glue.

Could it be something with my version of VMware? Should I attempt to upgrade it to the latest build?

kasperbj
Contributor

Hi all. I have now been working with the VMware support team for the last 3 months, and for the last week I have been testing a bugfix patch that eliminated the problems.

In a couple of weeks they will release 6.5 U1, where the fix is included. Here's a short statement from VMware:

VMware Engineering have Root Caused this issue as to how the unmap (Block Space Reclamation) was working on 6.5.
The fix for this has been completed and will be included in the next ESXi release 6.5 U1. This release is currently on schedule to be available next month.

kman10
Contributor

Thanks for the update. You got much better traction with your case.

AndreBusse
Contributor

Same problem here.

New Environment for a customer.

2x Fujitsu RX2540M2 servers with internal HDDs and SSDs on LSI MegaRAID.

Changing the registry setting solves the problem.

Fujitsu-VMvisor-Installer-6.5-5146846-v401-1 (Fujitsu)

agervasoigatech
Contributor

Hi, I've experienced the same issue on VNX 5200 storage and UCS blades, so I had to roll back the ESXi updates.

It looks like the solution is around the corner, so I think I'll wait for the 6.5 U1 upgrade later in July.

Thanks everyone for sharing your experience with us.

kasperbj
Contributor

Hi all

I just got an update from VMware, and it looks like a Hot Patch will be sent out today along with a KB 😉

The wait is almost over.

From the VMware engineering team:

The Hot Patch is still being finalised, so almost complete.  We are also finishing a KB which will be published publicly once complete which will outline this behaviour and advise of the Fix and the workaround.

sessionx
Enthusiast

Can you find out if we need to revert the registry workaround after the hotfix is applied?  What is the impact if we don't put the registry setting back?

weijoh
Contributor

Hi kasperbj,

Since you have already received the hot fix and it is still not published, would you be so kind as to send me that patch?

Just contact me via PM. Sorry, I wasn't able to message you myself: a red bar just showed up without any error message in three different browsers.

And of course I do have some questions:

Did you already install that hot fix?

If yes, did you notice any improvements?

Last but not least, do I have to revert the registry workaround?

Thanks!
