Cryptz
Enthusiast

ESXi 6.5 Slow vms, High "average response time"

I am running ESXi 6.5 with the latest patches and VMware Tools 10.1.5.

I am having very inconsistent performance issues on both of my hosts. Basically, the Windows 2016/Windows 10 guests are sluggish at times: nothing will load, and the OS is essentially unresponsive when interacting with the GUI. The issue seems to stem from disk performance, but I am not 100% certain that this is the cause; it may be a side effect.

What I have noticed is that some VMs show an average disk response time of about 2000ms, yet if I check the performance monitor at the host level, the disks and datastores all show sub-1ms response times. I am not able to explain that inconsistency.
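For anyone trying to compare the two sets of numbers, the host side can be cross-checked in esxtop from an SSH session. The keystrokes and counters below are the standard esxtop disk views; GAVG is roughly the latency the guest should be observing:

```shell
# From an SSH session to the host:
esxtop
# Then switch views:
#   d - disk adapter view
#   u - disk device view
#   v - virtual machine disk view
# Watch DAVG (device latency), KAVG (vmkernel latency) and
# GAVG (guest-observed latency, roughly DAVG + KAVG).
# A guest reporting ~2000ms should show a correspondingly high
# GAVG here if the delay is happening below the vSCSI layer.
```

If GAVG stays low while the guest reports huge latencies, the delay would have to be inside the guest or the vSCSI driver rather than the storage stack.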

I have a local SSD datastore on each host, as well as a rather fast NVMe iSCSI SAN connected via 100Gb Mellanox ConnectX-4 cards. I see the issue on both hosts and both datastores, and it seems worse with the most recent patches and VMware Tools drivers. I am using VMXNET3 network adapters and paravirtual SCSI controllers on all VMs.

I have run disk benchmarks in the VMs and the results vary. I have seen cases where I run a disk benchmark on a guest and get horrible results, vMotion it to the other host where benchmarks against the SAN are fine, and then vMotion the guest back to the original host, where the results are fine the second time I run the test.

Here is an example of a bad test; the reads are terrible:

-----------------------------------------------------------------------

CrystalDiskMark 5.2.0 x64 (C) 2007-2016 hiyohiyo

                           Crystal Dew World : http://crystalmark.info/

-----------------------------------------------------------------------

* MB/s = 1,000,000 bytes/s [SATA/600 = 600,000,000 bytes/s]

* KB = 1000 bytes, KiB = 1024 bytes

   Sequential Read (Q= 32,T= 2) :     0.655 MB/s

  Sequential Write (Q= 32,T= 2) :  5384.173 MB/s

  Random Read 4KiB (Q= 32,T= 2) :     0.026 MB/s [     6.3 IOPS]

Random Write 4KiB (Q= 32,T= 2) :   617.822 MB/s [150835.4 IOPS]

         Sequential Read (T= 1) :     2.306 MB/s

        Sequential Write (T= 1) :  1907.004 MB/s

   Random Read 4KiB (Q= 1,T= 1) :    53.942 MB/s [ 13169.4 IOPS]

  Random Write 4KiB (Q= 1,T= 1) :    52.104 MB/s [ 12720.7 IOPS]

  Test : 50 MiB [C: 5.2% (15.6/299.5 GiB)] (x1)  [Interval=5 sec]

  Date : 2017/03/25 20:29:18

    OS : Windows 10 Enterprise [10.0 Build 14393] (x64)

 

A few seconds later, on the same setup, I get perfectly fine results:

-----------------------------------------------------------------------

CrystalDiskMark 5.2.0 x64 (C) 2007-2016 hiyohiyo

                           Crystal Dew World : http://crystalmark.info/

-----------------------------------------------------------------------

* MB/s = 1,000,000 bytes/s [SATA/600 = 600,000,000 bytes/s]

* KB = 1000 bytes, KiB = 1024 bytes

   Sequential Read (Q= 32,T= 2) :  6655.386 MB/s

  Sequential Write (Q= 32,T= 2) :  5654.851 MB/s

  Random Read 4KiB (Q= 32,T= 2) :   695.193 MB/s [169724.9 IOPS]

Random Write 4KiB (Q= 32,T= 2) :   609.216 MB/s [148734.4 IOPS]

         Sequential Read (T= 1) :  1810.393 MB/s

        Sequential Write (T= 1) :  1626.112 MB/s

   Random Read 4KiB (Q= 1,T= 1) :    53.266 MB/s [ 13004.4 IOPS]

  Random Write 4KiB (Q= 1,T= 1) :    54.289 MB/s [ 13254.2 IOPS]

  Test : 50 MiB [C: 5.2% (15.7/299.5 GiB)] (x1)  [Interval=5 sec]

  Date : 2017/03/25 20:32:21

    OS : Windows 10 Enterprise [10.0 Build 14393] (x64)

115 Replies
vrod1992
Contributor

Did anyone fix this? I am having this issue on several machines, with different hardware and different types of datastore (SATA SSD, NVMe, iSCSI). Does anybody know if VMware is working on a fix?

galbitz_cv
Contributor

I have a PR open for this (not sure what it stands for, maybe product review, but it is an escalation of a service ticket). It's been about three weeks. I tried pinging my contact on Friday, but as of now I have not heard back.

vrod1992
Contributor

3 weeks? Holy moly, that's some time to wait. I am also thinking about opening a ticket about this. For me the issue comes on suddenly and the disk gets slow; moments later it's gone, but then it comes back again. Could this maybe be because of a Windows update? Just trying to air out some ideas here...

hyvokar
Enthusiast

Is there any response from VMware? Can someone confirm whether ESXi650-201704402-BG (KB2149715) is affected as well?

galbitz_cv
Contributor

My case is still open, and other than sending log files I haven't had any contact with support. They seem to be having problems duplicating the issue. A few weeks after the PR was opened (after case escalation), they came back and asked how I was testing for the latency, which I found funny since the VMs are basically unusable when the issue occurs. I do see high disk latency in Process Explorer, which is what I told them. I would suggest anyone with the issue open a case as well and provide logs, just so they have more data to resolve it.

Since my case is still open, I suspect any future update has the issue as well. It is very unlikely a fix would have been created for an issue they cannot even duplicate.

My ticket is 17417329003, you should probably reference this if you create your own.

hyvokar
Enthusiast

Thanks for the update, galbitz. I'm upgrading from 6.0 to 6.5, so I'll just stay on the earlier version for now.

galbitz_cv
Contributor

Anyone still running an affected version may want to try the following in an SSH session on the affected host:

esxcli system settings kernel set --setting=iovDisableIR -v FALSE

This is the solution proposed by VMware. I need to update a host to test it; I have not tried it yet.
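If anyone else tries it, the full sequence would look something like this (the list command shows the current value; the change only takes effect after a reboot):

```shell
# Show the current and default value of the kernel setting
esxcli system settings kernel list -o iovDisableIR
# Revert it to FALSE (the affected builds reportedly changed it to TRUE)
esxcli system settings kernel set --setting=iovDisableIR -v FALSE
# The new value is only picked up at boot
reboot
```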

psmith
Contributor

Just tested this on one of our Dell R710s that's affected.  Ran the command, then rebooted the host.  Doesn't appear to have made a difference, sadly.

For testing, we're running MyDefrag 4.3.1 and watching the response time column in Resource Monitor.  When running on an unpatched host, the Response Time stays below 5ms.  As soon as I vMotion that guest to a patched host, it's jumping up to 500ms.
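As an aside, the same guest-side check can be scripted instead of eyeballed in Resource Monitor; Windows' built-in typeperf can log the standard PerfMon disk-latency counter (values are in seconds, so 0.5 = 500ms):

```shell
# From a command prompt or PowerShell inside the guest:
# sample average disk latency once per second, 60 samples
typeperf "\PhysicalDisk(_Total)\Avg. Disk sec/Transfer" -si 1 -sc 60
```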

Bummer, was hoping this would fix things!

lwnetinsight
Contributor

I'm having exactly the same issues as you guys. My experience with a Windows Server 2016 VM is awful, really unusable.

I have also just tried the above command and rebooted a host and it makes no difference.

I am going to try rebuilding one of my hosts back to a base ESXi 6.5 install and see what the results are tomorrow, as all four hosts in my lab are at the latest 6.5 patch levels with the affected patches installed.

Is there a simple way to just remove the affected patches? I installed them via Update Manager.

If not I will just rebuild a host tomorrow.

CoryIT
Enthusiast

I had a quick read through this post, as I was also having some performance problems that "felt" disk-related on a brand-new VM.

For ref, the VM is hardware version 13, built on a 6.5 host running build 5310538

It was installed with VMware tools 10.1.5, and using a PVSCSI controller.

Not sure what controllers you guys are using, so this may not be it, but I found that this VM was using the 1.3.8 pvscsi driver, while an older, unaffected VM was on 1.3.4.

I manually rolled the VM back down to that pvscsi driver version, and so far it does seem better!

LuftHansiDK
Contributor

We're seeing the exact same thing on an HP ProLiant ML350 Gen9 (newest ProLiant firmware - 201704) running vSphere 6.5.0, Build 5310538 (upgraded to the latest VMware patches - ESXi650-201704001 - yesterday, no effect).

We're running three Windows Server 2016 VMs (also fully updated). I think we have the issue on all three, but one of the symptoms - low throughput - only appears to affect two of them.

On our VM1 we get up to around 120MB/sec throughput. On the two other VMs, we only get around 10-12MB/sec throughput max. Beyond that, the symptoms include disk busy time hitting 100% regularly, queues higher than 5, and latencies in the thousands of ms. But the symptoms only affect the VMs themselves - like everybody else here, our hardware status looks fine at the host level.

Therefore, to me it "smells like" an issue with the interface between the VM and the virtual disk controller. We are using the LSI Logic SAS SCSI controller - is everybody else running that too? We are also running VM hardware version 13.

For that reason I was wondering if anybody has checked whether other virtual SCSI adapters (like LSI Logic Parallel or BusLogic Parallel) have the same problem? Beware that it's probably a bad idea to just change the controller type on an existing VM (at least for the boot drive), as the drivers need to be installed in the OS and set up as boot drivers - I have not experimented with this myself, I just read about it.

I also tried uninstalling and reinstalling VMware Tools - no effect. This is a production environment, so getting a solution ASAP is critical.

LuftHansiDK
Contributor

Hi Cory,

That sounds to me like the exact cure I am looking for while waiting for VMware to do something about this serious problem. Does that solution still appear to have done the trick?

If so, could you please detail exactly how you "manually rolled the VM back down" to a previous version of the SCSI driver? We are using the LSI Logic SAS SCSI controller.

Br,

Hans

CoryIT
Enthusiast

Hi LuftHans,

Unfortunately this didn't stick.

I was also using the PVSCSI controller, and not one of the normal SAS controllers.

By downgrading, essentially all I did was mount an ISO from an older version of VMware Tools, then from Device Manager I browsed to the drivers folder on the mounted disc and told the controller to use the version of the driver in that folder.

I am hoping this is something VMware has picked up on and is working to resolve soon, although I have opened a support case on the matter as well, just to try and get it confirmed.

James

LuftHansiDK
Contributor

Hi James,

Thank you very much for the update.

What a bummer! So, the only two solutions as I see it right now are...

1) "Downgrade" to the original release of 6.5 (not sure how to do that easily - are you?)

or...

2) Wait for VMware to release a fix

Is that your take also?

Please let me know if and when you hear more from VMware on the matter - or if you find any other way to solve this. I'll do the same, of course.

I'm surprised that there isn't an overwhelming number of hits on Google for this subject yet, but it's a good bet they are coming, seeing as we are using standard hardware on the HCL together with mainstream software (Windows Server 2016).

Br,

Hans

Jeffno
Contributor

Hi All,

Same problem here.

DL360 G9 Latest HPe SPP

VMware ESXi 6.5

Lagging VM's

6x Server 2016

3x Server 2012r2

Latest VMware Tools. Latest HPE custom ISO (May).

At first we thought it was a problem with the application server, but then it spread to the other servers too.

Lagging responses and very slow throughput. Server 2008 R2 doesn't seem to be affected directly, only indirectly through the lagging application servers (2012 R2).

Will a reinstall with an earlier build do the trick?

Come on, VMware, help us out.

CoryIT
Enthusiast

Well, I can't say for certain whether a downgrade will help at all. In my case I can't even attempt it, as all my affected VMs are now HW version 13, which requires them to run on ESXi 6.5.

I could of course reinstall a host on an earlier build of 6.5, but that's a fair bit of work for potentially no change.

In my case I went straight from 6.0 to the current build of 6.5, so I am unsure if I would have had this problem on earlier builds of 6.5.

I am now engaged with VMware support, so we will find out soon I hope.

James

galbitz_cv
Contributor

James,

A downgrade/reinstall will fix the issue for you; the problem was introduced in the following patches:

ESXi650-201703401-BG - where the issue first appeared

ESXi650-201703410-SG - cumulative, still contains the problem. There have been other patches since then, but none seem to fix it.

galbitz_cv
Contributor

Cory, are you saying the pvscsi driver downgrade did not fix the problem for you? I am also running pvscsi.

Can you guys please post your ticket numbers here? I would like to give them to VMware. I opened this months ago and they are acting like they cannot reproduce the issue. I would think it would be beneficial to give them all of the other case numbers. Maybe they can figure out why it is affecting us and not everyone.

My ticket # is 17417329003.

CoryIT
Enthusiast

Hi Galbitz,


So do we know what changed in that original patch that could have had this effect? I will have a look at the release notes also to see if there's anything in there.

That's right, the pvscsi downgrade didn't seem to help. I thought it did initially, but since the latency/lag seems to come and go (I didn't know that to start with), I just hit a good patch after the downgrade.

The day after, it went back to being laggy again (I checked, and it still had the older driver, so it wasn't because of an automatic update or anything like that).

I get the feeling the drivers/VMware Tools are not at fault here, and there is some underlying issue with ESXi itself. Shame my VMs are HW version 13 now, or I could have moved one to a host on 6.0 to see if the issue followed it.

My incident number is: 17471595505

I am expecting a WebEx with them today so they can have a first-hand look. I intend to show them the VM heartbeat alerts we get during logon/logoff storms, as well as the general slowness within the VM.

Esxtop doesn't seem to show any high latency numbers, which is all they have asked me to look at so far.

James

galbitz_cv
Contributor

Let me know how you make out, and please share my support ticket with them as well. I have the exact same issue, and esxtop also shows nothing of interest. I have since downgraded to 6.5 GA and do not have issues. I do not know what changed with the mentioned updates, and obviously neither does VMware. They cited the info below, but it did not help:

From the VMware POC:

They have been going through all the changes made between the two builds (the build that is not affected and the one that is) and focusing on the changes that might be contributing to the issue.

This is the only change they suspect might be causing the issue so far

They set iovDisableIR to true on the affected build

Either that, or the "Handling unaligned unmap request from Guest OS" change.

So, would you be in a position to place an affected host into testing mode and run the following command to change a parameter?

esxcli system settings kernel set --setting=iovDisableIR -v FALSE


The above command can be run from a PuTTY session to the host.

You would then need to reboot the host and perform the performance test again and let me know the results.

I tried running esxcli system settings kernel set --setting=iovDisableIR -v FALSE and it did not resolve the issue.


