Enthusiast

ESXi 6.5 Slow vms, High "average response time"

I am running ESXi 6.5 with the latest patches and VMware Tools 10.1.5.

I am having very inconsistent performance issues on both of my hosts. Basically, the Windows Server 2016 and Windows 10 guests are sluggish at times: nothing will load, and the OS is basically unresponsive when interacting with the GUI. The issue seems to stem from disk performance, but I am not 100% certain that this is the cause; it may be a side effect.

What I have noticed is that some VMs show an average disk response time of about 2,000 ms. Yet if I check the performance monitor at the host level, the disks and datastores all show sub-1 ms response times. I am not able to explain that inconsistency.

I have a local SSD datastore on each host, as well as a rather fast NVMe iSCSI SAN connected via 100 Gb Mellanox ConnectX-4 cards. I see the issue with both hosts and both datastores. The issue seems to be worse with the most recent patches and VMware Tools drivers. I am using VMXNET3 network adapters and Paravirtual SCSI (PVSCSI) controllers on all VMs.

I have run disk benchmarks on the VMs and the results vary. I have already seen a case where I run a disk benchmark on a guest and get horrible results, vMotion it to the other host where benchmarks to the SAN are fine, and then vMotion the guest back to the original host, where the results are fine the second time I run the test.

Here is an example of a bad test; the reads are terrible:

-----------------------------------------------------------------------

CrystalDiskMark 5.2.0 x64 (C) 2007-2016 hiyohiyo

                           Crystal Dew World : http://crystalmark.info/

-----------------------------------------------------------------------

* MB/s = 1,000,000 bytes/s [SATA/600 = 600,000,000 bytes/s]

* KB = 1000 bytes, KiB = 1024 bytes

   Sequential Read (Q= 32,T= 2) :     0.655 MB/s

  Sequential Write (Q= 32,T= 2) :  5384.173 MB/s

  Random Read 4KiB (Q= 32,T= 2) :     0.026 MB/s [     6.3 IOPS]

Random Write 4KiB (Q= 32,T= 2) :   617.822 MB/s [150835.4 IOPS]

         Sequential Read (T= 1) :     2.306 MB/s

        Sequential Write (T= 1) :  1907.004 MB/s

   Random Read 4KiB (Q= 1,T= 1) :    53.942 MB/s [ 13169.4 IOPS]

  Random Write 4KiB (Q= 1,T= 1) :    52.104 MB/s [ 12720.7 IOPS]

  Test : 50 MiB [C: 5.2% (15.6/299.5 GiB)] (x1)  [Interval=5 sec]

  Date : 2017/03/25 20:29:18

    OS : Windows 10 Enterprise [10.0 Build 14393] (x64)

 

A few seconds later, on the same setup, I get perfectly fine results:

-----------------------------------------------------------------------

CrystalDiskMark 5.2.0 x64 (C) 2007-2016 hiyohiyo

                           Crystal Dew World : http://crystalmark.info/

-----------------------------------------------------------------------

* MB/s = 1,000,000 bytes/s [SATA/600 = 600,000,000 bytes/s]

* KB = 1000 bytes, KiB = 1024 bytes

   Sequential Read (Q= 32,T= 2) :  6655.386 MB/s

  Sequential Write (Q= 32,T= 2) :  5654.851 MB/s

  Random Read 4KiB (Q= 32,T= 2) :   695.193 MB/s [169724.9 IOPS]

Random Write 4KiB (Q= 32,T= 2) :   609.216 MB/s [148734.4 IOPS]

         Sequential Read (T= 1) :  1810.393 MB/s

        Sequential Write (T= 1) :  1626.112 MB/s

   Random Read 4KiB (Q= 1,T= 1) :    53.266 MB/s [ 13004.4 IOPS]

  Random Write 4KiB (Q= 1,T= 1) :    54.289 MB/s [ 13254.2 IOPS]

  Test : 50 MiB [C: 5.2% (15.7/299.5 GiB)] (x1)  [Interval=5 sec]

  Date : 2017/03/25 20:32:21

    OS : Windows 10 Enterprise [10.0 Build 14393] (x64)

115 Replies
Enthusiast

Hi Galbitz,

Thanks for that info.

What is your SR number? I will advise the support person I am dealing with about it.

Odd that you mentioned the below changes. I was previously having issues shrinking a thin provisioned VMDK in 6.0, and since I updated to 6.5, the VMDKs seem to have shrunk themselves somehow (I am assuming it is to do with the changes with the UNMAP commands/processes).

Oddly, this feature is only supposed to work with VMFS 6, and my datastores are VMFS 5, so I am not sure how that has managed to work.

Perhaps in my case it is something to do with unmapping, who knows.

James

Enthusiast

Hello everyone

I'm experiencing the same problem in my test environment.

The issue goes away if I relocate the affected VMs from our EqualLogic SAN to local disks on a Dell PERC controller.

Perhaps the issue is with the software iSCSI adapter? That's what I'm using.

I'm using several versions of 6.5 and they all have the same issue.

6.5 builds 4887370, 5224529 and 5310538

MBR or EFI

Tools 10.1.0 or 10.1.7

Version 11 or 13 hardware

Only happens to Windows 10 and Server 2016 VMs

Any of my hosts without a SAN work fine.

Contributor

Cory, my incident # is 17417329003.

Contributor

I was seeing this issue with a local datastore (Dell PERC H730 as well). A few other users in this thread have posted that it occurs on local datastores as well as on SAN storage. I am only running Server 2016 and Windows 10, so we are consistent there.

Enthusiast

Spoke too soon: while I did see better performance after moving to local disk, the performance issues came back after a short time.

We just created a ticket: 17475209405.

Enthusiast

Windows 10 version 1703 may not be affected by this. I have a few 1703 VMs and can't reproduce the performance issues on them.


Can anyone else confirm if 1703 works okay?

Contributor

Does anyone have a fix from VMware? The case I opened didn't go anywhere; the engineer kept saying it was a driver and hardware issue, and there were no escalations even after I told the engineer about this thread.

From my experience and everything here I don't agree with that.

My environment is fine only because I went back to the ESXi 6.5 base build (VMware ESXi 6.5.0, build 4887370). I have tried all of the newer patches, and the performance problem persists on them.

At this point I will not attempt a full scale upgrade. I have one host on the latest patches for testing.

Contributor

I am not aware of any solution; if we find one, I will definitely post back here, as I imagine others will. It likely has something to do with UNMAP support. Are you seeing this on Windows 10 and Server 2016 only?

Contributor

With any patch past the 6.5 base, I have been seeing this on Windows Server 2012 R2, Server 2016, and Windows 10.

Not only are VMs slow when running on the patched hosts, but we also use guest customization specs. Normally that whole process takes about 3 minutes; with the patched versions it takes more than 12 minutes.

Contributor

Same here,

  • Dell R730 with local 4xSSD@RAID10
  • Dell-ESXi-6.5.0-5310538-A03
  • Fully patched Windows 2016 guests with the most recent VMware Tools (and PVSCSI 1.3.8)
  • Thin provisioned disks
  • iovDisableIR @ FALSE

...and disk I/O is extremely slow most of the time. With the LSI Logic SAS controller I could install and patch Windows, but with PVSCSI it timed out during Windows Update installation. I have now cloned a test VM with thick-provisioned disks, and so far everything seems to be running smoothly.

Enthusiast

Did you use thick lazy-zeroed or eager-zeroed?

Contributor

Eager

Enthusiast

I'm not sure if this helps, but a couple of the VMs with issues are now responding properly. Somehow the VMXNET3 network adapters had DirectPath I/O turned on, even though we don't use that feature. After turning it off and rebooting a couple of our trouble VMs, they are working okay now. We are going to audit all our VMs to see whether this is a global issue.

Can anyone else review their systems and see if this helps them?


We don't have SR-IOV enabled in the BIOS of our PowerEdge servers.

There is a KB article about this; I hope they fix this bug: "DirectPath I/O option is enabled automatically when a virtual machine with VMXNET3 is created using ..."

It seems almost all of my VMs somehow got this setting turned on, even though we did not turn it on ourselves.

See this link for a PowerCLI script to disable it: https://virtualnomadblog.com/2016/11/25/vsphere-6-0-vmxnet3-and-directpath-io-issue/
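Not the script from the blog above, just a rough PowerCLI sketch of the kind of audit we're running. The property names come from the vSphere API (VirtualVmxnet3 / UptCompatibilityEnabled), so treat them as an assumption and test against one VM first:

```powershell
# Hypothetical audit sketch: list VMXNET3 adapters with DirectPath I/O (UPT) enabled.
# Assumes an existing Connect-VIServer session; verify the property names
# against your PowerCLI version before relying on the output.
Get-VM | Get-NetworkAdapter |
    Where-Object { $_.ExtensionData -is [VMware.Vim.VirtualVmxnet3] -and
                   $_.ExtensionData.UptCompatibilityEnabled } |
    Select-Object @{N = 'VM'; E = { $_.Parent.Name }}, Name
```

Running it read-only like this lets you see how widespread the setting is before disabling anything.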

Enthusiast

One more small update. A few of the non-responsive systems were thin provisioned, so we ran Windows defrag/optimize to TRIM the storage, and that helped get them back to working properly and responding well.
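For anyone who wants to script that TRIM pass inside the guests instead of clicking through the Optimize Drives GUI, the standard Storage-module cmdlet (Windows 8/Server 2012 and later) should do the same thing; run it elevated and try it on one VM first:

```powershell
# Re-send TRIM/UNMAP for all free space on the C: volume of a thin-provisioned guest.
# -ReTrim asks the filesystem to re-issue trim for already-free clusters.
Optimize-Volume -DriveLetter C -ReTrim -Verbose
```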

Enthusiast

I have just gotten off the phone with support, and they have confirmed to me that the engineering team is currently aware of an issue with the latest version of the PVSCSI driver.

Our entire estate of VMs uses the Paravirtual controller and has the latest version of the driver installed.

They suggested that the current workaround is to downgrade the driver (one of the first things I tried, as mentioned in one of my earlier replies), but then said that this doesn't always resolve the performance problems.

My support ticket has been tagged against this issue, and I will get an update once the dev team has a fix or releases a new build not affected by this problem.

Hopefully most of you also use the PVSCSI controller, so it will be a little comforting to know that they have now acknowledged the issue and are working on it.

James

Contributor

I’ve been experiencing the same issues and found this post.

From reading the other posts, it seems the issue is related to the VMware Paravirtual SCSI adapter.

As a test I changed the SCSI controller type from VMware Paravirtual to LSI Logic SAS. After this change, the high response times seem to be resolved: I have been copying data at over 600 Mbps for an hour, and response times are under 7 ms.

This seems to be an effective workaround in my case.
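In case it helps anyone doing this at scale, the controller swap can also be done in PowerCLI. This is only a sketch using the standard cmdlets (do it with the VM powered off, and make sure Windows has the LSI SAS driver present first, or the guest may fail to boot):

```powershell
# Sketch: switch a VM's SCSI controller from ParaVirtual to LSI Logic SAS.
# Power the VM off first, and confirm the guest OS has the LSI SAS driver
# installed before booting on the new controller type.
Get-VM -Name 'TestVM' | Get-ScsiController |
    Set-ScsiController -Type VirtualLsiLogicSAS
```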

Contributor

We have been having the issue ever since applying the first update (of three) to the HPE 6.5 base image, and we ARE using the LSI Logic SAS controller. We have been using it all along.

Also, we are running local storage: RAID 1 over two 10K disks. We have moved VMs to other similar datastores with no effect. We are on the newest datastore format, VMFS 6 (and virtual hardware 13).

Furthermore, we have run into the exact same issue on identical hardware (HPE ProLiant ML350 Gen9 with local storage and Windows Server 2016 Standard VMs), so why VMware would have any difficulty recreating it is a bit of a mystery to us.

We are running thin provisioned, so the idea earlier in this thread that thick provisioning may help still sounds viable. Hopefully others can verify.

In regards to "getting back" to the original build, if that is what some are contemplating, here are our findings (in case they help others): it is only possible to go one build backwards using Shift+R. So if you have only applied one update to your host(s), Shift+R is still an option for you. Otherwise, as far as we can tell, the only way back is to boot from the original 6.5 installation media and select "Install".

This operation should retain your datastores but may erase pretty much all other settings, such as your virtual switch configuration (we observed that our datastores remained, but have not checked what else may be overwritten). So, at least as far as the VMs themselves go, you would afterwards just need to add the VMs back into your inventory (by pointing to each VMX file). However, it is always wise to MAKE A BACKUP of your VMs first (perhaps just by exporting them), in case something goes wrong.

Br,

Hans

Contributor

I switched my VMs from VMware Paravirtual to LSI Logic SAS, and the VMs are responding normally again.

The behavior I observed was normal latency at boot; then, after a burst of high I/O, latency remained high (a constant 2,000-3,000 ms with 100% disk activity according to Task Manager).

VM:

Microsoft Windows Server 2016 (64-bit)

VM version 13

VMware tools 10.1.5 (5055683)

Hosts:

1 x HP ProLiant DL360 Gen9 (HP image updated to VMware ESXi, 6.5.0, 5310538)

2 x Supermicro SYS-6018U-TR4T+ (updated to VMware ESXi, 6.5.0, 5310538)

Storage:

Superserver 2028R-E1CR24L (12x1TB SSD RAID10, FreeNAS)

All storage via iSCSI over 10Gbit copper via a Netgear XS708E

Normal performance: around 1,200 MB/s; IOPS (4K, 64 threads): 135,000 read / 52,000 write

Contributor

Did you use thin- or thick-provisioned disks?

Contributor

I received this today from VMware. I do not currently have a host on the affected version (I had to downgrade), but I wanted to throw this out here in case anyone can test:

I have heard back from engineering and they are suggesting the following change to make inside the guest OS of one of the problematic VMs - see comments below.

"Since one suspicion here is the change that affected unaligned unmap behavior, one thing we might try is having the customer disable this and see if it makes a difference.

You disable it by setting a registry key inside the guest OS

Add "HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem\DisableDeleteNotification" and set it to 1 (I believe this would be a REG_DWORD value)."

Can you make the above change on one of the VMs, reboot it, and let me know if you see any improvement in performance?
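For what it's worth, I believe the same NTFS setting can be flipped without editing the registry by hand: the DisableDeleteNotification value corresponds to the fsutil "disabledeletenotify" behavior. Run from an elevated prompt inside the guest, and verify on a test VM first:

```powershell
# Disable TRIM/delete notifications (equivalent to setting
# DisableDeleteNotification = 1 in the registry, per the support note above).
fsutil behavior set disabledeletenotify 1

# Confirm the current value.
fsutil behavior query disabledeletenotify
```

A reboot afterwards, as support suggested, keeps the test consistent with their instructions.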