Cryptz
Enthusiast

ESXi 6.5 Slow vms, High "average response time"

I am running ESXi 6.5 with the latest patches and VMware Tools 10.1.5.

I am having very inconsistent performance issues with both of my hosts. Basically, the Windows 2016 / Windows 10 guests are sluggish at times: nothing loads and the OS is essentially unresponsive when interacting with the GUI. The issue seems to stem from disk performance, but I am not 100% certain that this is the cause; it may be a side effect.

What I have noticed is that some VMs show an average response time for the disk of about 2000 ms, yet if I check the performance monitor at the host level, the disks and datastores all show sub-1 ms response times. I am not able to explain that inconsistency.
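One way to narrow down where that latency is being added is to compare the guest-reported numbers against the host-side counters. In esxtop's disk views, DAVG is device latency, KAVG is time spent in the VMkernel, and GAVG is the total latency the guest sees. A sketch of how this could be captured on the host (the output path and sample counts are just examples):

```shell
# Interactive: in esxtop press 'd' (adapter), 'u' (device) or 'v' (per-VM disk),
# then compare DAVG (device), KAVG (VMkernel) and GAVG (guest-visible) latency.
esxtop

# Batch capture for later analysis: 30 samples, 2 seconds apart
esxtop -b -d 2 -n 30 > /tmp/esxtop-latency.csv
```

If GAVG is high while DAVG stays low, the extra time is being added above the physical device, which would match the host counters looking clean.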

I have a local SSD datastore on each host as well as a rather fast NVMe iSCSI SAN connected via 100 Gb Mellanox ConnectX-4 cards. I see the issue with both hosts and both datastores, and it seems to be worse with the most recent patches and VMware Tools drivers. I am using VMXNET3 network adapters and paravirtual SCSI controllers on all VMs.

I have run disk benchmarks on the VMs and the results vary. I have already seen it where I run a disk benchmark on a guest, get horrible results, vMotion it to the other host, where benchmarks to the SAN are fine, and then vMotion the guest back to the original host, where the results are fine the second time I run them.

Here is an example of a bad test; the reads are terrible:

-----------------------------------------------------------------------

CrystalDiskMark 5.2.0 x64 (C) 2007-2016 hiyohiyo

                           Crystal Dew World : http://crystalmark.info/

-----------------------------------------------------------------------

* MB/s = 1,000,000 bytes/s [SATA/600 = 600,000,000 bytes/s]

* KB = 1000 bytes, KiB = 1024 bytes

   Sequential Read (Q= 32,T= 2) :     0.655 MB/s

  Sequential Write (Q= 32,T= 2) :  5384.173 MB/s

  Random Read 4KiB (Q= 32,T= 2) :     0.026 MB/s [     6.3 IOPS]

Random Write 4KiB (Q= 32,T= 2) :   617.822 MB/s [150835.4 IOPS]

         Sequential Read (T= 1) :     2.306 MB/s

        Sequential Write (T= 1) :  1907.004 MB/s

   Random Read 4KiB (Q= 1,T= 1) :    53.942 MB/s [ 13169.4 IOPS]

  Random Write 4KiB (Q= 1,T= 1) :    52.104 MB/s [ 12720.7 IOPS]

  Test : 50 MiB [C: 5.2% (15.6/299.5 GiB)] (x1)  [Interval=5 sec]

  Date : 2017/03/25 20:29:18

    OS : Windows 10 Enterprise [10.0 Build 14393] (x64)

 

A few seconds later, on the same setup, I get perfectly fine results:

-----------------------------------------------------------------------

CrystalDiskMark 5.2.0 x64 (C) 2007-2016 hiyohiyo

                           Crystal Dew World : http://crystalmark.info/

-----------------------------------------------------------------------

* MB/s = 1,000,000 bytes/s [SATA/600 = 600,000,000 bytes/s]

* KB = 1000 bytes, KiB = 1024 bytes

   Sequential Read (Q= 32,T= 2) :  6655.386 MB/s

  Sequential Write (Q= 32,T= 2) :  5654.851 MB/s

  Random Read 4KiB (Q= 32,T= 2) :   695.193 MB/s [169724.9 IOPS]

Random Write 4KiB (Q= 32,T= 2) :   609.216 MB/s [148734.4 IOPS]

         Sequential Read (T= 1) :  1810.393 MB/s

        Sequential Write (T= 1) :  1626.112 MB/s

   Random Read 4KiB (Q= 1,T= 1) :    53.266 MB/s [ 13004.4 IOPS]

  Random Write 4KiB (Q= 1,T= 1) :    54.289 MB/s [ 13254.2 IOPS]

  Test : 50 MiB [C: 5.2% (15.7/299.5 GiB)] (x1)  [Interval=5 sec]

  Date : 2017/03/25 20:32:21

    OS : Windows 10 Enterprise [10.0 Build 14393] (x64)

115 Replies
MTomasko
Enthusiast

The KB says the issue is resolved in VMware ESXi 6.5 Update 1. I'm running VMware ESXi 6.5 Update 1, so I did not try the registry edit.

sessionx
Enthusiast

The registry fix worked for many others.  It's a quick test and the performance is night and day if it works.

andyandy806
Contributor

I had the same issue, but I'm running 6.0 U2 with a local RAID 10.

Finally, I downgraded the storage driver from lsi-mr3 back to megaraid-perc9, which solved the problem.

TomHowarth
Leadership

Have you run through the optimizations in the following doc?

Windows Server 2016 Performance Tuning Guidelines | Microsoft Docs

Tom Howarth VCP / VCAP / vExpert
VMware Communities User Moderator
Blog: http://www.planetvm.net
Contributing author on VMware vSphere and Virtual Infrastructure Security: Securing ESX and the Virtual Environment
Contributing author on VCP VMware Certified Professional on VSphere 4 Study Guide: Exam VCP-410
PGinhoux
Enthusiast

Hi,

I have a new Dell R630 installed with the latest 6.5 Update 1, but I experience bad response times on a first Windows 2012 R2 VM. It needs about 15 minutes after a reboot before I can log in.

I have applied the workaround (Performance issues on Windows virtual machine with hardware version 13 after upgrading to ESXi 6.5 (2150591)) with no change.

Any thoughts on what I could check or change?

vrod1992
Contributor

I just deployed a new ESXi 6.5 host as well, with the latest updates, build 6765664. I am seeing high disk latencies as well on a Win10 VM. I migrated it back to another host with build 5969303, and the problem disappeared.

The hosts connect to a Ubuntu VM with a ZFS pool hosted on a P3700 NVMe SSD (10 Gb NFS), where I do not see any I/O constraints either. It looks like the newest build has brought these issues back. :(

sessionx
Enthusiast

We're running the latest October update, build 6765664, and do not see any latency issues like we did prior to 6.5 Update 1. Your latency issue may be another issue entirely, as the original one affected only Microsoft Windows 2012, 2012 R2, 2016, 8.1, and 10; Linux was not affected by the original latency issue we experienced prior to 6.5 Update 1.

Have you opened a ticket with VMware?

galbitz_cv
Contributor

I do not believe he is stating that it is occurring on a Linux VM; he is stating that the underlying storage is a ZFS share hosted on a Linux box.

For what it is worth, I am not seeing the issue either. I am using an NVMe-SSD-based ZFS backend as well, though I am connecting via iSER/iSCSI. I did originally have the issue before the patch (this is actually my original thread). So there is likely something else going on; it could be something specific to NFS, as our setups seem similar on the surface.

sessionx
Enthusiast

In that case, the best way to know for sure whether it is the original issue is to follow the KB and make the registry change. That always fixed the problem for us, on dozens of hosts.

PGinhoux
Enthusiast

Hi,

Strange that you don't see this latency with build 5969303. That is the level on my ESXi hosts, and I have the problem with W2012 R2 VMs. The datastore is on an NFS pool, but I don't know what it is, as I don't manage this resource.

I just opened a ticket with VMware today and spent about two hours with their support engineer without a solution for the moment. However, there is a suspicion about the disks being thin provisioned. We tried to inflate the disk, but this action failed for some reason he will investigate on his side.

I'll keep you posted as soon as they come back to me tomorrow.

vrod1992
Contributor

Of course I cannot say for sure that the issue has returned, but it very much looks like it. I forgot to mention, however, that the P3700 ZFS storage is hosted in a Ubuntu VM on the "new" host. The smallest operations on the Win10 VM cause the latency inside the Windows OS to spike and the disk activity to go to 100% right away. On the Ubuntu VM I see almost no CPU load or I/O load (iostat -x 1) at all, maybe about 3% disk load.

My setup is this right now:

Host 1 (newest build) - IBM x3550 M4 (2x 2660 v2, 256 GB 1600 MHz DDR3), Ubuntu VM with 2 TB P3700 (passthrough) and 96 GB memory - accesses NFS within the vSwitch

Host 2 (6.5 U1 build) - Dell C6220 node (2x 2650, 128 GB 1066 MHz DDR3) - accesses the NFS over 10 GbE

Host 3 (6.5 U1 build) - Dell C6220 node (2x 2650, 64 GB 1066 MHz DDR3) - accesses the NFS over 10 GbE

When VM is on Host 1 = Disk problems

When VM is on Host 2 = No disk problems

I will try to upgrade Host 3 to the newest build and migrate the VM over there. If the issue then occurs again, it isn't the network at least. I've used VMs as NFS datastores for a long time and never had issues like this.

sessionx
Enthusiast

Hi everyone

My team informed me today that if we don't use the registry fix in the KB, we see poor disk I/O latency.

It seems that the patches have not fixed this issue.

The workaround here is still valid:  https://kb.vmware.com/s/article/2150591

PGinhoux
Enthusiast

Hi,

Some interesting information on my problem.

As a reminder, I have a cluster of 3 ESXi hosts (Dell R630) running DellEMC-ESXi-6.5U1-7388607-A07 (I recently updated the ESXi hosts with the Dell image).

The datastore is configured on an NFS cluster. Up to now I had no details about this NFS cluster, as it is handled by another team.

The problem was that my W2012 VM took about 20 minutes to reboot.

During a VM reboot, we observed that vmkernel.log was continuously filled with these messages:

2017-11-22T12:51:25.756Z cpu33:66453)WARNING: NFS: 4719: Short read for object b00f 60 e18c8c1a a6b03ef41 8000044b 0 62 87960d8a 8000044b 6000000000 4080500290 100000000 0 offset: 0xacaf7400 requested: 0x2b400 read: 0x10000
2017-11-22T12:51:26.006Z cpu33:66453)

I had opened a ticket with VMware, and they asked us for more detail on the NFS datastore. Eventually we got it: the NFS is on a NetApp cluster with the following settings for the vServer:

                                           Vserver: dvhp1nasnd1vb08-pri

                                General NFS Access: true

             RPC GSS Context Cache High Water Mark: 0

                              RPC GSS Context Idle: 0

                                            NFS v3: enabled

                                          NFS v4.0: disabled

                                      UDP Protocol: disabled

                                      TCP Protocol: enabled

                               Spin Authentication: disabled

                              Default Windows User: -

                       Enable NFSv3 EJUKEBOX error: true

Require All NFSv3 Reads to Return Read Attributes: false

Show Change in FSID as NFSv3 Clients Traverse Filesystems: enabled

Enable the Dropping of a Connection When an NFSv3 Request is Dropped: enabled

                Vserver NTFS Unix Security Options: use_export_policy

                     Vserver Change Ownership Mode: use_export_policy

                        NFS Response Trace Enabled: false

                    NFS Response Trigger (in secs): 60

                         UDP Maximum Transfer Size: 32768

                         TCP Maximum Transfer Size: 65536

                       NFSv3 TCP Maximum Read Size: 1048576

                      NFSv3 TCP Maximum Write Size: 65536

VMware support has suggested changing the NFSv3 TCP Maximum Read Size from the current value of 1048576 to 65536 (see https://library.netapp.com/ecmdocs/ECMP1196891/html/GUID-678ABF68-C888-4517-A51D-A98BD96CA851.html).
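For anyone needing to make the same change, it is done from the ONTAP CLI on the NetApp side. A sketch, assuming clustered ONTAP (the vserver name is taken from the output above; the exact option names for your release should be checked against the linked NetApp doc):

```shell
# Show the current NFSv3 transfer-size settings for the SVM
vserver nfs show -vserver dvhp1nasnd1vb08-pri -fields v3-tcp-max-read-size,v3-tcp-max-write-size

# Lower the NFSv3 TCP maximum read size to match the write size
# (advanced privilege may be required for this option)
set -privilege advanced
vserver nfs modify -vserver dvhp1nasnd1vb08-pri -v3-tcp-max-read-size 65536
```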

This change was made on the vServer, and now my Windows 2012 VM reboots in a few seconds. I did the same for a Windows 2016 VM with the same good result.

We still have read/write performance issues that we are working on... but the main issue is now fixed by changing the NFSv3 TCP Maximum Read Size value.

I hope this information can help.

Regards

Patrick

abeeftaco
Contributor

Had the same issue. Updated to ESXi 6.5 U2 and it seems to be fixed.

CraigD
Enthusiast

I am very pleased to have found this thread. I read every post, as we are experiencing the issue right now. What is interesting to me is that we upgraded to the Dell-customized ESXi 6.5 U1g (build 7967591) at the end of April (three months ago) but only updated VMware Tools about three weeks ago (July 14). I don't believe we saw these performance issues until we updated VMware Tools. Does this sound possible, or consistent with what you are seeing? I am hopeful that I can correct the problem by updating to U2 (fingers crossed that Dell has it available; I haven't checked yet). I literally JUST finished reading this thread.

I am trying to decide if I can take a host into maintenance mode, update it (if the U2 ISO is available), and potentially have my problem fixed immediately.  I appreciate any input you have.

EDIT: This article says the issue was corrected in 6.5 U1, which I am already running.

vesak78
Contributor

This thread was extremely helpful. Big thumbs up to everyone here. We also believe that this issue has not been 100% resolved by VMware yet.

We also saw 500+ ms I/O latencies at the guest OS level. This started after we had deleted about 1500 GB worth of files in the guest OS.

Our configuration:

  • ESXi, 6.7.0, 15160138
  • Windows Server 2012 R2
  • SQL Server 2016 Standard (24 cores, running 10-20 k SQL queries per second)
  • EMC Unity 480 XT AFA (having lots of free IO capacity)

We vMotioned the guest to another ESXi host (6.5.0, 7388607) and the problem immediately disappeared. Our theory is that space reclamation stopped or completed due to this transition, as the older ESXi handled space reclamation better. After vMotioning the guest back to the original ESXi host, the problem did not reappear.
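If space reclamation is indeed the culprit, a related knob on the host side is the automatic unmap configuration of a VMFS6 datastore, which can be inspected and toggled per datastore with esxcli. A sketch (the datastore label is a placeholder, and this only covers datastore-level unmap, not in-guest TRIM):

```shell
# Show the automatic space-reclamation (unmap) configuration for a VMFS6 datastore
esxcli storage vmfs reclaim config get -l Datastore01

# Temporarily disable automatic unmap on that datastore while investigating
# (re-enable later with --reclaim-priority=low)
esxcli storage vmfs reclaim config set -l Datastore01 --reclaim-priority=none
```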

To be safe, we performed the following configuration changes on the guest OS:

  • Get-ItemProperty -Path "HKLM:\System\CurrentControlSet\Control\FileSystem" -Name DisableDeleteNotification
  • fsutil behavior set DisableDeleteNotify 1
  • Get-ItemProperty -Path "HKLM:\System\CurrentControlSet\Control\FileSystem" -Name DisableDeleteNotification
  • Guest OS reboot
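For anyone applying the same change, a minimal command sequence from an elevated prompt inside the guest (standard Windows commands; a value of 1 means delete notifications are disabled):

```shell
:: Query the current TRIM/unmap delete-notification state on NTFS
fsutil behavior query DisableDeleteNotify

:: Disable delete notifications so the guest stops issuing unmaps to the datastore
fsutil behavior set DisableDeleteNotify 1

:: Cross-check via the registry value the Get-ItemProperty calls above read
reg query HKLM\System\CurrentControlSet\Control\FileSystem /v DisableDeleteNotification
```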

We also updated our ESXi host to the latest patch version (6.7.0, 15820472).

The I/O latency problem has not reoccurred since the above changes were made. Everything has been OK for a few days now.
