Cryptz
Enthusiast

ESXi 6.5: slow VMs, high "average response time"

I am running ESXi 6.5 with the latest patches and VMware Tools 10.1.5.

I am having very inconsistent performance issues on both of my hosts. The Windows Server 2016/Windows 10 guests are sluggish at times: nothing will load, and the OS is basically unresponsive when interacting with the GUI. The issue seems to stem from disk performance, but I am not 100% certain that this is the cause; it may be a side effect.

What I have noticed is that some VMs show an average response time for the disk of about 2000 ms. Yet if I check the performance monitor at the host level, the disks and datastores all show sub-1 ms response times. I am not able to explain that inconsistency.
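For anyone trying to reproduce this: the host-side numbers can be captured with esxtop in batch mode and inspected offline. A rough sketch (the file path and helper name are just examples; the esxtop capture itself only runs on the ESXi shell):

```shell
# Run on the ESXi shell: capture 30 two-second samples to a CSV.
#   esxtop -b -d 2 -n 30 > /tmp/esxtop.csv
#
# Batch output is one very wide CSV. The columns that matter for disk latency:
#   DAVG - device latency, KAVG - time queued in the vmkernel,
#   GAVG - DAVG + KAVG, i.e. what the guest actually sees.
# This helper lists which header columns carry per-command latency,
# so you can chart just those:
find_latency_cols() {
  head -n 1 "$1" | tr ',' '\n' | grep -n 'MilliSec/Command'
}
```

If the guest reports ~2000 ms while the device-level DAVG stays sub-1 ms, the gap should show up in KAVG/GAVG, which would point at queuing inside the hypervisor rather than at the datastore itself.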

I have a local SSD datastore on each host as well as a rather fast NVMe iSCSI SAN connected via 100 Gb Mellanox ConnectX-4 cards. I see the issue on both hosts and both datastores. The issue seems worse with the most recent patches and VMware Tools drivers. I am using VMXNET3 network adapters and paravirtual (PVSCSI) controllers on all VMs.

I have run disk benchmarks in the VMs and the results vary. I have seen cases where I run a disk benchmark in a guest and get horrible results, vMotion it to the other host, where benchmarks against the SAN are fine, then vMotion the guest back to the original host, and the results are fine the second time I run it.

Here is an example of a bad test; the reads are terrible:

-----------------------------------------------------------------------

CrystalDiskMark 5.2.0 x64 (C) 2007-2016 hiyohiyo

                           Crystal Dew World : http://crystalmark.info/

-----------------------------------------------------------------------

* MB/s = 1,000,000 bytes/s [SATA/600 = 600,000,000 bytes/s]

* KB = 1000 bytes, KiB = 1024 bytes

   Sequential Read (Q= 32,T= 2) :     0.655 MB/s

  Sequential Write (Q= 32,T= 2) :  5384.173 MB/s

  Random Read 4KiB (Q= 32,T= 2) :     0.026 MB/s [     6.3 IOPS]

  Random Write 4KiB (Q= 32,T= 2) :   617.822 MB/s [150835.4 IOPS]

         Sequential Read (T= 1) :     2.306 MB/s

        Sequential Write (T= 1) :  1907.004 MB/s

   Random Read 4KiB (Q= 1,T= 1) :    53.942 MB/s [ 13169.4 IOPS]

  Random Write 4KiB (Q= 1,T= 1) :    52.104 MB/s [ 12720.7 IOPS]

  Test : 50 MiB [C: 5.2% (15.6/299.5 GiB)] (x1)  [Interval=5 sec]

  Date : 2017/03/25 20:29:18

    OS : Windows 10 Enterprise [10.0 Build 14393] (x64)

 

A few seconds later, on the same setup, I get perfectly fine results:

-----------------------------------------------------------------------

CrystalDiskMark 5.2.0 x64 (C) 2007-2016 hiyohiyo

                           Crystal Dew World : http://crystalmark.info/

-----------------------------------------------------------------------

* MB/s = 1,000,000 bytes/s [SATA/600 = 600,000,000 bytes/s]

* KB = 1000 bytes, KiB = 1024 bytes

   Sequential Read (Q= 32,T= 2) :  6655.386 MB/s

  Sequential Write (Q= 32,T= 2) :  5654.851 MB/s

  Random Read 4KiB (Q= 32,T= 2) :   695.193 MB/s [169724.9 IOPS]

  Random Write 4KiB (Q= 32,T= 2) :   609.216 MB/s [148734.4 IOPS]

         Sequential Read (T= 1) :  1810.393 MB/s

        Sequential Write (T= 1) :  1626.112 MB/s

   Random Read 4KiB (Q= 1,T= 1) :    53.266 MB/s [ 13004.4 IOPS]

  Random Write 4KiB (Q= 1,T= 1) :    54.289 MB/s [ 13254.2 IOPS]

  Test : 50 MiB [C: 5.2% (15.7/299.5 GiB)] (x1)  [Interval=5 sec]

  Date : 2017/03/25 20:32:21

    OS : Windows 10 Enterprise [10.0 Build 14393] (x64)

115 Replies
Kassebasse
Contributor

Defragmentation can help with disk response time.

In some cases it makes a big, noticeable difference.

Cryptz
Enthusiast

I will give it a try, because I have tried just about everything else. The datastores are SSD-based, though, so the impact of fragmentation should be minimal; it certainly shouldn't randomly take me from 6500 MB/s to less than 1 MB/s. I have seen this on multiple datastores, and when I see the latency in the guest, I do not see it reported on the datastore itself. The CPUs are not overprovisioned, but I feel like there is some weird scheduling issue. I have opened a case and am waiting to be contacted.

galbitz_cv
Contributor

No change after defragging; the issue seems much bigger than an individual file.

kman10
Contributor

Curious, are you using 6.5 P01? After updating several of our hosts to P01, performance degraded drastically (provisioning, customization, guest software installs).

I have a case open, but so far I've gotten nothing beyond "upgrade your drivers." So I rolled back several hosts to 6.5 base, and performance is back to where it was pre-6.5 P01, which is about a 3x-5x difference.

The problem for us occurred on both iSCSI and local datastores, but support tells me that both drivers are the issue, even though they are only one release behind and multiple older versions are listed on the HCL.

We couldn't afford to suffer the performance hit while waiting for a fix, so we rolled back.

galbitz_cv
Contributor

Yes, P01, fully patched. What iSCSI SAN are you using?

kman10
Contributor

Pure Storage backend.

Cryptz
Enthusiast

What about the network cards on the host side initiating the iSCSI requests? Just curious how widespread this is. I am using Mellanox ConnectX-4 cards back to an SCST Linux host, over 100 Gb Ethernet; it is also all-flash.

They released a patch today; the notes didn't mention anything related other than a possible memory leak. Fingers crossed; if not, I will roll back/reinstall.

kman10
Contributor

We are using Intel 10 Gb NICs for iSCSI and a Dell PERC to drive the local SSDs. Slowness is experienced on both with P01.

Cryptz
Enthusiast

We have Intel 520 10 Gb adapters in the servers for the network side of things; they aren't used for storage.

A PERC H730 backs the local datastore, but we see this on servers that do not have the PERC card (or any local datastore, for that matter). Perhaps it's related to the Intel cards, though the network generally seems OK; it's just the storage.

It's been about 24 hours since I last heard back from VMware.

psmith
Contributor

Just wanted to say thank you for confirming a performance issue we saw as well! It's the first mention I've found anywhere about it, and I was beginning to think we were going nuts.

In our case, it affected both HP DL380 G8 and Dell R710 hosts with NFS, FC, and iSCSI storage. Disk access times for our guests went from 1-5 ms to 500 ms. We found the same fix in rolling back to a pre-3/8/17 patch level.

Has VMware been able to determine a usable fix?

kman10
Contributor

At least for me, no. Support is telling me to upgrade both the Intel NIC and Dell PERC drivers for ESXi, which I do not believe is the problem.

Cryptz
Enthusiast

Nothing yet; I am working with VMware and pointed them to this post. What is odd is that I do not see any performance metric out of whack when looking at esxtop. Something is just way off.

galbitz_cv
Contributor

Are you guys booting the servers in BIOS mode or UEFI?
galbitz_cv
Contributor

FYI, it seems

ESXi650-201703401-BG

is the patch causing the issue.

galbitz_cv
Contributor

ESXi650-201703410-SG is cumulative and also has the issue.
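For anyone trying to confirm whether a given host carries the suspect bulletin: the installed image and VIB versions can be dumped with esxcli. A rough sketch (the esxcli commands only run on the ESXi shell; the parsing helper and file path are just examples):

```shell
# On the host:
#   esxcli software profile get          # name/date of the installed image profile
#   esxcli software vib list > /tmp/vibs.txt
# The -BG bulletins roll new VIBs (esx-base among them), so comparing the
# esx-base version on a patched vs. an unpatched host shows which image
# you're actually on. "vib list" output is whitespace-separated columns:
# name, version, vendor, acceptance level, install date.
vib_version() {
  awk -v name="$2" '$1 == name { print $2 }' "$1"
}
# e.g.: vib_version /tmp/vibs.txt esx-base
```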

avlukashin
Contributor

Hi!

I have the same issue. Did you find any solution, or is reinstalling ESXi without the last updates the only possible way to resolve the problem?

galbitz_cv
Contributor

I have a PR open with VMware; after I provided them with all of the info, they started investigating. I know which two patches cause the issue (listed above), but so far I have only found that reverting (Shift+R at boot) fixes the problem.
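For anyone else reaching for the same workaround: Shift+R reverts to the previous image, which ESXi keeps in /altbootbank. A small sketch to check what each bank holds before committing (the helper name is mine; the paths are the standard ESXi ones):

```shell
# ESXi keeps the running image in /bootbank and the previous one in
# /altbootbank; Shift+R at the boot prompt makes the alternate bank active.
# Reading the build string out of a bank's boot.cfg shows what a revert
# would land on:
boot_build() {
  grep '^build=' "$1/boot.cfg" | cut -d= -f2
}
# e.g.: boot_build /bootbank ; boot_build /altbootbank
```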

Zarach
Contributor

The timing of this thread coincides with my performance drops as well. I couldn't pin it down to a patch because of my fairly fire-and-forget method when doing updates on the particular host I'm seeing this issue with. It happens to be my only 6.5 host, in my home/lab environment. I haven't migrated my customers from 6 to 6.5 yet because, to be blunt, I just don't trust it and haven't since I installed it. It hasn't "felt" right, at least compared to 6. The forced so-called web client has cost me more time than I care to admit; but I digress!

My hardware:

PowerEdge T620

128GB RAM

Perc H710

8x Seagate Constellations in RAID 10

2x 6 core 2.3GHz Xeon

My issue is the same. I have several 2012 R2 VMs as well as some Windows 10 VDAs running XenDesktop 7.13. Coincidentally, I have been migrating from some of my 2012 R2 VMs to 2016, which I thought might have caused my problem. My thinking was that maybe there's an issue between 2016 and VMware; but then I recalled I had a couple of 2016 VMs before I decided to upgrade from 6 to 6.5. The performance drop seems disk-related. All signs point to low disk usage and latency from the VMware side, but my Windows 10 VM, which has the disk counter enabled, will show 100% disk when I experience my slowdown. I haven't dug deep enough to see whether that's the case with my other server VMs just yet.

My Dell hardware is all in the green, and I've made sure all the firmware/drivers are up to date as well. I haven't ruled out the RAID controller being flaky, but this thread gave me a little hope that it isn't a hardware issue.

I decided to tear my install out and reinstall, since it runs on the Dell Dual SD Module and is easy to replace. Unfortunately, I can't easily downgrade back to 6 because my VM hardware versions were upgraded to 13 as part of my 6.5 tests. I patched immediately, as is my habit, and still saw issues. I downgraded my RAID driver in VMware back to an older 6.0 version and still experienced issues. I've since installed the latest PERC H710/Avago driver, which is slightly newer than VMware's in-box version, and the spotty performance persists. I may try ripping out the patches mentioned in this thread and see if that helps. I'll report back if I can get that done.
