VMware Communities
superciliousdud
Enthusiast

Disk write cache broken in Workstation 14

Hi,

I upgraded from 12.x to 14 and, despite not changing anything in my configuration, write caching no longer works.

I'm running 14.1.1 on a Windows 10 x64 host.

I have tried each of the following options, alone and in combination, in my config.ini:

hard-disk.hostBuffer = "enabled"

hard-disk.useUnbuffered = "FALSE"

hard-disk.synchronous = "FALSE"

aiomgr.buffered = "TRUE"

aiomgr.unbuf = "FALSE"

Still, despite my entire VMDK being cached in RAM, the vmware-vmx process waits for I/O to complete before acknowledging it inside the VM. How can I fix this and get the old behaviour?

I depend on the I/O performance inside the VM and many of my workloads are broken without it. My system is using ECC RAM, and the entire workstation is powered by a 3 kVA UPS and running a 12-disk RAID 6, so I'm not at all worried about data loss. I just desperately need to disable the synchronous I/O so I can upgrade to 14.x again. In the meantime, I have been forced back to Workstation 12. :(

17 Replies
superciliousdud
Enthusiast

I know that this does not constitute a viable solution to this problem for most people, but I managed to find a workaround that works for my purposes. I monkey patched the vmware-vmx.exe file to simply remove the sync request from the file handle at creation time.

All I did was inject a single AND instruction before each call to CreateFileW() - the equivalent of:

dwFlagsAndAttributes &= ~(FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH)

This is simply ANDing the register containing dwFlagsAndAttributes with 0x5FFFFFFF, and thus no more bypassing Windows cache.
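The mask arithmetic above can be checked with a few lines of Python. This is only a sketch of the bit manipulation, not the patch itself; the flag values are the ones documented for CreateFileW in the Windows SDK, and the helper name strip_sync_flags and the sample flag combination are made up for illustration.

```python
# Flag values as documented for CreateFileW in the Windows SDK.
FILE_FLAG_WRITE_THROUGH = 0x80000000
FILE_FLAG_NO_BUFFERING = 0x20000000
FILE_FLAG_OVERLAPPED = 0x40000000
FILE_ATTRIBUTE_NORMAL = 0x00000080

# 32-bit mask that clears only the two cache-bypass bits.
MASK = ~(FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH) & 0xFFFFFFFF


def strip_sync_flags(flags: int) -> int:
    """Hypothetical helper: clear the cache-bypass bits, keep everything else."""
    return flags & MASK


# Illustrative flag combination (an assumption, not taken from vmware-vmx.exe).
flags = (FILE_FLAG_OVERLAPPED | FILE_FLAG_NO_BUFFERING
         | FILE_FLAG_WRITE_THROUGH | FILE_ATTRIBUTE_NORMAL)

print(hex(MASK))                     # 0x5fffffff
print(hex(strip_sync_flags(flags)))  # 0x40000080
```

All other bits, such as the overlapped flag and the file attributes, pass through unchanged, which is why a single AND per call site is enough.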

With this simple tweak I am back to having >1 million IOPS in my virtual machines again, as I did with Workstation 12.5.x.

Hope this helps others who have been burned by this problem and VMware's increasingly hostile attitude towards Workstation users.

mackpt1
Contributor

Here is how to enable write caching in Windows 10 to enhance performance:

1:- Press Windows+R to open the Run dialog. Type devmgmt.msc and press Enter.

2:- In the Device Manager window that opens, scroll down to Disk drives and expand it. Right-click the drive you want and choose Properties from the menu.

3:- Among the tabs at the top, click Policies. Tick the check box next to "Enable write caching on the device" and press OK.

4:- Open any of the heavy apps that used to run slowly to see that they now respond much faster.

Source:- https://merabheja.com/enable-write-caching-in-windows-10/

superciliousdud
Enthusiast

No, I am not talking about caching at the disk layer - this has an immeasurably small effect on performance for my workload. I am talking about caching at the file-system layer on a Windows host OS. My ugly solution is a literal 2 orders of magnitude (100x) increase in performance, and is not limited to tiny disk caches of a few megabytes in size, but the entire 128GB of RAM in my host system can be used to cache reads and writes from VMs. Going from around 10K IOPS in a virtual machine (random 4K, single FIFO) to around 1000000 IOPS is a huge difference in performance.
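To put those figures in perspective, here is a rough back-of-the-envelope calculation. It only uses the numbers quoted above (about 10K and about 1,000,000 IOPS at a 4K block size); the function name is made up for the example.

```python
BLOCK_SIZE = 4 * 1024  # bytes per random 4K I/O


def throughput_mib_s(iops: float, block_size: int = BLOCK_SIZE) -> float:
    """Convert an IOPS figure into MiB/s for a fixed block size."""
    return iops * block_size / (1024 * 1024)


before_iops = 10_000      # approximate figure without host caching
after_iops = 1_000_000    # approximate figure with host caching

speedup = after_iops / before_iops
print(speedup)                        # 100.0
print(throughput_mib_s(before_iops))  # 39.0625
print(throughput_mib_s(after_iops))   # 3906.25
```

So the difference is roughly 39 MiB/s versus 3.8 GiB/s of random 4K traffic, which is far beyond what a disk-level cache of a few megabytes could absorb.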

The fact that it is no longer supported in VMware has forced me to start migrating to another hypervisor. With a Linux host OS and KVM, I get approximately 20% better performance than VMware, but worse virtual GPU support in Windows VMs, so it's a difficult trade-off.

bonnie201110141
VMware Employee

Can you please try the option below?

hard-disk.useUnbuffered = "TRUE"

Let me know if it works for you.

bonnie201110141
VMware Employee

All other options should be deleted.

superciliousdud
Enthusiast

No, the performance is back to being terrible with that option. It looks like there is no "official" way to disable sync I/O with Workstation 14.

bonnie201110141
VMware Employee

Sorry, we misunderstood your issue. There seems to be a bug in our product, and we are trying to fix it. Meanwhile, please try adding the option below to the config file as a workaround.

aiomgr.simple="Generic"

superciliousdud
Enthusiast

Thanks for your response, but I am not in a position to test it right now as I have replaced Workstation 14 with Workstation 12 on all the machines here because we discovered a severe bug with Workstation 14.1.1 giving (non-deterministic) incorrect results for a deterministic computation.

I cannot overstate how incredibly frustrating it has been over the past week or so trying to figure out what the problem was as we never suspected a hypervisor bug. The error occurs only inside VMs and only when those exact same VMs are running under Workstation 14.1.1 - no problem with those same VMs running under Workstation 12.5 or natively on the host (we use Server 2016 on a variety of machines, mainly HP DL360 and some Lenovo). There are no ECC errors logged and the problem occurs on Sandy Bridge CPUs through all generations to Broadwell CPUs - we don't have any Xeon Golds or EPYC chips yet to test on. All run the tests fine on Workstation 12 and natively on the host OS, all fail (with different results) on Workstation 14.

Is VMware aware of this bug and is there a fix in the works?

bonnie201110141
VMware Employee

We are so sorry to hear that you ran into a severe issue with Workstation 14.1.1. Can you please give a detailed description of your issue? What kind of computation are you running in the VM? How can we reproduce the issue locally? Thanks a lot!

superciliousdud
Enthusiast

I'm not aware of how to reproduce the problem easily. We have a long test-suite that runs for about 10 hours every night on at least one of our machines in a VM. We use VMware Workstation as our hypervisor, with a mix of Server 2012 R2 and Server 2016 as hosts, exclusively on Xeon CPUs (all generations from Sandy Bridge to Broadwell). Our shorter test-suites (<1 hour) pass on Workstation 14 just fine, and longer test-suites (~3 hours) only fail intermittently. However, the biggest test suite runs for just under 10 hours and fails 100% of the time under Workstation 14, and each time the output data has a different checksum not matching the known-good result.
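For anyone who wants to run the same kind of check on their own workload, a minimal sketch of comparing an output file against a known-good checksum could look like this. The choice of SHA-256, the function names and the demo file are assumptions for illustration; the test suite and its checksum method are not described in this thread.

```python
import hashlib
import os
import tempfile


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large outputs do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_known_good(path: str, known_good_hex: str) -> bool:
    """True if the file's SHA-256 equals the recorded known-good value."""
    return sha256_of(path) == known_good_hex.lower()


# Demo with a throwaway file standing in for real test output.
with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "output.bin")
    with open(out, "wb") as f:
        f.write(b"deterministic result")
    known_good = hashlib.sha256(b"deterministic result").hexdigest()
    demo_ok = matches_known_good(out, known_good)
    demo_bad = matches_known_good(out, "00" * 32)

print(demo_ok, demo_bad)  # True False
```

Recording the checksum from a run on a trusted setup (native host or Workstation 12 in this case) and comparing every later run against it is what makes a silent corruption like this visible at all.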

At first, I suspected bad hardware and replaced the CPUs, RAM and the RAID controller (HP P822), but the problem persisted, so I replaced the server, and then replaced it again with another vendor's hardware. When that still showed the same problem, I ran the test overnight on all our idle machines. The passing ones had only one thing in common: Workstation 12. I then tested the failing servers by running the test suite natively and they all passed.

I am not in a position to provide a copy of our codebase as a test case to VMware, but the problem is 100% reproducible. If it helps, I can run the test on a spare machine at home and provide copies of logs. Please let me know what specific steps to take to generate the necessary logs.

superciliousdud
Enthusiast

Hi bonnie201110141,

Thanks to the 4 day weekend, I finally had some time to run some more extensive tests. The good news is that the aiomgr.simple="Generic" config setting works and the I/O performance is restored in 14.1.1. However, an additional effect is that the non-deterministic wrong results in our test suite also disappeared with this option.

I am now 100% convinced there is a race-condition or similar bug in the default aiomgr implementation of Workstation 14.x.

The bug occurs on VMDKs stored on NTFS volumes using Windows 10/Server 2016 Storage Spaces. The bug disappears when using aiomgr.simple="Generic". The bug is specific to Workstation 14.x; no version of 12.x has this behaviour. The bug is easily triggered with high I/O using a mix of random reads and writes, not sequential I/O. The bug is independent of which virtual disk adapter is used; both pvscsi and lsisas exhibit it. The bug is independent of which guest OS is used; Linux, Windows 7 and Windows Server 2016 guests can all trigger it.
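As an illustration of the I/O pattern described here (random 4K reads mixed with random 4K writes, with every read verified), a toy sketch might look like the following. It is not the test suite from this thread, and all names and sizes are made up. It also goes through Python's buffering and the guest's page cache, so a serious attempt would need a far larger file and a dedicated I/O tool with verification.

```python
import os
import random
import tempfile

BLOCK = 4096  # 4K blocks, as in the workload described above


def pattern(block_index: int, generation: int) -> bytes:
    """Deterministic block contents, so any block can be re-derived and checked."""
    seed = f"{block_index}:{generation}".encode()
    return (seed * (BLOCK // len(seed) + 1))[:BLOCK]


def stress(path: str, blocks: int = 256, ops: int = 5000, seed: int = 1) -> int:
    """Mix random 4K writes and reads; return how many reads came back wrong."""
    rng = random.Random(seed)
    generation = [0] * blocks  # last generation written to each block
    mismatches = 0
    with open(path, "w+b") as f:
        for i in range(blocks):          # lay out the file sequentially first
            f.write(pattern(i, 0))
        f.flush()
        for _ in range(ops):
            i = rng.randrange(blocks)
            f.seek(i * BLOCK)
            if rng.random() < 0.5:       # random write of a new generation
                generation[i] += 1
                f.write(pattern(i, generation[i]))
            elif f.read(BLOCK) != pattern(i, generation[i]):
                mismatches += 1          # random read returned stale or corrupt data
    return mismatches


with tempfile.TemporaryDirectory() as tmp:
    result = stress(os.path.join(tmp, "scratch.bin"))

print(result)  # expected to be 0 on a healthy storage stack
```

Any non-zero count would mean a read returned something other than the last data written to that block, which is the kind of symptom a race in the asynchronous I/O path could produce.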

Hopefully VMware can track this down and fix it, but in the meantime, setting aiomgr.simple="Generic" works correctly.

bonnie201110141
VMware Employee

Thanks for your tests! We will try to reproduce locally and investigate.

richard612
Enthusiast

Just stumbled across this thread by superciliousdud whilst Google searching for ways to cache disk I/O more aggressively and get VMware Workstation running a bit faster. Very relevant to my interests.

This is in a VM on Workstation 14.1.1 running Server 2016 prior to aiomgr.simple="Generic":

-----------------------------------------------------------------------

CrystalDiskMark 6.0.0 x64 (C) 2007-2017 hiyohiyo

-----------------------------------------------------------------------

Sequential Read (Q= 32,T= 1)  :   215.485 MB/s

Sequential Write (Q= 32,T= 1) :   253.429 MB/s

Random Read 4KiB (Q=  8,T= 8) :    27.683 MB/s [   6758.5 IOPS]

Random Write 4KiB (Q=  8,T= 8):    33.578 MB/s [   8197.8 IOPS]

Random Read 4KiB (Q= 32,T= 1) :    17.497 MB/s [   4271.7 IOPS]

Random Write 4KiB (Q= 32,T= 1):    10.862 MB/s [   2651.9 IOPS]

Random Read 4KiB (Q=  1,T= 1) :     4.866 MB/s [   1188.0 IOPS]

Random Write 4KiB (Q=  1,T= 1):    11.483 MB/s [   2803.5 IOPS]

  Test : 500 MiB [C: 39.0% (15.4/39.5 GiB)] (x1)  [Interval=5 sec]

  Date : 2018/04/26 17:31:24

    OS : Windows Server 2016 Datacenter (Full installation) [10.0 Build 14393] (x64)

This is after aiomgr.simple="Generic":

-----------------------------------------------------------------------

CrystalDiskMark 6.0.0 x64 (C) 2007-2017 hiyohiyo

-----------------------------------------------------------------------

Sequential Read (Q= 32,T= 1)  :  805.823 MB/s

Sequential Write (Q= 32,T= 1) :  339.611 MB/s

Random Read 4KiB (Q=  8,T= 8) :  188.217 MB/s [  45951.4 IOPS]

Random Write 4KiB (Q=  8,T= 8):  169.168 MB/s [  41300.8 IOPS]

Random Read 4KiB (Q= 32,T= 1) :   51.101 MB/s [  12475.8 IOPS]

Random Write 4KiB (Q= 32,T= 1):   41.781 MB/s [  10200.4 IOPS]

Random Read 4KiB (Q=  1,T= 1) :   26.277 MB/s [   6415.3 IOPS]

Random Write 4KiB (Q=  1,T= 1):   15.358 MB/s [   3749.5 IOPS]

  Test : 500 MiB [C: 38.8% (15.3/39.5 GiB)] (x5)  [Interval=5 sec]

  Date : 2018/04/26 19:25:23

    OS : Windows Server 2016 Datacenter (Full installation) [10.0 Build 14393] (x64)
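To make the two runs easier to compare, the per-test speedup can be computed from the figures above. The numbers are copied from the two CrystalDiskMark outputs; note that the first run used one pass (x1) and the second five (x5), so treat the ratios as indicative only.

```python
# (test, before, after) taken from the two CrystalDiskMark runs above.
# Sequential tests are in MB/s, random tests in IOPS.
results = [
    ("Sequential Read  Q32T1 (MB/s)", 215.485, 805.823),
    ("Sequential Write Q32T1 (MB/s)", 253.429, 339.611),
    ("Random Read  4KiB Q8T8 (IOPS)", 6758.5, 45951.4),
    ("Random Write 4KiB Q8T8 (IOPS)", 8197.8, 41300.8),
    ("Random Read  4KiB Q32T1 (IOPS)", 4271.7, 12475.8),
    ("Random Write 4KiB Q32T1 (IOPS)", 2651.9, 10200.4),
    ("Random Read  4KiB Q1T1 (IOPS)", 1188.0, 6415.3),
    ("Random Write 4KiB Q1T1 (IOPS)", 2803.5, 3749.5),
]

# Ratio of "after" to "before" for each test.
speedups = {name: after / before for name, before, after in results}

for name, ratio in speedups.items():
    print(f"{name:32s} {ratio:5.2f}x")
```

The ratios range from roughly 1.3x (Q1T1 random write and sequential write) to roughly 6.8x (Q8T8 random read).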

 

I think VMware has a problem on their hands.  Related question: can I put this setting in settings.ini or config.ini to make it global?

Edit: Yes.  It goes in config.ini.  Just tested this.

bonnie201110141
VMware Employee

Yes, we are aware of this issue and are working on a fix. For now, please add that option to config.ini. Thanks!
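For anyone applying this on several machines, a small script along these lines could add the option without duplicating it. This is only a sketch: the helper name set_option is made up, the UTF-8 encoding and exact-case key matching are assumptions about the file, and the demo runs against a temporary file rather than a real config.ini. Back up the real file first; editing it will typically need administrator rights.

```python
import os
import re
import tempfile


def set_option(path: str, key: str, value: str) -> None:
    """Set key = "value" in a config file, replacing any existing line for key."""
    lines = []
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:  # encoding is an assumption
            lines = f.readlines()
    key_re = re.compile(r"^\s*" + re.escape(key) + r"\s*=")
    kept = [ln for ln in lines if not key_re.match(ln)]
    if kept and not kept[-1].endswith("\n"):
        kept[-1] += "\n"                  # make sure the new line starts cleanly
    kept.append(f'{key} = "{value}"\n')
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(kept)


# Demo on a temporary file; "example.option" is a hypothetical placeholder key.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "config.ini")
    with open(demo, "w", encoding="utf-8") as f:
        f.write('example.option = "1"')
    set_option(demo, "aiomgr.simple", "Generic")
    set_option(demo, "aiomgr.simple", "Generic")  # second call must not duplicate
    with open(demo, encoding="utf-8") as f:
        demo_text = f.read()

print(demo_text)
```

Running it twice leaves a single aiomgr.simple line in the file, so it is safe to re-run.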

superciliousdud
Enthusiast

bonnie201110141, can you please confirm whether this bug is fixed in the new 14.1.2 release? It is not mentioned in either the resolved or known issues sections of the release notes.

Thanks

bonnie201110141
VMware Employee

With VMware Workstation 14.1.2, buffering is on by default now. Please try WS 14.1.2, and also let us know if it fixes your last issue with your test-suite.

Testerhood
Contributor

Today I upgraded to VMware Workstation Player 15.0.3 (from 12.5.8) and immediately noticed that VMs were trashing my host system's memory cache. Of course it's nice that VMs benefit from greatly increased I/O performance with the memory cache enabled, but I wasn't pleased that this is now enabled by default, without any warning or hint. I haven't found any changelog that mentions this, only this well-hidden forum post. So, funnily enough, I can confirm that this change works. However, the unwanted side effect is that, in my experience, it's enabled for everything. To disable it, I added the line hard-disk.hostBuffer = "disabled" to "C:\ProgramData\VMware\VMware Workstation\config.ini". Now it's globally disabled, but I have seen that it can be enabled for specific VMs by adding disk.hostBuffer = "enabled" to the vmx file of each VM.

Please be more transparent with such changes for upcoming newer versions. Thank you!
