VMware Communities
Mickou06
Contributor

Workstation Pro 16 NVMe controller

Hi all,

I'm encountering NVMe errors in my guest Linux OS (nvme QID timeouts) when an entire NVMe disk is allocated to a virtual machine. I've made some tests, as follows:

My host machine is Windows 10 Pro, running on an AMD Ryzen 3800XT with an NVMe SSD.

I created 2 virtual machines:

vUbuntu1804 => Running Ubuntu 18.04 LTS on an entire physical Western Digital WDS500 NVMe SSD, using the NVMe controller type.

=> Under intensive disk usage, I get "nvme: QID Timeout, completion polled"

vUbuntu2004 => Running Ubuntu 20.04 LTS on an entire physical Samsung 870 Evo Plus NVMe SSD, using the NVMe controller type.

=> Under intensive disk usage, I get "nvme: QID Timeout, aborting"

I tried various kernel/module parameter tunings to turn off power saving in the PCIe/NVMe core drivers, different I/O schedulers, etc. Nothing removes these timeouts.
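
For reference, the kind of tuning I tried looked roughly like this (illustrative values, applied inside the guest; none of it helped):

# disable NVMe APST and PCIe power saving via /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"
# apply and reboot
sudo update-grub

# switch the I/O scheduler at runtime, e.g. to none
echo none | sudo tee /sys/block/nvme0n1/queue/scheduler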

I saw someone stating that the SCSI controller was better performance-wise, but I was wondering: is anyone using the NVMe controller type on a Windows guest / Linux host with a dedicated entire disk, and does it work?

scott28tt
VMware Employee

@Mickou06 

Workstation version?

 


-------------------------------------------------------------------------------------------------------------------------------------------------------------

Although I am a VMware employee I contribute to VMware Communities voluntarily (i.e. not in any official capacity)
VMware Training & Certification blog
Mickou06
Contributor

Oops! I've modified the title; it's the latest 16!

Mickou06
Contributor

Ok, I've made a radical move:

I kept the Western Digital WDS500 NVMe for VMware, using the NVMe controller;

I used the Samsung 870 Evo Plus to install Ubuntu 18.04, in dual boot with Windows.

In native Ubuntu 18.04 there are no more QID timeout completion polled/aborting errors when using the Samsung NVMe drive intensively, nor with the Western Digital one that I mounted from this native install.

So these NVMe timeouts are really due to Windows + VMware using the NVMe controller, and fortunately my drives are not the culprits 🙂

=> There's something weird with the VMware NVMe controller...
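
For reference, the "intensive usage" that triggers the timeouts for me is mostly Linux-related compilation work; any sustained I/O load does it, e.g. something along these lines (illustrative fio run, the path is a placeholder):

sudo apt install fio
fio --name=stress --filename=/mnt/test/fio.tmp --size=8G --ioengine=libaio --iodepth=16 \
    --rw=randrw --bs=64k --direct=1 --runtime=300 --time_based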

 

wd123
Enthusiast

I can confirm that the Workstation Pro 16 virtual NVMe interface leaves something to be desired.  I think it's important to understand that the underlying storage backing on the host of the guest's virtual NVMe is probably irrelevant.  Why do I believe this?

Take a modern Linux VM (kernel 5.4.92) configured with a virtual NVMe storage adapter.  Especially under load, the VM will see QID timeout messages, which are associated with a temporary pause of the VM kernel.
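
They're easy to catch in the guest's kernel log while the load is running, for example:

# follow the guest kernel log live
journalctl -k -f | grep -i nvme
# or check after the fact
dmesg | grep -iE "nvme.*(timeout|abort)"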

[Screenshot: guest kernel log showing the nvme QID timeout messages]

Now, at those times look at the logs on the host OS for indicators of storage problems.  In my case, there are none.  But let's dig deeper...

Finally, take that same VM that is known to produce QID timeouts, shut it down, and then in VMware Workstation:

  1. Remove the NVMe controller.  The VMDKs will be left behind.
  2. Add a SATA controller, using an existing VMDK as the disk backing.
  3. Tweak grub if necessary to handle the change in the root device (a sketch follows below).
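
For step 3, if the root is referenced by device name rather than UUID, something like this inside the guest covers it (a sketch; use your own UUIDs):

# filesystem UUIDs don't change when the controller type changes
sudo blkid
# make sure /etc/fstab and the grub config use root=UUID=... instead of /dev/nvme0n1pX,
# since the same disk will show up as /dev/sdX behind the SATA controller, then regenerate grub
sudo update-grub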

Now you have the exact same guest OS, whose disk is provided by the exact same physical disk backing on the host OS.  What's the difference?  No QID timeouts, no other disk-related errors, and no unexpected brief pauses in the VM.

Conclusion: The virtual NVMe interface provided by VMware Workstation is not ready for prime time.  And ironically, it may be slower than other (SATA/SCSI) interfaces due to periodic hangs of the guest OS.  It's not clear why the NVMe device isn't working well with a Linux guest.  It's possible that the interface is fine, but that Linux is making an incorrect assumption about its operation that results in these QID timeouts.  Or possibly the virtual NVMe interface is inherently flawed in some way in the current VMware Workstation, and the QID timeouts are simply the noticeable symptoms of it.

Either way, I'd avoid using virtual NVMe drives in the current version of VMware Workstation until the problem is fixed on the VMware and/or Linux side.

Mickou06
Contributor

Yep, you're right...

The problem is that using the SCSI controller, you get something like this:

NVMe disk => Windows 10 host NVMe driver => VMware SCSI controller => VMware guest Ubuntu SCSI disk

So at some point the VMware virtual SCSI controller is translating SCSI commands to NVMe commands, so a lot of things are, let's say, mapped to an equivalent feature or not mapped at all (like SMART attributes and disk monitoring such as internal temperatures), and I don't know what happens with the NVMe-specific commands that help preserve the drive's lifespan (the TRIM equivalent, if such NVMe features are supposed to handle discarding of deleted files and so on).
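
A quick way to see what actually makes it through the virtual controller is to check from inside the guest whether discard/TRIM is even exposed (illustrative commands):

# non-zero DISC-GRAN / DISC-MAX values mean the virtual disk advertises discard support
lsblk --discard
# if it does, a manual trim on a mounted filesystem should succeed
sudo fstrim -v /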

After reaching the same conclusion as you, I looked at Windows DDA (Discrete Device Assignment) to test passing the whole NVMe drive through to the VM, but this feature is supported only on Windows Server and VMware ESXi... So I moved to a native Linux installation, because I make intensive use of the SSD (Linux-related compilations) and I don't want to wear out my NVMe disk within a few months... But I miss the comfort of using Linux inside Windows 🙂

cocus
Contributor

I'm still having these issues on my machine. I have a dedicated NVMe disk that VMware 16 is using just for my Linux (Ubuntu 18.04 LTS) guest (Windows 10 host). I started having these issues after migrating from VMware 15.

I tried to play with the W10 hypervisor setting, but that only made things worse. I'm out of clues on what to do, since these constant "OS freezes" are really annoying. Would using a SATA controller make them go away without losing much performance or features like TRIM?

 

Thanks!

wd123
Enthusiast

If you configure the guest OS to have a virtual SATA adapter as opposed to NVMe, then the performance problems will (ironically) disappear.
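
With the VM powered off, the change boils down to swapping the controller entries in the .vmx file (a rough sketch; the disk file name here is a placeholder and the exact keys depend on how the VM was created), or doing the equivalent in the VM settings dialog:

# before: disk attached to the virtual NVMe controller
nvme0.present = "TRUE"
nvme0:0.present = "TRUE"
nvme0:0.fileName = "Ubuntu.vmdk"

# after: same VMDK attached to a virtual SATA controller
sata0.present = "TRUE"
sata0:0.present = "TRUE"
sata0:0.fileName = "Ubuntu.vmdk"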

If you had a real NVMe on the host AND the guest had a virtual NVMe adapter AND the virtual NVMe adapter worked as expected, then using a SATA virtual adapter would be non-optimal.  But we don't live in such a world last I checked, so that's sort of moot.

cocus
Contributor

>If you configure the guest OS to have a virtual SATA adapter as opposed to NVMe, then the performance problems will (ironically) disappear.

Good to know, I thought so. I was worried about the TRIM features, since I don't want to nuke my NVMe.
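
Once it's on the SATA controller I'll probably just confirm that periodic TRIM still runs in the guest, something like (illustrative check on Ubuntu):

# Ubuntu ships a weekly fstrim systemd timer
systemctl status fstrim.timer
# run a trim by hand and see whether space is actually discarded
sudo fstrim -av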

 

>If you had a real NVMe on the host AND the guest had a virtual NVMe adapter AND the virtual NVMe adapter worked as expected, then using a SATA virtual adapter would be non-optimal.  But we don't live in such a world last I checked, so that's sort of moot.

 

Indeed, I agree with you.

 

As a side note, I couldn't tolerate this anymore and rolled back to VMware 15. No issues whatsoever, and performance is top notch, as I expected. I know it's not ideal to run this older software, but hey, it works!

In fact, I didn't even stop my guest machine. I suspended it, uninstalled VMware 16, installed VMware 15, restarted the guest VM, and everything is working fine. Not a single timeout since I downgraded, and I should have seen at least 10 timeouts by this time with VMware 16.

dlhtox
Contributor

As an FYI, here is my performance in Workstation 17.

NVMe seems to be strong in some areas and not as much in others. Also, the top speed we see of 2827 on read and 2629 on write is nowhere near the speed of my NVMe host drive. I built a new system with the latest and greatest NVMe mainly for VMware Workstation, and it is disappointing that it is not even half as fast as the host NVMe.

Hyper-V tests out around the same as the host drive.

I have a very large program I install and it takes 21 minutes to install on VMWare Workstation 17 and 15 minutes on Hyper-V.

[Screenshot: disk benchmark results from the Workstation 17 VM]

 

 
