VMware Cloud Community
chrbuhl
Contributor

Improving Threadripper performance in VMware Player by disabling SMT (hyperthreading)?

Hi

I am using VMware Player on Windows 10 to run a virtual workstation (Ubuntu 18 with FreeSurfer) for processing MRI images.

Currently I have a Threadripper 1920X CPU (12 cores/24 threads) with SMT (hyperthreading) enabled.

VMware Player allows a maximum of 16 vCPUs, and I can see from the resource monitor that the workstation has access to approximately 2/3 of the CPU capacity during high-load processing.

I was wondering...

If the CPU were upgraded to a 16-core Threadripper, SMT were disabled in the BIOS, and I gave the VM e.g. 14 vCPUs, would the VM then have access to 14 full CPU cores instead of the current 8 with SMT?

5 Replies
bluefirestorm
Champion

There is no clear-cut answer.

But as a general rule, 2 physical CPU cores will outperform 2 threads in a single CPU core, as the 2 threads in the same core still share and compete for resources within the core (L1/L2 cache, execution engine). That is the case with Intel Hyperthreading; very likely the same with AMD SMT.

Almost every application reaches a point where adding more resources (be it CPU cores/threads or RAM) brings no further improvement, or where the improvement is no longer commensurate with the resources added: a point of diminishing returns. In some rare cases, there might even be a drop in performance (the phrase "too many cooks spoil the broth" comes to mind).

Looking at the requirements of FreeSurfer

https://surfer.nmr.mgh.harvard.edu/fswiki/DownloadAndInstall

It looks like it is dependent on OpenGL (presumably for image rendering).

I don't know where the bottleneck (if there is one) lies during the high-load processing you see, but if it is in the rendering of the image, changing CPUs will not help. Upgrading the host graphics card is the way to go if the bottleneck is in the OpenGL performance within the VM. VMware Player 12.x/14.x/15.x makes use of DX11 to deliver the OpenGL 3.3 core profile capability in the Linux VM.

I would suggest running some tests with different configurations: 16 vCPU with SMT, 12 vCPU with SMT, 8 vCPU without SMT. You have to see whether there is any significant difference in the application's performance. You could also try 12 vCPU without SMT, but that might starve both the host and the VM. These tests might give you an idea of whether it is worth changing the CPU.

chrbuhl
Contributor

Thank you for your input... At the moment I run the analysis in 8 different terminal windows at a time, each using the "openmp" option with 2 threads.

Using 1 terminal window: analysis with 8 cores/16 threads using openmp 16 and the "parallel" option takes approx. 4 hours per MRI sequence per terminal window = 1*(24/4) = 6 MRI seqs/24 hours

Using 2 terminal windows: analysis with 4 cores/8 threads using openmp 8 and the "parallel" option takes approx. 5 hours per MRI sequence per terminal window = 2*(24/5) = 9.6 MRI seqs/24 hours

Using 4 terminal windows: analysis with 2 cores/4 threads using openmp 4 and the "parallel" option takes approx. 6 hours per MRI sequence per terminal window = 4*(24/6) = 16 MRI seqs/24 hours

Using 8 terminal windows: analysis with 1 core/2 threads using the openmp 2 option takes approx. 9 hours per MRI sequence per terminal window = 8*(24/9) = 21.3 MRI seqs/24 hours

I was wondering if upgrading to 16 cores, disabling SMT and assigning 14 vCPUs to VMware/Ubuntu would lead to 14 full cores being assigned to the VM. The analysis could then be run in 14 different terminal windows at a time and yield something like:

Using 14 terminal windows: analysis with 1 core/1 thread using the openmp 1 option might take approx. 11 hours per MRI sequence per terminal window = 14*(24/11) = 30-31 MRI seqs/24 hours

Or, using the current CPU with 11 out of 12 cores without SMT could yield:

Using 11 terminal windows: analysis with 1 core/1 thread using the openmp 1 option might take approx. 11 hours per MRI sequence per terminal window = 11*(24/11) = 24 MRI seqs/24 hours

I guess that only experimenting with the current setup can give the answer before ordering a new CPU with more cores...
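The throughput arithmetic above can be reproduced with a short Python sketch. The function name `seqs_per_day` is made up for illustration, and the 11-hour run times for the 14- and 11-window cases are guesses from this post, not measurements.

```python
# Throughput estimate: MRI sequences finished per 24 hours when several
# terminal windows each process one sequence at a time.

def seqs_per_day(windows: int, hours_per_seq: float) -> float:
    """Sequences completed in 24 hours across all terminal windows."""
    return windows * 24.0 / hours_per_seq

# (terminal windows, hours per sequence) as reported in this post
measured = [(1, 4), (2, 5), (4, 6), (8, 9)]
# Guessed run times without SMT (assumptions, not measurements)
projected = [(14, 11), (11, 11)]

for windows, hours in measured + projected:
    rate = seqs_per_day(windows, hours)
    print(f"{windows:2d} windows x {hours:2d} h each -> {rate:.1f} seqs/24h")
```

Replacing the guessed 11 hours with a real single-thread timing from the current CPU would show quickly whether the projection holds.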

bluefirestorm
Champion

I don't know if VMware Player will allow you to set 11 or 14 vCPUs. I looked at VMware Player 15.0.2 and the drop-down list does not have 11 or 14. You could edit the vmx and see if it allows the VM to power up with either 11 or 14. I don't have a system that has more than 4c/8t. If it does not allow 14 vCPUs and you change the CPU, maybe you could only go up to 12.

For the different configuration tests, what I had in mind was:

16 vCPU with SMT should be around 20%-30% better than 8 vCPU without SMT.

12 vCPU with SMT should be almost the same as 8 vCPU without SMT.

This is on the assumption that hyperthreading/SMT gives around a 20-30% boost, and that a vCPU will not necessarily be using a physical core all the time.

I know nothing about FreeSurfer, so I might be talking nonsense. But my guess is that running the analysis in an Ubuntu Terminal window means no graphics rendering, so the host GPU is not an issue.

You are using an option called "openmp", which likely means the OpenMP multithreading library. Assuming that is the case, I think each Terminal session needs to use multiple threads to gain any advantage from it. So using 1 core/1 thread per Terminal window might defeat the purpose of using the openmp option.

So it looks like you might also need to find the optimal combination of number of Terminal windows and cores/threads per window (more tests!!!).
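To plan those tests, here is a minimal Python sketch that lists every way to split a fixed thread budget evenly between Terminal windows and OpenMP threads per window. The helper name `splits` is made up, and it only enumerates candidate configurations; it says nothing about which one FreeSurfer will actually run fastest.

```python
# Enumerate (terminal windows, OpenMP threads per window) pairs that use
# exactly the given number of vCPU threads, as candidate test setups.

def splits(total_threads: int) -> list[tuple[int, int]]:
    """All even splits of total_threads into windows x threads per window."""
    return [(windows, total_threads // windows)
            for windows in range(1, total_threads + 1)
            if total_threads % windows == 0]

for windows, threads in splits(16):
    print(f"{windows:2d} Terminal windows x {threads:2d} OpenMP threads each")
```

For 16 vCPUs this includes the four combinations already timed (1x16, 2x8, 4x4, 8x2) plus 16x1, which has not been tested yet.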

chrbuhl
Contributor

Thank you for your input... I didn't consider the predefined options in the CPU drop-down menu. I was last looking at VirtualBox, which has a slider going up to 24 vCPUs...

OpenGL performance is not important... FreeSurfer analysis is raw computing power, using CPU and memory bandwidth to crunch A LOT of numbers while slowly building 3D models of the brain.

What do you think is best... 12+0 "full cores" without SMT, or 8 cores + 8 additional threads using SMT?

If SMT gives a 20-30% CPU boost, then the difference is rather negligible, ~15% (8 cores with SMT = 1.3*8 = 10.4 vs. 12 cores without SMT = 12*1 = 12).
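The same comparison as a small Python sketch, run over the whole 20-30% range instead of only the 30% end. The "core equivalents" model (cores multiplied by one plus the SMT boost) is just the rough assumption used in this thread, and the function name is made up.

```python
# Rough "core equivalent" comparison: 8 cores with SMT vs 12 cores without.
# The SMT boost values are assumed rules of thumb, not measurements.

def core_equivalents(cores: int, smt_boost: float = 0.0) -> float:
    """Cores scaled by an assumed SMT throughput boost (0.3 means +30%)."""
    return cores * (1.0 + smt_boost)

without_smt = core_equivalents(12)
for boost in (0.20, 0.25, 0.30):
    with_smt = core_equivalents(8, boost)
    advantage = without_smt / with_smt - 1.0
    print(f"SMT boost {boost:.0%}: 8 cores+SMT = {with_smt:.1f}, "
          f"12 cores = {without_smt:.1f}, advantage {advantage:.0%}")
```

At the low end of the assumed range (a 20% boost) the gap grows to about 25%, so the answer depends a lot on how much SMT actually helps this workload.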

bluefirestorm
Champion

With parallel/multithreaded programs, performance can be affected by different factors. Some things just cannot be parallelised because the result of one process is used as input to another process. There is also the complexity of the algorithms themselves, which may make them hard to parallelise or only partially parallelisable. So some test runs are really preferred, to see what sort of improvement can be expected.
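The effect of a partly serial workload is usually described with Amdahl's law, sketched below in Python. The parallel fractions are purely illustrative assumptions; I have no idea what the real fraction is for FreeSurfer.

```python
# Amdahl's law: speedup from N workers when only part of the work
# can run in parallel. Parallel fractions below are illustrative only.

def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Speedup over 1 worker when parallel_fraction of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)

for fraction in (0.50, 0.90, 0.99):
    row = ", ".join(f"{n} threads: {amdahl_speedup(fraction, n):.2f}x"
                    for n in (2, 4, 8, 16))
    print(f"parallel fraction {fraction:.0%} -> {row}")
```

The smaller the parallel fraction, the sooner extra threads stop paying off, which would be consistent with several 2-thread sessions giving more total throughput than one 16-thread session.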

I realise that the FreeSurfer tests will not be trivial and will probably take a long time (your other reply mentioned hours). Have you checked the FreeSurfer documentation, or asked the authors or other FreeSurfer users, whether there is a rule of thumb such as: given n CPUs and m threads, the optimal way to run the analysis is "openmp" with x cores/y threads in z Terminal sessions?

For the vCPUs, you could go to the VM folder and edit the VM's configuration file. The VM has to be powered off before you edit.

I tried this and changed it to 7

numvcpus="7"

on an Ubuntu 16.04 VM; "About this computer" then showed x7 and "System Monitor" showed 7.

So it might not be a problem at all for 11 or 14.
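If you would rather script the change than edit by hand, here is a minimal Python sketch that rewrites the numvcpus line in the text of a .vmx file. The function name and the sample file contents are made up for illustration; only the numvcpus key comes from my test above. Whether Player accepts a given value is still something you have to try.

```python
import re

# Matches an existing numvcpus line; [ \t]* avoids swallowing blank lines.
NUMVCPUS_LINE = re.compile(r'^[ \t]*numvcpus[ \t]*=.*$',
                           re.IGNORECASE | re.MULTILINE)

def set_numvcpus(vmx_text: str, count: int) -> str:
    """Return the .vmx text with numvcpus set to count (added if missing)."""
    new_line = f'numvcpus = "{count}"'
    if NUMVCPUS_LINE.search(vmx_text):
        return NUMVCPUS_LINE.sub(new_line, vmx_text)
    return vmx_text.rstrip("\n") + "\n" + new_line + "\n"

# Illustrative sample contents, not a complete real .vmx file
sample = '.encoding = "UTF-8"\nnumvcpus = "4"\nmemsize = "8192"\n'
print(set_numvcpus(sample, 11))
```

To apply it, read the VM's configuration file into a string, pass it through `set_numvcpus`, and write it back, with the VM powered off and a backup copy of the original file kept.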

Let's say that between 11 vCPUs without SMT and 14 vCPUs without SMT the performance still scales linearly with the number of vCPUs (everything else being equal, such as the vCPU/RAM ratio, number of terminal sessions, openmp threads, etc.); then you could expect 14 vCPUs to perform around 25-30% better than 11 vCPUs.

Also, you might want to keep a steady vCPU/virtual RAM ratio, for example 2 GB of virtual RAM per vCPU. Adding more vCPUs without adding the corresponding virtual RAM might just result in the vCPUs competing for virtual RAM, thereby slowing things down.
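As a quick check of those numbers, a Python sketch under the same assumptions (perfectly linear scaling and 2 GB of virtual RAM per vCPU, which is only my example ratio). Both function names are made up.

```python
# Linear-scaling estimate and matching virtual RAM for a fixed ratio.
# Both the linear scaling and the 2 GB/vCPU ratio are assumptions.

def linear_gain(old_vcpus: int, new_vcpus: int) -> float:
    """Relative gain if performance scales linearly with vCPU count."""
    return new_vcpus / old_vcpus - 1.0

def ram_for_vcpus(vcpus: int, gb_per_vcpu: float = 2.0) -> float:
    """Virtual RAM in GB needed to keep a constant GB-per-vCPU ratio."""
    return vcpus * gb_per_vcpu

print(f"14 vs 11 vCPUs: {linear_gain(11, 14):.0%} more throughput at best")
for vcpus in (11, 14):
    print(f"{vcpus} vCPUs -> {ram_for_vcpus(vcpus):.0f} GB virtual RAM")
```

Real scaling will likely fall short of this, since the host also needs CPU time and physical RAM for itself.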

Perhaps the 11 vCPU without SMT test might give you an idea what the 14 vCPU without SMT would be like. You could try the 12 vCPU without SMT but that seems to be really pushing it too far as it might starve both the host and VM at the same time.
