VMware Cloud Community
RParker
Immortal

SMP vs Multithreaded Applications

I found an article that's a bit old, but it still applies. The end result of these tests shows that two equal machines are better than a single machine split between two processors. The same holds true for VMs: it's better to have two separate but equal VMs to divide up the application load.

Multithreaded apps (which make up over 95% of ALL applications, perhaps even more) cause confusion: people assume that an application which can run multiple threads will automatically run them on multiple processors, which just isn't true. SMP is the ability to run instruction streams in parallel on two CPUs simultaneously. Adding a second CPU to a VM not only increases overhead, it will not give any benefit. There are VERY few true SMP applications; almost any good commercial application is multithreaded, but the two are NOT the same thing.
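To make the distinction concrete, here is a minimal sketch (my own illustration, not from the article below, with arbitrary loop counts): both variants start two POSIX threads, but only the variant whose threads work on independent data can keep two PUs busy. The variant that serializes every step on a shared lock is still "multithreaded", yet it gains nothing from a second processor.

#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define STEPS 100000000L

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static double shared_sum;          /* protected by lock */

/* Thread body with no shared state: two of these can run on two PUs. */
static void *independent_work(void *arg)
{
    double x = 0.0;
    for (long i = 0; i < STEPS; i++)
        x += i * 0.5;
    *(double *)arg = x;
    return NULL;
}

/* Thread body that grabs one global lock on every step: two of these
   are "multithreaded" but effectively take turns on a single CPU.     */
static void *serialized_work(void *arg)
{
    (void)arg;
    for (long i = 0; i < STEPS; i++) {
        pthread_mutex_lock(&lock);
        shared_sum += i * 0.5;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(int argc, char **argv)
{
    /* "./a.out"      -> independent threads, can use both PUs
       "./a.out lock" -> lock-bound threads, effectively one PU */
    int use_lock = (argc > 1 && strcmp(argv[1], "lock") == 0);
    void *(*body)(void *) = use_lock ? serialized_work : independent_work;
    double r1 = 0.0, r2 = 0.0;
    pthread_t a, b;

    pthread_create(&a, NULL, body, &r1);
    pthread_create(&b, NULL, body, &r2);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("done (%f %f %f)\n", r1, r2, shared_sum);
    return 0;
}

Built with gcc -O2 -pthread and timed on a dual-CPU box, the independent variant can occupy both PUs while the lock-bound one cannot, even though both are "multithreaded" programs.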

This is a snippet from the following website, but processors and applications have not changed much in many years; even 64-bit code isn't prevalent yet (our company doesn't even develop for 64-bit, because our customers don't see a reason to move to the 64-bit platform).

Expectations

Expectations for such an SMP machine are rather high; the bragging goes that a dual 366 MHz setup will beat a 650 MHz Pentium III or even an Athlon 600 in performance.

Such expectations will be the first casualty of real-life contact with a BP6 machine. Ideally, and for very few select applications, an SMP setup will double the processing power of a uniprocessor machine of the same clock speed.

Among these applications is RC5, distributed.net's encryption-cracking client. What enables this outstanding performance is multithreading (spawning a task for each PU) and very low memory access demands.

Other tasks are not as accommodating; in particular, it is rather difficult to find fully multithreaded applications that will naturally embrace SMP without any user interference. This is where multitasking and the operating system enter the picture.

Multiple tasks can be distributed across the available PUs if the operating system supports multiple PUs and implements load balancing that shifts work to the less occupied PU swiftly and on the fly.
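A rough sketch of that idea (an illustration, not the article's code): two completely independent single-threaded worker processes are started, and an SMP-aware kernel is free to place each one on its own PU; neither process needs to know the other CPU exists.

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* One single-threaded, CPU-bound task. */
static double burn_cpu(void)
{
    volatile double x = 0.0;
    for (long i = 0; i < 200000000L; i++)
        x += i * 0.5;
    return x;
}

int main(void)
{
    /* Start two independent processes; the SMP scheduler balances them. */
    for (int p = 0; p < 2; p++) {
        pid_t pid = fork();
        if (pid == 0) {
            burn_cpu();
            _exit(0);
        }
    }

    /* Wait for both workers to finish. */
    while (wait(NULL) > 0)
        ;
    puts("both tasks finished");
    return 0;
}

This is essentially what happens when two separate clients (or two separate VMs) are started side by side: no SMP awareness is needed in the application itself.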

Homogeneous Multitasking

Under this category I run the SETI@Home Linux client, one client on each PU, to compare the execution times per work unit with the single-task results further down.

Since SETI employs a very memory-bound algorithm, the FSB is under heavy stress and a lot of competition takes place between the PUs.
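The behaviour looks roughly like this sketch (not the actual SETI@Home code; the 16 MB working set is an arbitrary assumption): every pass streams through a buffer far larger than the Celeron's 128 KB L2 cache, so almost every access goes out over the shared FSB, and two copies of this loop on two PUs end up competing for the same bus.

#include <stdio.h>

#define ELEMS (16 * 1024 * 1024 / sizeof(double))   /* ~16 MB, far beyond L2 */

static double buf[ELEMS];

int main(void)
{
    double sum = 0.0;

    /* Streaming passes over the whole buffer: the working set never fits
       the cache, so the loop is bound by FSB bandwidth, not by the ALU.   */
    for (int pass = 0; pass < 50; pass++)
        for (size_t i = 0; i < ELEMS; i++)
            sum += buf[i];

    printf("%f\n", sum);   /* keep the result alive */
    return 0;
}

Two copies of something like this running at once is the situation measured in the 2x SETI@Home figure below.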

2x SETI@Home: 16:12:20 hours

Additionally, I run POVRay in two instances, each calculating one half (left and right) of the same image, with one process on each CPU.

This requests a lot of information from the memory subsystem at all times, as the rays touch varying parts of the world model and need information about the material properties.

2x POVRay 3.0: 23:14 min

The results have to be put into perspective against the uniprocessor results; see the comments further down.

For POVRay it is important to note that it doesn't multithread on its own; the image splitting must be done with the help of a short script (POVRay does support partial rendering) and the halves combined at the end of the run.

This approach makes sense for very long rendering runs; for animations it is better to render complete even-numbered frames on one CPU and complete odd-numbered frames on the other.
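For illustration, here is a minimal stand-in for such a split-and-stitch script (a sketch only; render_row is a placeholder, not POVRay): each forked child writes its half of the rows to its own file, and the parent concatenates the halves into the final image once both are done.

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define W 640
#define H 480

/* Placeholder for the actual ray tracer: fill one row of pixels. */
static void render_row(unsigned char *row, int y)
{
    for (int x = 0; x < W; x++)
        row[x] = (unsigned char)((x * y) & 0xff);
}

/* Render rows [y0, y1) into a file of raw pixel rows. */
static void render_part(const char *fname, int y0, int y1)
{
    FILE *f = fopen(fname, "wb");
    unsigned char row[W];
    if (!f)
        _exit(1);
    for (int y = y0; y < y1; y++) {
        render_row(row, y);
        fwrite(row, 1, W, f);
    }
    fclose(f);
}

int main(void)
{
    const char *parts[] = { "top.raw", "bottom.raw" };
    unsigned char row[W];
    size_t n;

    /* One child per PU: top half and bottom half of the image. */
    if (fork() == 0) { render_part(parts[0], 0, H / 2); _exit(0); }
    if (fork() == 0) { render_part(parts[1], H / 2, H); _exit(0); }
    while (wait(NULL) > 0)
        ;

    /* Stitch the two halves back together into one image. */
    FILE *out = fopen("image.raw", "wb");
    for (int p = 0; p < 2; p++) {
        FILE *in = fopen(parts[p], "rb");
        while ((n = fread(row, 1, W, in)) > 0)
            fwrite(row, 1, n, out);
        fclose(in);
    }
    fclose(out);
    return 0;
}

The author's actual setup used POVRay's own partial-rendering support driven by a short script; this only shows the orchestration pattern of splitting, rendering in parallel, and stitching.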

Heterogeneous Multitasking

In this category I run one instance of SETI@Home and one instance of POVRay on the system in parallel for as long as it takes SETI@Home to finish.

Both are single-threaded applications that are scheduled by the OS to run on different PUs, while both use a lot of memory accesses to fetch and deposit data.

POVRay serves as a blind background load that isn't measured, while we look closely at the SETI@Home result to compare it with the uniprocessor and homogeneous multitasking outcomes.
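The pattern being measured resembles this sketch (an illustration with arbitrary loop sizes, not the real clients): one child streams a buffer far larger than the L2 cache and hammers the FSB, the other does mostly computation on a tiny working set, so the two processes contend far less for the bus than two memory-heavy tasks would.

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#define ELEMS (16 * 1024 * 1024 / sizeof(double))

static double buf[ELEMS];

/* FSB-heavy task: streams a buffer much larger than the L2 cache. */
static void memory_heavy(void)
{
    volatile double sum = 0.0;
    for (int pass = 0; pass < 50; pass++)
        for (size_t i = 0; i < ELEMS; i++)
            sum += buf[i];
}

/* Comparatively FSB-light task: arithmetic on a tiny working set. */
static void memory_light(void)
{
    volatile double x = 1.0;
    for (long i = 0; i < 300000000L; i++)
        x = x * 1.0000001 + 0.5;
}

int main(void)
{
    if (fork() == 0) { memory_heavy(); _exit(0); }   /* one PU, roughly  */
    if (fork() == 0) { memory_light(); _exit(0); }   /* the other PU     */
    while (wait(NULL) > 0)                           /* the OS places    */
        ;                                            /* and balances them */
    puts("mixed workload finished");
    return 0;
}

Pairing tasks this way is the "good mix of memory-heavy and memory-light applications" recommended in the conclusions below; two copies of the FSB-heavy task would instead collide on the bus, as the 2x SETI@Home figures show.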

Application                 Time elapsed    loss/gain

SETI@Home (uniprocessor)    11:01:16         0 %

SETI@Home and POVRay        12:12:40        -10.8 %

2 instances of SETI@Home    16:12:20        +36.0 %

As you can see, the load on the GTL+ FSB is the crucial bottleneck in the SMP system: heavy memory accesses that won't fit in the L2 cache slow each instance down by almost 50%. (Two work units still finish in 16:12:20, about 8:06:10 per unit versus 11:01:16 on a single PU, which is where the +36.0% throughput gain comes from.)

On the other hand, a thoughtful combination of tasks can yield significantly improved results over uniprocessor systems of the same speed.

Single applications

Finally, as a comparison, I run one instance of SETI@Home and one instance of POVRay to see how SMP compares to the classical uniprocessor results on the same platform. Also added are the single-thread results from the kernel compile and RC5 tests.

Application       uniprocessor   SMP         speedup

RC5-64            3:43:26        1:51:54     99.6 %

POVRay 3.0        43:09          23:14       85.7 %

kernel compile    4:35.104       2:44.687    67.0 %

SETI@Home         11:01:16       16:12:20    36.0 %

Conclusions

We see that the benefits of SMP vary strongly with the applications being run. Multithreaded applications profit the most but are still a rare find - and they only matter if you're running a single task at a time.

Under multitasking conditions the operating system is well capable of acceptable load-balanced scheduling. In this scenario it is more important to achieve a good mix of memory-heavy and memory-light applications to make optimal use of the additional CPU power.

From our scores we can say that an SMP machine makes a good rendering machine for, e.g., POVRay, and even scientific heavyweights like FFT and cryptanalysis can show a good gain.

Software development, especially with many source modules, will also see a significant productivity leap. This goes double for the tedious debugging phase, where long compiles in between short test runs and code alterations can be quite aggravating.

FFT, because of its heavy FSB demands, can benefit more from running on two equally fast individual machines, but the extra cost may be prohibitive - the solution to this may be efficient distributed computing (unlike SETI@Home's).

oreeh
Immortal

FYI: this thread has been moved to the Performance Forum.

Oliver Reeh

VMware Communities User Moderator - http://communities.vmware.com/docs/DOC-2444
