Let's say I have a 4-core CPU, and I want to run some process in the minimum amount of time. The process is ideally parallelizable, so I can run chunks of it on an infinite number of threads and each thread takes the same amount of time. Since I have 4 cores, I don't expect any speedup by running more threads than cores, since a single core is only capable of running a single thread at a given moment. I don't know much about hardware, so this is only a guess. Is there a benefit to running a parallelizable process on more threads than cores? In other words, will my process finish faster, slower, or in about the same amount of time if I run it using 4000 threads rather than 4 threads?
asked Nov 11, 2009 at 22:19

I appreciate your question very much, but I somehow don't understand how your first assumption is relevant to the question, namely this sentence: "each thread takes the same amount of time."
Commented Nov 30, 2020 at 20:23

If your threads don't do I/O, synchronization, etc., and there's nothing else running, 1 thread per core will get you the best performance. However, that is very likely not the case. Adding more threads usually helps, but after some point they cause performance degradation.
Not long ago, I was doing performance testing on a dual quad-core machine running an ASP.NET application on Mono under a pretty decent load. We played with the minimum and maximum number of threads, and in the end we found out that for that particular application in that particular configuration the best throughput was somewhere between 36 and 40 threads. Anything outside those boundaries performed worse. Lesson learned? If I were you, I would test with different numbers of threads until you find the right number for your application.
One thing for sure: 4k threads will take longer. That's a lot of context switches.
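The "test and measure" advice above can be sketched as a small sweep over thread counts. This is an illustrative Python sketch, not code from the answer: the workload, task count, and the candidate thread counts are all placeholders you would swap for your real job.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def task(_):
    # Placeholder workload: a mix of blocking and computation,
    # standing in for whatever your real chunks actually do.
    time.sleep(0.01)
    return sum(i * i for i in range(1000))

def run_with(workers, n_tasks=64):
    """Time the whole job at a given thread-pool size."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(task, range(n_tasks)))
    return time.perf_counter() - start, results

# Sweep a few pool sizes and keep the fastest for this workload.
timings = {w: run_with(w)[0] for w in (1, 2, 4, 8, 16, 32)}
best = min(timings, key=timings.get)
print(f"best thread count for this workload: {best}")
```

The point is the sweep itself: the "right" number falls out of measurement on your machine, not out of first principles.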
answered Nov 11, 2009 at 22:28

I think Gonzalo's answer is good. I'd just add that you should experiment and measure. Your program will differ from his, or mine, or anyone else's, and only measurements of your own program's behaviour will answer your questions properly. The performance of parallel (or concurrent) programs is not an area where good conclusions can be drawn from first principles alone.
Commented Nov 11, 2009 at 22:34

+1, +answer: it surprises me that having many more threads than cores can result in better performance, although it makes some sense if more threads means a larger share of CPU time compared to competing threads. It would be nice if my application could detect differences in performance and automagically tune itself to the optimal number of threads.
Commented Nov 12, 2009 at 15:56

It shouldn't surprise you in a real-world scenario. Threads block waiting for I/O resources like disk access, network, etc., and also for non-I/O resources like other threads to finish using shared variables. What you really want to achieve is the minimum number of threads such that at least one thread per core can always be running.
Commented Nov 12, 2009 at 17:44

1 thread per core is not the optimum. It needs to be slightly more, preferably twice that, since this allows another thread to run when one is temporarily blocked, even if only on memory. This is more important if you have systems (P4, i7, Sun Rock, etc.) that feature SMT/HT.
Commented Dec 31, 2009 at 13:54

Hence the "that is very likely not the case" in my answer. Finding the right number depends on the application and the architecture it runs on.
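The blocking effect described in these comments is easy to demonstrate. Here is a minimal Python sketch, assuming tasks that spend most of their time blocked (simulated with `time.sleep`): with more threads than cores, another thread can run while one waits, so wall-clock time drops well below the one-thread-per-core figure.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor

def io_task(_):
    # Simulate a blocking call (disk, network, lock wait).
    time.sleep(0.05)
    return 1

def elapsed(workers, n_tasks=40):
    """Wall-clock time to finish all tasks with a given pool size."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(io_task, range(n_tasks)))
    return time.perf_counter() - start

cores = os.cpu_count() or 1
t_cores = elapsed(cores)
t_more = elapsed(cores * 8)
print(f"{cores} threads: {t_cores:.2f}s, {cores * 8} threads: {t_more:.2f}s")
```

For purely CPU-bound tasks the same sweep would show the opposite: extra threads past the core count buy nothing and eventually cost switching overhead.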
Commented Dec 31, 2009 at 15:21

I agree with @Gonzalo's answer. I have a process that doesn't do I/O, and here is what I've found:
Note that all threads work on one array but different ranges (two threads do not access the same index), so the results may differ if they've worked on different arrays.
The 1.86 machine is a MacBook Air with an SSD. The other Mac is an iMac with a normal HDD (I think it's 7200 rpm). The Windows machine also has a 7200 rpm HDD.
In this test, the optimal number was equal to the number of cores in the machine.
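For CPU-bound work like this (threads on one array, no I/O), "workers = cores" is the natural starting point. A minimal Python sketch of that setup, using a process pool because CPython's GIL prevents plain threads from running Python bytecode in parallel (the answer itself is language-agnostic; the chunk sizes here are arbitrary):

```python
import os
from concurrent.futures import ProcessPoolExecutor

def crunch(n):
    # CPU-bound chunk: no I/O, no shared state between workers.
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    workers = os.cpu_count() or 1        # one worker per (logical) core
    chunks = [200_000] * (workers * 4)   # independent ranges, like the test above
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(crunch, chunks))
    print(f"{workers} workers processed {len(chunks)} chunks")
```

Note that `os.cpu_count()` reports logical cores, so on an SMT/Hyper-Threading machine you may still want to benchmark both the physical and logical counts.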
answered May 20, 2012 at 2:55

+1 for the graph. Clearly 1 thread per core is best, but it's interesting that the quad-core system seems to not degrade at higher thread numbers.

Commented Sep 28, 2012 at 16:30
-1 for the graph! Smooth curves through integer-valued x-coordinates? A wild jump from 1 2 3 to 10 20 30 to 50 100? And y-coordinates that are multiples of 10 plus 2 for good measure. This is Excel's doing, isn't it?
Commented Dec 27, 2012 at 15:46

@Spacedman Yes it is. The smooth curves have a much nicer look IMHO. :D

Commented Dec 27, 2012 at 16:01

@PascalvKooten, the problem is not that it looks pretty; it's deceiving at first glance. First of all, the y-axis starts at 42, exaggerating the apparent difference between the tested machines. Secondly, the weird progression of x-axis values suggests that 'time taken' does not scale linearly with 'number of threads'; this is especially true for the blue line. I think the problem others (including myself) have with it is that it misrepresents the data.
Commented Dec 21, 2013 at 22:57

@Spacedman The critique of the graph is the most ridiculous thing I have come across in the last 24 hours. The graph helps. A lot. Period. Could it have been done better? No one cares. Smooth curves instead of discrete points? That is your problem. I assume none of you would ever include such a graph in an answer because you don't have the extra time/energy to make it look good. That is my point.
Commented Nov 17, 2014 at 1:59

I know this question is rather old, but things have evolved since 2009.
There are two things to take into account now: the number of cores, and the number of threads that can run within each core.
With Intel processors, the number of threads per core is defined by Hyper-Threading, which is just 2 (when available). But with Hyper-Threading the two hardware threads share a single core's pipeline, so each runs slower than it would on a full core (i.e. 1 pipeline shared between two processes — this is good when you have more processes than cores, not so good otherwise; more cores are definitively better!). Note that modern CPUs generally have more execution units to divide the workload across, so throughput is no longer really cut in half. But Hyper-Threading still shares a lot of the CPU units between the two threads (some call those logical CPUs).
On other processors you may have 2, 4, or even 8 threads per core. So if you have 8 cores, each of which supports 8 threads, you could have 64 threads running in parallel without context switching.
"No context switching" is obviously not true if you run with a standard operating system which will do context switching for all sorts of other things out of your control. But that's the main idea. Some OSes let you allocate processors so only your application has access/usage of said processor!
From my own experience: if you have a lot of I/O, multiple threads are good. If you have very memory-intensive work (read source 1, read source 2, fast computation, write), then having more threads doesn't help. Again, this depends on how much data you read/write simultaneously; for example, if you use SSE 4.2 to read 256-bit values, that saturates memory bandwidth and stalls all the threads in step. In other words, 1 thread is probably a lot easier to implement and probably nearly as fast, if not actually faster. This will depend on your process and memory architecture: some advanced servers manage separate memory ranges for separate cores, so separate threads will be faster there, assuming your data is properly placed. That is why, on some architectures, 4 processes will run faster than 1 process with 4 threads.