We have dual Xeon CPUs with 6 cores each. However, with the default BIOS setting of HyperThreading enabled, the system reported 24 cores. When I ran an analysis, only one of the 24 cores was being used. So I thought that if I disabled HyperThreading, I might see a doubling of solution speed. That was not the case.
I tested a basic model with 80k elements, including CGAP, RBE2, RBE3, brick, and shell elements. Each run includes four matrix-factorization contact iterations, so a good part of the solution time is spent in parallel. The following two graphs show the results versus the absolute number of threads, and then versus the percentage of apparently available cores (12 without HyperThreading, 24 with).
The first graph shows that if you use more than the physical number of cores on a single CPU (6 in this case), the HyperThreading solution time starts increasing. In almost all other cases (except slightly in the single thread case) HyperThreading provides an advantage. In fact, using 6 threads with HyperThreading is faster than using 12 (all available cores) without HyperThreading in this case.
So, keep HyperThreading on, and set the number of threads equal to the number of physical cores on a single CPU (in this case 6 threads for 6 physical cores on one of the two available CPUs).
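For anyone who wants to reproduce this kind of thread-count sweep without tying up a Nastran licence, here is a minimal Python sketch; `hash_block()` is a hypothetical stand-in for the real solve (CPython's hashlib releases the GIL on large buffers, so plain threads can actually use several cores for this workload):

```python
import hashlib
import os
import time
from concurrent.futures import ThreadPoolExecutor

def hash_block(seed: int, rounds: int = 500) -> str:
    """CPU-bound stand-in for one slice of solver work. hashlib releases
    the GIL while hashing large buffers, so threads scale across cores."""
    buf = bytes([seed % 256]) * 32768
    digest = b""
    for _ in range(rounds):
        digest = hashlib.sha256(buf + digest).digest()
    return digest.hex()

def sweep(thread_counts, jobs=4):
    """Time the same fixed workload at several thread counts."""
    timings = {}
    for n in thread_counts:
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=n) as pool:
            list(pool.map(hash_block, range(jobs)))
        timings[n] = time.perf_counter() - start
    return timings

if __name__ == "__main__":
    # os.cpu_count() reports logical (HT) cores, not physical ones.
    print("logical cores reported:", os.cpu_count())
    for n, t in sweep([1, 2, 4, 8]).items():
        print(f"{n} threads: {t:.3f} s")
```

Swap `hash_block` for the real workload and extend the thread counts up to the logical core count to recreate graphs like the ones above.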
Let me check whether I understand you correctly: "on a single CPU with 6 cores and HyperThreading ON, it is faster to set PARALLEL=6 than PARALLEL=12"?
Yes. Maybe the model is too small to benefit from more than 6 threads, or maybe once both CPUs are in use, some local cache optimization is no longer possible.
A late thought. I didn't even look at the graphs. They may be telling us something, but I don't think it is useful.
Has anyone ever seen 8 processes/threads on a quad-core running at 13% processor load each? I don't think so. There are at most 4 threads running at the top speed of 13% (Task Manager splits the total 100% across all 8 logical cores, so one fully busy logical core shows as 100/8, about 13%).
That is, HT is very useful for everyday mundane tasks. When you are trying to load a processor to its maximum, it is less so: never forget that on, say, a quad-core there are only 4 physical cores.
Scheduling 4 threads on a quad-core will usually leave only one of them competing with the other processes running on the system; the remaining 3 are free to run at full physical core speed. Scheduling 8 threads makes them all compete with each other for the only 4 physical cores, not to mention with the other processes. There will be many more context switches, and in any case pairs of threads will have to time-share the same physical core. HT helps in principle, but the sharing still carries an overhead.
The above reasoning applies to any number of physical cores, not just 4.
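The "13%" figure above is just how Task Manager reports load on a quad-core with HT: the total 100% is divided evenly across the 8 logical cores. A quick sketch of that arithmetic (the function name is mine, not a real API):

```python
def taskman_display_load(busy_threads: int, logical_cores: int) -> float:
    """Total CPU percentage shown when `busy_threads` threads each
    saturate one logical core, with 100% divided evenly across all
    logical cores (how the classic Task Manager reports it)."""
    per_core = 100.0 / logical_cores
    # Extra threads beyond the logical core count cannot add load.
    return min(busy_threads, logical_cores) * per_core

# Quad-core with HT -> 8 logical cores: each fully busy thread shows ~13%.
print(taskman_display_load(1, 8))  # 12.5
print(taskman_display_load(4, 8))  # 50.0
```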
I always set the number of processors to the number of physically available cores for Nastran analyses. I don't expect this to lead to a 2x or 10x gain in speed, but I am sure I will see no performance loss.
We have dual six-core processors, and from the data it seemed that once the job spilled over onto the second processor (i.e. more than 6 threads), there was actually a performance hit with HT enabled. At the 6-thread point, HT and no-HT were exactly the same. So for 6 threads or fewer (i.e. one chip), HT was at least as fast, and below 6 threads it was faster.
If you have one chip, I think having HT enabled would always be faster, but that's better left to testing.
I think there are some misunderstandings here.
I would never disable HT; I just said that I configure Nastran to run with the number of physical cores, so HT is not needed. It is still there; it simply has nothing to do.
I don't know whether this is special Nastran management of processor affinity or it is implicitly handled by the system (I would suspect the latter, as it happens with other programs too), but one can observe the following behaviour in Task Manager on a single chip with 4 physical / 8 logical cores: specifying growing numbers of threads gets the logical processors busy in the following order: 0, 2, 4, 6, 1, 3, 5, 7.
That is, the system knows that, e.g., logical cores 2 and 3 are on the same physical core, and will avoid using them both if possible.
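The fill order described above (0, 2, 4, 6, 1, 3, 5, 7) is simply "one logical core per physical core first, then their HT siblings". A tiny sketch, assuming the usual numbering where logical cores 2k and 2k+1 share physical core k:

```python
def fill_order(logical_cores: int) -> list:
    """Order in which the scheduler appears to hand out logical cores:
    one logical core per physical core first (even IDs), and only then
    their HT siblings (odd IDs)."""
    evens = list(range(0, logical_cores, 2))
    odds = list(range(1, logical_cores, 2))
    return evens + odds

print(fill_order(8))  # [0, 2, 4, 6, 1, 3, 5, 7]
```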
Please note that HT only applies to threads on a single pair of logical cores on the same physical core; physical cores run threads fully independently.
I find it somewhat hard to follow your conclusions for dual chips with 6 physical cores (and most likely 12 logical cores) each. At least at precisely 6 threads, HT is reported as both exactly the same and faster than no-HT. Anyway, I suspect that for your configuration the system will gradually schedule the logical cores in the following order: even cores on chip 1 (Task Manager processors 0 to 10), even cores on chip 2 (12 to 22), odd cores on chip 1 (1 to 11), odd cores on chip 2 (13 to 23).
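That guessed order is the same evens-then-odds idea applied chip by chip. A small sketch (the layout is my guess, not something queried from the OS):

```python
def dual_chip_fill_order(chips: int = 2, logical_per_chip: int = 12) -> list:
    """Guessed scheduling order for a multi-chip box: first one logical
    core per physical core on each chip in turn (even IDs), then the
    HT siblings (odd IDs), again chip by chip."""
    order = []
    for parity in (0, 1):  # even logical cores first, then odd
        for chip in range(chips):
            base = chip * logical_per_chip
            order.extend(range(base + parity, base + logical_per_chip, 2))
    return order

print(dual_chip_fill_order())
# [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22,
#  1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23]
```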
It seems that you somehow avoid crossing the 6-thread border, and that something bad happens beyond it, with a 7th thread or more. Did you really experience degraded performance with more than 6 threads? I can only guess this would be due to concurrent memory access from 2 distinct chips; I doubt the scheme above is not applied (i.e. I doubt all logical cores of the first chip are employed before chip 2 is touched).
Note again that Nastran is also disk-intensive, and comparing performance by thread count alone may not be very representative. Disk operations will degrade all multithread options equally: multi-chip, multiple physical cores, multiple logical cores.
So, finally, anyone reading my previous post on the topic, and this one as well, should take them as either:
- a personal recommendation: don't get enthusiastic about large numbers of logical cores; schedule only the number of physical cores for Nastran threads. (One can also leave one core for the system, but there are always leftovers that Nastran can pick up when it is in real need.)
- a request for comments on Nastran multithreading habits, addressed to whoever has a better understanding of modern processors.