P vs E cores in Open MPI #11345
I remember raising this while at Intel - IIRC, the answer was "nobody should be using these processors for MPI". They're not really designed for that purpose. The best we could devise was to use the "pe-list" option to select only the p-cores, as the e-cores are pretty useless for this application. It's a workaround, but probably your best answer if you insist on using such processors for HPC. My guess is that someone is just trying to run code on a laptop for test purposes - in which case, restricting to the p-cores is probably just fine.
I am fine with using only the P cores for Open MPI. I do not have access to such a processor, and I do not know how they are presented.
FWIW, I asked the user to run
I honestly don't know how it is presented. I couldn't get the processor team to have any interest in hwloc support back then. The processor was designed in partnership with Microsoft specifically for Windows (which MS custom-optimized for it), and MS had no interest in hwloc support. I'm guessing hwloc should still be able to read something on it anyway. If they have hwloc on that box, then just have them run
There is something fishy here: according to the description, it should be 16 cores (8+8, unlike the 8+4 I wrote earlier) and 24 threads (8*2 + 8), but Open MPI does not report this. I am now clarifying this, and I guess I'll then have to wait for @bgoglin's insights.
Hello. hwloc reports different "cpukinds" (a cpuset + some info). We don't tell you explicitly which one is P or E (sometimes there are 3 kinds on ARM already), but kinds are reported in an order that goes from power-efficient cores to power-hungry cores. This is in hwloc/cpukinds.h since hwloc 2.4. You likely want to call hwloc_cpukinds_get_nr(topo, 0) to get the number of kinds, and then call hwloc_cpukinds_get_info(topo, nr-1, cpuset, NULL, NULL, NULL, 0) to get your (pre-allocated) cpuset filled with the list of power-hungry cores. This should work on Windows, Mac and Linux on ARM, Intel AlderLake and M1, although the way we detect heterogeneity is completely different in all these cases.
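For anyone who wants to try that, here is a minimal sketch (not taken from any Open MPI code) of what the query could look like, assuming hwloc >= 2.4 and linking against hwloc (e.g. cc p_cores.c $(pkg-config --cflags --libs hwloc)):

```c
#include <stdio.h>
#include <stdlib.h>
#include <hwloc.h>
#include <hwloc/cpukinds.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* number of cpu kinds; <= 1 means homogeneous (or heterogeneity unknown) */
    int nr = hwloc_cpukinds_get_nr(topo, 0);
    if (nr > 1) {
        hwloc_bitmap_t cpuset = hwloc_bitmap_alloc();
        int efficiency = -1;
        /* kinds are ordered from most power-efficient to most power-hungry,
         * so index nr-1 should correspond to the P cores on Alder Lake */
        if (hwloc_cpukinds_get_info(topo, nr - 1, cpuset, &efficiency, NULL, NULL, 0) == 0) {
            char *str;
            hwloc_bitmap_asprintf(&str, cpuset);
            /* efficiency comes back as -1 when hwloc could not rank the kinds */
            printf("most power-hungry kind: PUs %s (efficiency %d)\n", str, efficiency);
            free(str);
        }
        hwloc_bitmap_free(cpuset);
    } else {
        printf("%d cpu kind(s) reported: nothing to filter\n", nr);
    }

    hwloc_topology_destroy(topo);
    return 0;
}
```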
Thanks @bgoglin, I will experiment on an M1 (since this is all I have) to see how I can "hide" the E cores from Open MPI.
@bgoglin just to be clear, does
So when you call get_info(), pass an "int efficiency" in hwloc_cpukinds_get_info(topo, nr-1, cpuset, &efficiency, NULL, NULL, 0) and check whether you get -1 in there.
You cannot trust those dots, @ggouaillardet - the print statement isn't that accurate (it's actually rather dumb, to be honest).
I already told you - you just have to list the PEs you want to use. It would take a significant change to PRRTE (or ORTE for an older version) to try and do this internally. I doubt it would be worth the effort - like I said, these chips are not intended for HPC, and won't run well in that environment.
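To make that concrete, listing the P-core PUs explicitly could look roughly like the sketch below. The CPU numbers are an assumption (they presume the 8 P cores expose PUs 0-15, two hardware threads each), and the exact option spelling varies between Open MPI releases, so verify with lstopo and mpirun --help first:

```sh
# Sketch only, not a verified recipe: run 8 procs on the first HT of each
# assumed P core. Open MPI 5.x / PRRTE exposes this as a "pe-list" mapping
# directive; 4.x has a --cpu-list option instead.
mpirun --map-by pe-list=0,2,4,6,8,10,12,14 -np 8 ./my_app
```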
only use the performance cpus when true (default is false)
requires hwloc >= 2.4
Refs. open-mpi#11345
Signed-off-by: Gilles Gouaillardet <[email protected]>
Thanks for helping me to post my question here. I didn't intend to do a real HPC job on this laptop, but I want to take advantage of the multiple cores to speed up some data processing (40k+ satellite data files and 200k+ model output files, less than 100M each). The processing is pretty repetitive and is perfect for lazy parallelization. The issue is that OpenMPI does not recognize the cores correctly, so I am not sure how it does the scheduling. OpenMPI complains when I set the process count to 16. It would be great if I could use all 16 cores. If not, having some control over which cores to use would be ideal, for example, use P-cores for faster processing and E-cores for thermal concerns.
Yeah, I kinda figured that was the situation. You have a few simple options:
Thanks for the reply, although I am not sure I can follow. What confuses me is that OpenMPI (and/or Ubuntu 22.04) can only see 12 cores (12 x 2 threads = 24), although there are actually 16 cores (8 x 2 + 8 x 1 = 24 threads). If it gets the total number of cores wrong, it may mess up the scheduling onto those cores too (four cores go missing). If OpenMPI can only see 12 cores, I assume that is all it will use. What if I want to use all 16 cores, with one process on each? OpenMPI complains if I try that. Another question is why?
You are overthinking things 😄 If you simply run the job with however many processes you want, you shouldn't care what hyperthread gets used for any given time slice by whatever process is being executed during that time slice. The OS will manage all that for you. This is what the OS does really well. Trying to do any better than that is a waste of your energy.

It doesn't matter what mpirun "sees" or doesn't "see". Its sole purpose is to start N procs, and then get out of the way and let the OS do its thing. Asking mpirun to try and optimize placement and binding on this kind of processor will only yield worse results.
Thanks, @rhc54. I was worried that the OS is confused too, because Ubuntu 22.04 (5.15.79.1-microsoft-standard-WSL2) also sees only 12 cores (24 threads), although the host Windows 11 recognizes the CPU correctly.
Understood. The problem is that we cannot do any better than your OS is doing. No matter what options you pass to mpirun, I'm limited to what the OS thinks is present. What you are seeing is the difference between Windows (being optimized to work with this architecture) and Ubuntu (which isn't). There is nothing anyone can do about that, I'm afraid - unless someone at Ubuntu wants to optimize the OS for this architecture, which I very much doubt.

Your only other option would be to switch to Microsoft's MPI, which operates under Windows. I don't know their licensing structure and it has been a long time since I heard from that team (so this product might not even still exist) - but if you can get a copy, that would support this chip.

Otherwise, the best you can do is like I said - just run it oversubscribed (with however many procs you think can run effectively - probably a matter of experimentation) and let the OS do the best it can.
Are you running native Linux, or Linux inside a VM or WSL? If the latter, that could explain why only 12 cores are reported.
@ggouaillardet I am using Ubuntu 22.04 in WSL2. The kernel version is 5.15.79.1-microsoft-standard-WSL2. |
Last time I saw hwloc running on WSL on Windows, the WSL Linux kernel was reporting correct information in sysfs, hence hwloc did too. But I never tried on a hybrid machine. What's wrong above is lscpu, either because the kernel reports something wrong, or because lscpu isn't hybrid-aware yet: it sees 24 threads in the socket and 2 threads in the first core, and decides that means 24/2 = 12 cores. Running lstopo would clarify this.
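For reference, a few standard ways to cross-check the core/thread layout from inside WSL or native Linux (these are stock hwloc and util-linux tools; what they print on this particular machine is not shown in the thread):

```sh
# Text-mode hwloc topology dump (same data lstopo uses)
lstopo-no-graphics
# Per-CPU view from util-linux: the CORE column shows which PUs share a core
lscpu --extended
# Raw kernel view: one core_id per logical CPU
grep . /sys/devices/system/cpu/cpu*/topology/core_id
```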
I'm not sure I agree with the assertion that lscpu is doing something "wrong". WSL isn't "limiting" the number of cores - it is simply logically grouping the available hyperthreads into two-HT "cores" - i.e., you have 12 "cores", each with 2 HTs. Native Ubuntu is logically grouping them into 8 "cores" each with 2 HTs, and 8 "cores" each with 1 HT. It all just depends on how the OS intends to manage/schedule the HTs. Neither is "correct" or "wrong" - they are just grouped differently.

If you have hyperthreading enabled (which you kinda have to do with this processor), it really won't matter, as the kernel scheduling will be done at the HT level - so how they are grouped is irrelevant. What matters is if and how the kernel is scheduling the p-cores differently from the e-cores. IIRC, Windows was customized to put compute-heavy process threads on the p-cores, and lighter operations on the e-cores. So as your job continued executing, it would migrate the more intense operations to the p-cores (e.g., your computational threads) and the less intense ones to the e-cores (e.g., threads performing disk IO, progress threads that tend to utilize little time, system daemons). I'm not sure how Ubuntu is optimized - probably not as well customized, so it may just treat everything as equal and schedule a thread for execution on whatever hyperthread is available. Or it may do something similar to what Windows is doing.

Point being: the processor was designed with the expectation that the OS would migrate process threads to the "proper" HT for the kind of operations being performed. In this architecture, the worst thing you can do is to try and preempt that migration. Best to just let the OS do its job. You just need to add the "oversubscribe" qualifier to the --map-by directive so that mpirun won't error out if you launch more procs than there are "cores" (or HTs if you pass the --use-hwthread-cpus option).
@bgoglin I think you are right that it might not be due to WSL limiting the number of available cores. If WSL limited the number of cores, it shouldn't see 24 threads.
The number of threads (24) is correct, so WSL is not limiting anything. Here is the info I requested on SO:
We seem to be spending a lot of time chasing ghosts on this thread, so I'll just crawl back under my rock. There is no limitation being set here. OMPI sees the same number of HTs on each system you have tried. mpirun just needs to be told to consider HTs as independent cpus so it can set oversubscription correctly. You don't want to bind your procs - you need to let the OS manage them for you. That is how the processor was designed to be used.
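Concretely, that advice maps to something like the commands below (a sketch: ./my_app and the process count are placeholders, and option spellings can differ slightly between Open MPI 4.x and 5.x, so check mpirun --help):

```sh
# Treat each hardware thread as a schedulable cpu, allow more procs than
# "cores", and skip binding so the kernel is free to migrate between P/E cores.
mpirun --use-hwthread-cpus --map-by hwthread:OVERSUBSCRIBE --bind-to none -np 16 ./my_app
# Older 4.x releases also accept the standalone flag:
mpirun --use-hwthread-cpus --oversubscribe --bind-to none -np 16 ./my_app
```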
lstopo_output.log
@rhc54 Thanks a lot for the explanations! I think I am more at ease using OpenMPI on this machine now.
Thanks, I confirm. So I am afraid there is no trivial way to use 8P + 8E cores (e.g. ignore the second hyperthread on the P cores).
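If someone did want to assemble that 8P + 8E list by hand anyway, hwloc's command-line calculator can express "first PU of every core". A sketch (it assumes the 16 cores are numbered 0-15, and the resulting OS indexes would still have to be fed to whatever cpu-list/pe-list option the installed mpirun understands):

```sh
# Sketch: print the OS indexes of the first PU of each core (P and E alike),
# assuming cores 0-15; double-check against your lstopo output first.
hwloc-calc --po --intersect pu core:0-15.pu:0
```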
@ggouaillardet Thanks a ton! This helps a lot.
I just saw this question on Stack Overflow:
https://stackoverflow.com/questions/75240988/openmpi-and-ubuntu-22-04-support-for-using-all-e-and-p-cores-on-12th-gen-intel-c
TL;DR: on a system with 8 P cores (2 threads each) and 8 E cores (1 thread each), is there a (ideally user-friendly) way to tell Open MPI to only use the P cores?
@bgoglin what kind of support is provided by hwloc with respect to P vs E cores?