Kernel's NUMA balancing vs. OpenMPI's affinity #11357
Comments
I'd suggest first just finding out if the app moves at all during execution. So my suggestion is to write an app that detects its affinity and caches it right after it starts, then does some communication and computation while periodically detecting its affinity and comparing it to the initial one. If something changes, print out the new affinity and when it appeared. You can do this for all four of your cases. If nothing changes, then you can reduce your benchmark to just cases (a) and (b), plus one case with
Thanks for your comments. There is not much information discussing this feature of the kernel. The only two references I could find advise against using auto NUMA balancing:
An HPC application such as QE has different workloads during its course of operation, so I can understand that what the kernel applies to one workload will not be suitable for another.
Here NVIDIA advises tuning the process placement directly with
I gather that there is nothing to prevent the kernel from overruling Open MPI's affinity setting; hence the advice above. For now, I can write a toy model that allocates a large data set and monitors the page migration from
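One rough way to watch that page migration while such a toy model runs is to poll the kernel's NUMA counters in `/proc/vmstat` (e.g. `numa_pages_migrated`, `numa_hint_faults`); a minimal sketch, with counter names taken from the Linux kernel rather than anything Open MPI provides:

```c
/* Sketch: print the kernel's NUMA counters from /proc/vmstat.
 * Counters such as numa_hint_faults and numa_pages_migrated only advance
 * when automatic NUMA balancing is active; the names are kernel-provided. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *fp = fopen("/proc/vmstat", "r");
    char line[256];

    if (fp == NULL) {
        perror("fopen /proc/vmstat");
        return 1;
    }
    while (fgets(line, sizeof(line), fp) != NULL) {
        if (strncmp(line, "numa_", 5) == 0) {   /* keep only NUMA counters */
            fputs(line, stdout);
        }
    }
    fclose(fp);
    return 0;
}
```

Running this before and after the benchmark (or periodically during it) gives a rough view of how many pages the kernel actually migrated.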
I would advise against generalizing that NVIDIA advice - that cmd line locks all of your processes to the first NUMA node on each machine. I'm not sure why anyone would want to do that on a machine that has more than one NUMA domain, so I'm guessing that their container is configured somehow to ensure that makes sense. You usually achieve better results with
It would be a rather odd kernel that did so. All OMPI does is tell the kernel "schedule this process to execute only on the indicated CPU(s)". The kernel is supposed to honor that request. Even autonuma respects it:
Which is why you would want to ensure that you let

Note that there are times when you do want to let the kernel take over. We discussed this recently on another issue (see #11345). Outside of those circumstances, it is usually better to tell the kernel where the process should run.
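For illustration of what "telling the kernel where the process should run" amounts to underneath Open MPI's binding options, here is a minimal sketch using the Linux `sched_setaffinity(2)` call; the choice of CPU 0 is arbitrary, and Open MPI of course computes the real mask from your `--map-by`/`--bind-to` settings:

```c
/* Sketch: pin the calling process to a single CPU via sched_setaffinity(2).
 * CPU 0 is an arbitrary illustrative choice; Open MPI derives the actual
 * mask from its mapping/binding policy. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    CPU_SET(0, &mask);          /* request execution on CPU 0 only */

    if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
        perror("sched_setaffinity");
        return EXIT_FAILURE;
    }
    printf("pinned to CPU 0; the scheduler is expected to honor this mask\n");
    return EXIT_SUCCESS;
}
```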
Here is a simple code snippet you could periodically run:

```c
#define _GNU_SOURCE   /* for sched_getaffinity() and the CPU_* macros */
#include <assert.h>
#include <sched.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

void print_affinity() {
    cpu_set_t mask;
    long nproc, i;

    if (sched_getaffinity(0, sizeof(cpu_set_t), &mask) == -1) {
        perror("sched_getaffinity");
        assert(false);
    }
    nproc = sysconf(_SC_NPROCESSORS_ONLN);
    printf("sched_getaffinity = ");
    for (i = 0; i < nproc; i++) {
        printf("%d ", CPU_ISSET(i, &mask));
    }
    printf("\n");
}
```

Instead of printing it out, save the initial "mask" you get and then periodically run the code to check the newly returned value against the one you obtained at the beginning of execution. If they match, then you haven't moved. So something like this:

```c
/* "original_mask" is the cpu_set_t saved right after startup. */
extern cpu_set_t original_mask;

void check_affinity() {
    cpu_set_t mask;

    if (sched_getaffinity(0, sizeof(cpu_set_t), &mask) == -1) {
        perror("sched_getaffinity");
        assert(false);
    }
    /* cpu_set_t cannot be compared with !=; use the CPU_EQUAL() macro. */
    if (!CPU_EQUAL(&mask, &original_mask)) {
        printf("IT MOVED\n");
    }
}
```

You can print out the original vs. current if you like. Obviously, that is not something you want to do in an actual application - strictly a research tool to see what is happening.
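As a rough sketch of how those two helpers might be wired into a standalone test (the iteration count and the dummy work loop are placeholders, not anything taken from Open MPI):

```c
/* Sketch of a harness around the two helpers above. Save the affinity
 * mask right after startup, then recheck it periodically during "work". */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

cpu_set_t original_mask;            /* definition used by check_affinity() */

void print_affinity(void);          /* from the snippets above */
void check_affinity(void);

int main(void)
{
    /* Cache the affinity mask right after startup. */
    if (sched_getaffinity(0, sizeof(cpu_set_t), &original_mask) == -1) {
        perror("sched_getaffinity");
        return EXIT_FAILURE;
    }
    print_affinity();

    for (int iter = 0; iter < 100; iter++) {
        /* ... do some communication/computation here ... */
        check_affinity();           /* flag any change from the initial mask */
    }
    return EXIT_SUCCESS;
}
```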
Pardon me for not being clear regarding the NGC container. The aforementioned command is just an example using 1 GPU.
which translates to

Among the benchmarks, the performance of HPL is most sensitive to the NUMA affinity setting. The HGX variant has a very unusual one, where the 8 GPUs are mapped to NUMA domains 7, 7, 5, 5, 3, 3, 1, 1 on AMD EPYC. I believe Open MPI supports this via
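To cross-check that kind of GPU-to-NUMA mapping on the node itself, a small sketch using hwloc (the topology library Open MPI itself relies on) can list each NUMA domain and its local CPUs; the GPU side of the mapping would come from something like `nvidia-smi topo -m`, which is outside this sketch. Build with `-lhwloc`.

```c
/* Sketch: enumerate NUMA nodes and their local CPUs with hwloc,
 * to be compared against the reported GPU-to-NUMA-domain mapping. */
#include <hwloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    int n = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NUMANODE);
    for (int i = 0; i < n; i++) {
        hwloc_obj_t node = hwloc_get_obj_by_type(topo, HWLOC_OBJ_NUMANODE, i);
        char *cpus;
        hwloc_bitmap_asprintf(&cpus, node->cpuset);
        printf("NUMA node %u: CPUs %s\n", node->os_index, cpus);
        free(cpus);
    }
    hwloc_topology_destroy(topo);
    return 0;
}
```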
Yes, besides HPL, other HPC applications such as VASP and QE do show some performance gain when binding to a socket, in cases where there are multiple GPUs per socket.
It was an oversight on my part. Thanks for pointing this out. It makes sense for the kernel to respect the user's request. Thanks for the sample code. We will experiment according to your suggestions.
Hi,
I'm trying to figure out the interaction between automatic NUMA balancing and OpenMPI's process affinity implementation.
I would like to deduce the potential outcomes before carrying out actual benchmarks.
As I understand it:

Considering Zen 3, where the 8 cores in a CCX share an L3 cache, `--map-by l3cache` can ensure maximal data locality between threads per the first-touch policy. In such a case: `--bind-to none`.

A heterogeneous system adds another layer of complication.
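To make the first-touch point above concrete, here is a small sketch (my own illustration, not from the thread) of letting the worker threads initialize the data they will later use, so pages land on each thread's local NUMA node; it assumes OpenMP (compile with e.g. `gcc -fopenmp`) and the default first-touch page placement on Linux.

```c
/* Sketch: first-touch placement. Pages of "a" are faulted in (and therefore
 * placed) on the NUMA node of the thread that first writes them, so having
 * the worker threads do the initialization keeps later accesses local.
 * The array size is arbitrary. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const size_t n = 1 << 24;                 /* ~16M doubles, illustrative */
    double *a = malloc(n * sizeof(double));
    if (a == NULL) return EXIT_FAILURE;

    /* First touch: each thread initializes the chunk it will work on. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++) {
        a[i] = 0.0;
    }

    /* Later work uses the same static schedule, so accesses stay local. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++) {
        a[i] += 1.0;
    }

    printf("a[0] = %f\n", a[0]);
    free(a);
    return EXIT_SUCCESS;
}
```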
For the benchmark, the trend should be as follows:

a. no auto-NUMA, `--bind-to none` (baseline)
b. auto-NUMA, `--bind-to none`
c. no auto-NUMA, Open MPI's affinity
d. auto-NUMA, Open MPI's affinity

Depending on the workload, (a) and (b) should show some performance variation across repeated runs. (c) would be slightly better than (d) due to the absence of profiling overhead from the kernel.
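For cases (a) through (d), one way to confirm which auto-NUMA state a given run actually had is to read the standard Linux sysctl before starting the benchmark; a minimal sketch (toggling the knob is normally done with `sysctl kernel.numa_balancing` or by writing the file directly, which requires root):

```c
/* Sketch: report whether automatic NUMA balancing is currently enabled.
 * Reads the standard Linux sysctl file; 0 = disabled, nonzero = enabled. */
#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("/proc/sys/kernel/numa_balancing", "r");
    int enabled = -1;

    if (fp == NULL) {
        perror("fopen /proc/sys/kernel/numa_balancing");
        return 1;
    }
    if (fscanf(fp, "%d", &enabled) != 1) {
        fprintf(stderr, "could not parse numa_balancing value\n");
        fclose(fp);
        return 1;
    }
    fclose(fp);

    printf("automatic NUMA balancing is %s (value %d)\n",
           enabled ? "enabled" : "disabled", enabled);
    return 0;
}
```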
I would appreciate it if you could give some comments and share your insights into this matter.
Thanks.