Description
Hi,
I'm trying to figure out the interaction between automatic NUMA balancing and Open MPI's process affinity implementation.
I would like to reason about the likely outcomes before carrying out actual benchmarks.
As I understand it:
- (1) The kernel attempts to maximize data locality by periodically profiling memory access patterns and migrating threads/pages.
- (2) By default, Open MPI binds processes either to core (np <= 2) or to socket (np > 2). Processes can also be mapped to other levels of the hardware hierarchy such as socket, numa, l1cache, l2cache and l3cache (a quick check follows this list).
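For context, this is roughly how I plan to check both knobs before a run (the sysctl and mpirun options are the standard ones; `./my_app` is a placeholder for the actual benchmark):

```sh
# (1) Check / toggle automatic NUMA balancing (1 = on, 0 = off)
cat /proc/sys/kernel/numa_balancing
sudo sysctl kernel.numa_balancing=0

# (2) Have Open MPI print the binding it actually applies for each rank
mpirun -np 4 --report-bindings ./my_app
```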
Considering Zen 3, where the 8 cores in a CCX share an L3 cache, --map-by l3cache
can ensure maximal data locality between a rank's threads under the first-touch policy. In such a case:
- (1) becomes redundant at best and OS noise at worst.
- (1) only makes sense when (2) is disabled with --bind-to none (see the sketch after this list).
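Concretely, on a hypothetical 64-core Zen 3 node (8 CCXs, 8 cores each), the two configurations I have in mind would look roughly like this (rank count and `./my_app` are placeholders):

```sh
# One rank per CCX, bound to its L3 domain, 8 OpenMP threads per rank
mpirun -np 8 --map-by l3cache --bind-to l3cache -x OMP_NUM_THREADS=8 ./my_app

# Leave placement entirely to the kernel, i.e. rely on (1) instead of (2)
mpirun -np 8 --bind-to none -x OMP_NUM_THREADS=8 ./my_app
```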
Heterogeneous systems add another layer of complication:
- Best performance is achieved when CPU-GPU affinity is respected.
- UCX also performs affinity detection to make sure processes stay close to the HCA (some quick checks follow this list).
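For completeness, these are the kinds of checks I would use to confirm GPU/HCA locality before pinning (device names are placeholders for whatever the node actually has):

```sh
nvidia-smi topo -m                                 # GPU <-> CPU/NIC affinity matrix
hwloc-ls                                           # full hardware topology, incl. PCI devices
cat /sys/class/infiniband/mlx5_0/device/numa_node  # NUMA node of a given HCA
```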
For the benchmark, I expect the trend to be as follows (a rough run matrix is sketched after the list):
a. no auto-NUMA, --bind-to none (baseline)
b. auto-NUMA, --bind-to none
c. no auto-NUMA, Open MPI's affinity
d. auto-NUMA, Open MPI's affinity
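Putting (a)-(d) together, the run matrix would look roughly like the sketch below; kernel.numa_balancing and the binding flags are the real knobs, while `./my_app`, the rank count and the repeat count are placeholders:

```sh
#!/bin/sh
for numa_bal in 0 1; do
  sudo sysctl kernel.numa_balancing=$numa_bal
  for bind in "--bind-to none" "--map-by l3cache --bind-to l3cache"; do
    for run in 1 2 3 4 5; do        # repeat each case to capture run-to-run variation
      mpirun -np 8 $bind ./my_app
    done
  done
done
```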
Depending on the workload, (a) and (b) should show some run-to-run performance variation. (c) should be slightly better than (d) due to the absence of the kernel's profiling overhead.
I would appreciate any comments or insights you can share on this.
Thanks.