Kernel's NUMA balancing vs. OpenMPI's affinity #11357
Comments
I'd suggest first just finding out if the app moves at all during execution. So my suggestion is to write an app that detects its affinity and caches it right after it starts, then does some communication and computation while periodically detecting its affinity and comparing it to the initial one. If something changes, print out the new affinity and when it appeared. You can do this for all four of your cases. If nothing changes, then you can reduce your benchmark to just cases (a) and (b), plus one case with
Thanks for your comments. There is not much information discussing this feature of the kernel. The only two references I could find advise against using auto NUMA balancing:
An HPC application such as QE has different workloads during its course of operation, so I can understand that what the kernel applies to one workload will not be suitable for another.
Here NVIDIA advises tuning the process placement directly with
I gather that there is nothing to prevent the kernel from overruling Open MPI's affinity setting; hence the advice above. For now, I can write a toy model that allocates a large data set and monitors the page migration from
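One rough way to watch that page migration while such a toy model runs is to poll the kernel's NUMA counters in `/proc/vmstat` (e.g. `numa_pages_migrated`, `numa_hint_faults`); a minimal sketch, with counter names taken from the Linux kernel rather than anything Open MPI provides:

```c
/* Sketch: print the kernel's NUMA counters from /proc/vmstat.
 * Counters such as numa_hint_faults and numa_pages_migrated only advance
 * when automatic NUMA balancing is active; the names are kernel-provided. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *fp = fopen("/proc/vmstat", "r");
    char line[256];

    if (fp == NULL) {
        perror("fopen /proc/vmstat");
        return 1;
    }
    while (fgets(line, sizeof(line), fp) != NULL) {
        if (strncmp(line, "numa_", 5) == 0) {   /* keep only NUMA counters */
            fputs(line, stdout);
        }
    }
    fclose(fp);
    return 0;
}
```

Running this before and after the benchmark (or periodically during it) gives a rough view of how many pages the kernel actually migrated.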
I would advise against generalizing that NVIDIA advice - that cmd line locks all of your processes to the first NUMA node on each machine. I'm not sure why anyone would want to do that on a machine that has more than one NUMA domain, so I'm guessing that their container is configured somehow to ensure that makes sense. You usually achieve better results with
It would be a rather odd kernel that did so. All OMPI does is tell the kernel "schedule this process to execute only on the indicated CPU(s)". The kernel is supposed to honor that request. Even autonuma respects it:
Which is why you would want to ensure that you let

Note that there are times when you do want to let the kernel take over. We discussed this recently on another issue (see #11345). Outside of those circumstances, it is usually better to tell the kernel where the process should run.
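For illustration of what "telling the kernel where the process should run" amounts to underneath Open MPI's binding options, here is a minimal sketch using the Linux `sched_setaffinity(2)` call; the choice of CPU 0 is arbitrary, and Open MPI of course computes the real mask from your `--map-by`/`--bind-to` settings:

```c
/* Sketch: pin the calling process to a single CPU via sched_setaffinity(2).
 * CPU 0 is an arbitrary illustrative choice; Open MPI derives the actual
 * mask from its mapping/binding policy. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    CPU_SET(0, &mask);          /* request execution on CPU 0 only */

    if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
        perror("sched_setaffinity");
        return EXIT_FAILURE;
    }
    printf("pinned to CPU 0; the scheduler is expected to honor this mask\n");
    return EXIT_SUCCESS;
}
```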
Here is a simple code snippet you could periodically run:

```c
#define _GNU_SOURCE   /* for sched_getaffinity() and the CPU_* macros */
#include <assert.h>
#include <sched.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

void print_affinity() {
    cpu_set_t mask;
    long nproc, i;

    if (sched_getaffinity(0, sizeof(cpu_set_t), &mask) == -1) {
        perror("sched_getaffinity");
        assert(false);
    }
    nproc = sysconf(_SC_NPROCESSORS_ONLN);
    printf("sched_getaffinity = ");
    for (i = 0; i < nproc; i++) {
        printf("%d ", CPU_ISSET(i, &mask));
    }
    printf("\n");
}
```

Instead of printing it out, save the initial "mask" you get and then periodically run the code to check the newly returned value against the one you obtained at the beginning of execution. If they match, then you haven't moved. So something like this:

```c
/* "original_mask" is the cpu_set_t saved right after startup. */
extern cpu_set_t original_mask;

void check_affinity() {
    cpu_set_t mask;

    if (sched_getaffinity(0, sizeof(cpu_set_t), &mask) == -1) {
        perror("sched_getaffinity");
        assert(false);
    }
    /* cpu_set_t cannot be compared with !=; use the CPU_EQUAL() macro. */
    if (!CPU_EQUAL(&mask, &original_mask)) {
        printf("IT MOVED\n");
    }
}
```

You can print out the original vs. current if you like. Obviously, that is not something you want to do in an actual application - strictly a research tool to see what is happening.
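As a rough sketch of how those two helpers might be wired into a standalone test (the iteration count and the dummy work loop are placeholders, not anything taken from Open MPI):

```c
/* Sketch of a harness around the two helpers above. Save the affinity
 * mask right after startup, then recheck it periodically during "work". */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

cpu_set_t original_mask;            /* definition used by check_affinity() */

void print_affinity(void);          /* from the snippets above */
void check_affinity(void);

int main(void)
{
    /* Cache the affinity mask right after startup. */
    if (sched_getaffinity(0, sizeof(cpu_set_t), &original_mask) == -1) {
        perror("sched_getaffinity");
        return EXIT_FAILURE;
    }
    print_affinity();

    for (int iter = 0; iter < 100; iter++) {
        /* ... do some communication/computation here ... */
        check_affinity();           /* flag any change from the initial mask */
    }
    return EXIT_SUCCESS;
}
```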
Pardon me for not being clear regarding the NGC container. The aforementioned command is just an example using 1 GPU.
which translates to

Among the benchmarks, the performance of HPL is most sensitive to the NUMA affinity setting. The HGX variant has a very unusual one, where the 8 GPUs are mapped to NUMA domains 7, 7, 5, 5, 3, 3, 1, 1 on AMD EPYC. I believe Open MPI supports this via
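To cross-check that kind of GPU-to-NUMA mapping on the node itself, a small sketch using hwloc (the topology library Open MPI itself relies on) can list each NUMA domain and its local CPUs; the GPU side of the mapping would come from something like `nvidia-smi topo -m`, which is outside this sketch. Build with `-lhwloc`.

```c
/* Sketch: enumerate NUMA nodes and their local CPUs with hwloc,
 * to be compared against the reported GPU-to-NUMA-domain mapping. */
#include <hwloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    int n = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NUMANODE);
    for (int i = 0; i < n; i++) {
        hwloc_obj_t node = hwloc_get_obj_by_type(topo, HWLOC_OBJ_NUMANODE, i);
        char *cpus;
        hwloc_bitmap_asprintf(&cpus, node->cpuset);
        printf("NUMA node %u: CPUs %s\n", node->os_index, cpus);
        free(cpus);
    }
    hwloc_topology_destroy(topo);
    return 0;
}
```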
Yes, besides HPL, other HPC applications such as VASP and QE do show some performance gain when binding to a socket, in cases where there are multiple GPUs per socket.
It was an oversight on my part. Thanks for pointing this out. It makes sense for the kernel to respect the user's request. Thanks for the sample code. We will experiment according to your suggestions.
Hi,
I'm trying to figure out the interaction between automatic NUMA balancing and OpenMPI's process affinity implementation.
I would like to deduce the potential outcomes before carrying out actual benchmarks.
As I understand it:

Considering Zen 3, where the 8 cores in a CCX share an L3 cache, `--map-by l3cache` can ensure maximal data locality between threads per the first-touch policy. In such a case: `--bind-to none`.

A heterogeneous system adds another layer of complication.
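To make the first-touch point above concrete, here is a small sketch (my own illustration, not from the thread) of letting the worker threads initialize the data they will later use, so pages land on each thread's local NUMA node; it assumes OpenMP (compile with e.g. `gcc -fopenmp`) and the default first-touch page placement on Linux.

```c
/* Sketch: first-touch placement. Pages of "a" are faulted in (and therefore
 * placed) on the NUMA node of the thread that first writes them, so having
 * the worker threads do the initialization keeps later accesses local.
 * The array size is arbitrary. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const size_t n = 1 << 24;                 /* ~16M doubles, illustrative */
    double *a = malloc(n * sizeof(double));
    if (a == NULL) return EXIT_FAILURE;

    /* First touch: each thread initializes the chunk it will work on. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++) {
        a[i] = 0.0;
    }

    /* Later work uses the same static schedule, so accesses stay local. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++) {
        a[i] += 1.0;
    }

    printf("a[0] = %f\n", a[0]);
    free(a);
    return EXIT_SUCCESS;
}
```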
For the benchmark, the trend should be as follows:

a. no auto-NUMA, `--bind-to none` (baseline)
b. auto-NUMA, `--bind-to none`
c. no auto-NUMA, Open MPI's affinity
d. auto-NUMA, Open MPI's affinity

Depending on the workload, (a) and (b) should show some performance variation across repeated runs. (c) would be slightly better than (d) due to the absence of profiling overhead from the kernel.
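For cases (a) through (d), one way to confirm which auto-NUMA state a given run actually had is to read the standard Linux sysctl before starting the benchmark; a minimal sketch (toggling the knob is normally done with `sysctl kernel.numa_balancing` or by writing the file directly, which requires root):

```c
/* Sketch: report whether automatic NUMA balancing is currently enabled.
 * Reads the standard Linux sysctl file; 0 = disabled, nonzero = enabled. */
#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("/proc/sys/kernel/numa_balancing", "r");
    int enabled = -1;

    if (fp == NULL) {
        perror("fopen /proc/sys/kernel/numa_balancing");
        return 1;
    }
    if (fscanf(fp, "%d", &enabled) != 1) {
        fprintf(stderr, "could not parse numa_balancing value\n");
        fclose(fp);
        return 1;
    }
    fclose(fp);

    printf("automatic NUMA balancing is %s (value %d)\n",
           enabled ? "enabled" : "disabled", enabled);
    return 0;
}
```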
I would appreciate it if you could give some comments and share your insights into this matter.
Thanks.