
Kernel's NUMA balancing vs. OpenMPI's affinity  #11357

Open
@vitduck

Description


Hi,

I'm trying to figure out the interaction between automatic NUMA balancing and Open MPI's process affinity implementation.
I would like to reason about the likely outcome before carrying out actual benchmarks.

As I understand:

  1. The kernel attempts to maximize data locality by periodically profiling memory access patterns and migrating threads/pages accordingly.
  2. By default, Open MPI binds processes to core (n <= 2) or to socket (n > 2). Processes can also be mapped to various levels of the hardware hierarchy, such as socket, numa, l1cache, l2cache and l3cache (see the commands sketched after this list).
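
For context, both mechanisms can be inspected directly. A minimal sketch, assuming a Linux host and Open MPI's mpirun; the rank count and ./app are placeholders:

```shell
# (1) Automatic NUMA balancing: 1 = enabled, 0 = disabled (root needed to change it)
cat /proc/sys/kernel/numa_balancing
sudo sysctl kernel.numa_balancing=0

# (2) Open MPI's default mapping/binding, printed per rank at launch
mpirun -np 16 --report-bindings ./app
```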

Consider Zen 3, where the 8 cores in a CCX share an L3 cache: --map-by l3cache can ensure maximal data locality between threads under the first-touch policy. In such a case:

  • (1) becomes redundant at best and OS noise at worst.
  • (1) makes sense when (2) is disabled with --bind-to none (a minimal sketch follows this list).
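
A minimal sketch of the per-CCX placement I have in mind, assuming Open MPI 4.x syntax; the rank count and ./app are placeholders:

```shell
# One rank per L3 cache (CCX), bound to the cores that share that L3
mpirun -np 8 --map-by l3cache --bind-to l3cache --report-bindings ./app

# Same mapping, but unbound, leaving the kernel's NUMA balancer free to migrate
mpirun -np 8 --map-by l3cache --bind-to none ./app
```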

Heterogeneous systems add another layer of complication:

  • Best performance is achieved when CPU-GPU affinity is respected.
  • UCX also has affinity detection to make sure processes stay close to the HCA (see the topology commands after this list).
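
The relevant topology can be checked with standard tools (assuming hwloc, numactl, UCX and NVIDIA GPUs are installed; adjust for other hardware):

```shell
lstopo --no-io        # hwloc: sockets, L3/CCX groups, NUMA nodes
numactl --hardware    # NUMA node sizes and distance matrix
nvidia-smi topo -m    # GPU <-> NIC <-> CPU affinity matrix (NVIDIA only)
ucx_info -d           # devices visible to UCX, including the HCA
```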

For the benchmark, I expect the trend to be as follows:
a. no auto-NUMA, --bind-to none as the baseline
b. auto-NUMA, --bind-to none
c. no auto-NUMA, OpenMPI's affinity
d. auto-NUMA, OpenMPI's affinity

Depending on the workload, (a) and (b) should show some performance variation across repeated runs. (c) should be slightly better than (d) due to the absence of the kernel's profiling overhead. A sketch of these four configurations is given below.
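
A rough sketch of how I plan to run the four configurations; the benchmark binary ./bench, the rank count and the l3cache binding are placeholders chosen for Zen 3:

```shell
#!/bin/bash
# Toggle automatic NUMA balancing, then run with and without Open MPI's affinity
for numa in 0 1; do
    sudo sysctl kernel.numa_balancing=$numa

    # (a)/(b): no explicit affinity, kernel free to migrate threads/pages
    mpirun -np 16 --bind-to none ./bench

    # (c)/(d): Open MPI affinity, e.g. one rank per L3 cache on Zen 3
    mpirun -np 16 --map-by l3cache --bind-to l3cache --report-bindings ./bench
done
```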

I would appreciate any comments and insights you can share on this matter.

Thanks.
