Large message collective performance drops when using coll/han #9062

Open · hjelmn opened this issue Jun 14, 2021 · 3 comments

hjelmn commented Jun 14, 2021

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

4.1.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Built from source

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running

  • Operating system/version: CentOS Linux 7.9
  • Computer hardware: Intel(R) Xeon(R) CPU
  • Network type: 50 GigE

Details of the problem

I am working to tune Open MPI on a new system type. By default coll/tuned is being selected and is giving so-so performance:

mpirun --mca btl_vader_single_copy_mechanism none --mca coll_base_verbose 0 --hostfile /shared/hostfile.ompi -n 128 -N 16 --bind-to core ./osu_allreduce
App launch reported: 9 (out of 9) daemons - 112 (out of 128) procs

# OSU MPI Allreduce Latency Test v5.7.1
# Size       Avg Latency(us)
4                     599.60
8                     321.18
16                    481.45
32                    483.63
64                    483.59
128                   567.62
256                   472.35
512                   431.96
1024                  609.19
2048                  288.70
4096                  355.52
8192                  425.21
16384                 546.61
32768                 739.76
65536                1501.53
131072               2027.41
262144               1015.34
524288               1328.23
1048576              2101.48

The large-message numbers look OK, but the small-message latency is not great.

Forcing coll/han makes small messages look much better, but at a huge cost to large-message performance:

mpirun --mca btl_vader_single_copy_mechanism none --mca coll_base_verbose 0 --hostfile /shared/hostfile.ompi -n 128 -N 16 --bind-to core --mca coll_han_priority 100 ./osu_allreduce
App launch reported: 9 (out of 9) daemons - 112 (out of 128) procs

# OSU MPI Allreduce Latency Test v5.7.1
# Size       Avg Latency(us)
4                     111.77
8                     112.46
16                    111.98
32                    233.86
64                    198.94
128                   321.43
256                   286.42
512                   212.69
1024                  305.23
2048                  257.34
4096                  332.50
8192                  317.34
16384                 359.07
32768                 432.56
65536                 729.18
131072               1102.87
262144               1801.27
524288               3301.01
1048576              6245.48

Is this expected? Another MPI on the system is getting 74us for the small messages (below 1k) and 1400us for 1MB messages.
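
As an aside, the full set of coll/han tunables on a given build can be listed with ompi_info. Something along these lines should work, though the exact parameter set varies by version:

ompi_info --level 9 --param coll han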

hjelmn commented Jun 14, 2021

I should have added that this is with 8 nodes (128 ranks at 16 per node).

hjelmn commented Jun 14, 2021

--mca coll_han_use_simple_allreduce true helps somewhat:

App launch reported: 9 (out of 9) daemons - 112 (out of 128) procs

# OSU MPI Allreduce Latency Test v5.7.1
# Size       Avg Latency(us)
4                      99.16
8                      99.97
16                    100.25
32                    101.31
64                    105.44
128                   105.94
256                   112.12
512                   113.90
1024                  121.76
2048                  136.25
4096                  142.46
8192                  223.59
16384                 248.33
32768                 292.69
65536                 401.75
131072                845.64
262144               1584.11
524288               2993.50
1048576              5885.93
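
For reference, the full command line here is presumably the previous invocation plus the extra flag, i.e. something along the lines of:

mpirun --mca btl_vader_single_copy_mechanism none --mca coll_base_verbose 0 --hostfile /shared/hostfile.ompi -n 128 -N 16 --bind-to core --mca coll_han_priority 100 --mca coll_han_use_simple_allreduce true ./osu_allreduce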

bosilca added this to the v5.0.0 milestone on Jun 17, 2021

gkatev commented Mar 14, 2022

For whatever it may be worth, I recently ran some HAN benchmarks and saw a great improvement in large-message Allreduce latency by increasing the segment size (MCA parameter coll_han_allreduce_segsize) from the default 64K to 1M (with the non-simple implementation).
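
Applied to the runs above, that would look something like the following (1M = 1048576 bytes; the exact invocation and the combination with coll_han_priority are assumed here, based on the earlier commands in this thread):

mpirun --mca coll_han_priority 100 --mca coll_han_allreduce_segsize 1048576 --hostfile /shared/hostfile.ompi -n 128 -N 16 --bind-to core ./osu_allreduce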

jsquyres modified the milestone: v5.0.0 → v5.0.1 on Oct 30, 2023
janjust modified the milestone: v5.0.1 → v5.0.2 on Jan 8, 2024
jsquyres modified the milestone: v5.0.2 → v5.0.3 on Feb 13, 2024