Large message collective performance drops when using coll/han #9062

Open · hjelmn opened this issue Jun 14, 2021 · 3 comments

hjelmn commented Jun 14, 2021

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

4.1.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Built from source

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running

  • Operating system/version: CentOS Linux 7.9
  • Computer hardware: Intel(R) Xeon(R) CPU
  • Network type: 50 GigE

Details of the problem

I am working to tune Open MPI on a new system type. By default coll/tuned is being selected and is giving so-so performance:

mpirun --mca btl_vader_single_copy_mechanism none --mca coll_base_verbose 0 --hostfile /shared/hostfile.ompi -n 128 -N 16 --bind-to core ./osu_allreduce
App launch reported: 9 (out of 9) daemons - 112 (out of 128) procs

# OSU MPI Allreduce Latency Test v5.7.1
# Size       Avg Latency(us)
4                     599.60
8                     321.18
16                    481.45
32                    483.63
64                    483.59
128                   567.62
256                   472.35
512                   431.96
1024                  609.19
2048                  288.70
4096                  355.52
8192                  425.21
16384                 546.61
32768                 739.76
65536                1501.53
131072               2027.41
262144               1015.34
524288               1328.23
1048576              2101.48

The large-message numbers look OK, but the small-message latency is not great.

Forcing coll/han makes small messages look much better, but at a huge cost to large-message performance:

mpirun --mca btl_vader_single_copy_mechanism none --mca coll_base_verbose 0 --hostfile /shared/hostfile.ompi -n 128 -N 16 --bind-to core --mca coll_han_priority 100 ./osu_allreduce
App launch reported: 9 (out of 9) daemons - 112 (out of 128) procs

# OSU MPI Allreduce Latency Test v5.7.1
# Size       Avg Latency(us)
4                     111.77
8                     112.46
16                    111.98
32                    233.86
64                    198.94
128                   321.43
256                   286.42
512                   212.69
1024                  305.23
2048                  257.34
4096                  332.50
8192                  317.34
16384                 359.07
32768                 432.56
65536                 729.18
131072               1102.87
262144               1801.27
524288               3301.01
1048576              6245.48

Is this expected? Another MPI on the system is getting 74us for the small messages (below 1k) and 1400us for 1MB messages.
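
As an aside, the full set of coll/han tunables on a given build can be listed with ompi_info. Something along these lines should work, though the exact parameter set varies by version:

ompi_info --level 9 --param coll han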

hjelmn commented Jun 14, 2021

I should have added that this is with 8 nodes (128 ranks at 16 per node).

hjelmn commented Jun 14, 2021

--mca coll_han_use_simple_allreduce true helps somewhat:

App launch reported: 9 (out of 9) daemons - 112 (out of 128) procs

# OSU MPI Allreduce Latency Test v5.7.1
# Size       Avg Latency(us)
4                      99.16
8                      99.97
16                    100.25
32                    101.31
64                    105.44
128                   105.94
256                   112.12
512                   113.90
1024                  121.76
2048                  136.25
4096                  142.46
8192                  223.59
16384                 248.33
32768                 292.69
65536                 401.75
131072                845.64
262144               1584.11
524288               2993.50
1048576              5885.93
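
For reference, the full command line here is presumably the previous invocation plus the extra flag, i.e. something along the lines of:

mpirun --mca btl_vader_single_copy_mechanism none --mca coll_base_verbose 0 --hostfile /shared/hostfile.ompi -n 128 -N 16 --bind-to core --mca coll_han_priority 100 --mca coll_han_use_simple_allreduce true ./osu_allreduce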

bosilca added this to the v5.0.0 milestone on Jun 17, 2021

gkatev commented Mar 14, 2022

For whatever it may be worth, I recently ran some HAN benchmarks and saw a great improvement in large-message Allreduce latency by increasing the segment size (MCA parameter coll_han_allreduce_segsize) from the default 64K to 1M (with the non-simple implementation).
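
Applied to the runs above, that would look something like the following (1M = 1048576 bytes; the exact invocation and the combination with coll_han_priority are assumed here, based on the earlier commands in this thread):

mpirun --mca coll_han_priority 100 --mca coll_han_allreduce_segsize 1048576 --hostfile /shared/hostfile.ompi -n 128 -N 16 --bind-to core ./osu_allreduce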

jsquyres modified the milestone: v5.0.0 → v5.0.1 on Oct 30, 2023
janjust modified the milestone: v5.0.1 → v5.0.2 on Jan 8, 2024
jsquyres modified the milestone: v5.0.2 → v5.0.3 on Feb 13, 2024