Performance issues when using oversubscription #10426
@dalcinl I don't remember any changes this year that could cause this, but I had a patch a while back that added an option to control how MPI processes yield the CPU when idle. The default is still …
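For context, a minimal sketch of how such a yield control is typically applied at launch time. `mpi_yield_when_idle` is a long-standing Open MPI MCA parameter; whether it is the exact knob added by the patch mentioned above is not stated in this thread, and the test command is illustrative:

```sh
# Hedged example: ask idle MPI processes to yield the CPU instead of
# busy-polling. mpi_yield_when_idle is a long-standing Open MPI MCA
# parameter; the nanosleep interval mentioned later in the thread may
# be controlled by a different, newer knob.
mpirun --mca mpi_yield_when_idle 1 -np 5 python test/runtests.py
```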
I am also not aware of anything that went in that would cause such a regression in performance. This seems alarming, though, and something that should be tracked down.
@devreal My main concern is that I made absolutely no change to the way I configure Open MPI (either at compile time or at runtime), and yet I get this huge performance regression. Perhaps it is related to some change in GitHub Actions and how the runners are set up or configured. In the meantime, I'll try your suggestion and report back on the outcome.
@devreal I tried your suggestion (with the default nanosleep time), but it had no effect.
@devreal @awlauria After manually bisecting the issue 😓, I can tell for sure the regression is within Open MPI. The last good commit is 3b4d64f [logs]; running the mpi4py testsuite on 5 MPI processes takes under 3 minutes. The regression was introduced by the merge of #9097 (MPI Sessions PR) in commit 7291361 [logs] (run cancelled after 9 minutes on 3 MPI processes). @hppritcha From your recent interactions in other issues I submitted, I believe you may want to get involved in this one.
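For readers unfamiliar with the workflow, one way to reproduce this kind of manual bisection with `git bisect`; the endpoint commits come from the comment above, while the build and test commands are illustrative assumptions:

```sh
# Bisection sketch: 3b4d64f (last good) and 7291361 (first bad) come
# from this thread; the build/test commands are illustrative.
git bisect start
git bisect bad 7291361    # first known-bad commit (sessions merge)
git bisect good 3b4d64f   # last known-good commit
# Repeat until git bisect converges on the offending commit:
./autogen.pl && ./configure --prefix="$HOME/ompi" && make -j install
timeout 600 mpiexec -n 5 python test/runtests.py \
  && git bisect good || git bisect bad
git bisect reset          # done: restore the original checkout
```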
Thanks for checking. Let me see if I can reproduce that on my machine.
Looking into this.
Looks like a piece of 2b335ed disappeared with the sessions support.
The sessions-related commit 7291361 inadvertently removed a bit of commit 2b335ed. Put it back in. Leave a chatty string to help with testing; this will be removed before merging. Related to issue open-mpi#10426. Signed-off-by: Howard Pritchard <[email protected]>
@devreal Looks like your suggestion was not enough to work around the issue. I'm just trying to figure out why.
Yes, I was assuming that this was set automatically, but it looks like it isn't. Have you tried setting …
I think the intent of 2b335ed was to avoid forcing the user to set additional OMPI MCA parameters if PRTE knows that the OMPI app is being run in an oversubscribed manner.
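Absent that automatic detection, oversubscription can also be requested explicitly on the launch command line; a hedged sketch, since the exact option spellings vary across Open MPI releases and the test command is illustrative:

```sh
# Explicitly request oversubscribed mapping from the launcher. The
# modifier spelling below is the PRRTE/Open MPI 5.x form; the older
# spelling is the standalone --oversubscribe flag.
mpirun --map-by :OVERSUBSCRIBE -np 5 python test/runtests.py
```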
@hppritcha As I commented in #10428, your patch worked fine. My question to Joseph was unrelated to your fix; I just wanted to know why his suggestion did not work.
The sessions-related commit 7291361 inadvertently removed a bit of commit 2b335ed. Put it back in. Leave a chatty string to help with testing; this will be removed before merging. Related to issue open-mpi#10426. Signed-off-by: Howard Pritchard <[email protected]> (cherry picked from commit f4156d3)
Merged to main and v5.0.x; closing.
I've been testing mpi4py against `ompi/main` using an automated schedule that runs weekly on GitHub Actions. The GitHub-hosted runners have only two virtual cores, but I run tests with up to 5 MPI processes by turning on oversubscription (setting `rmaps_default_mapping_policy = :oversubscribe` within `$HOME/.prte/mca-params.conf`). The current workflow file is here.
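For concreteness, that setting can be persisted in PRRTE's per-user parameter file; a minimal sketch restating exactly the configuration quoted above:

```sh
# Persist the oversubscription policy in PRRTE's per-user MCA parameter
# file, so mpiexec can place 5 ranks on a 2-core runner.
echo 'rmaps_default_mapping_policy = :oversubscribe' >> "$HOME/.prte/mca-params.conf"
```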
Up to January 29, the testsuite used to take around 4m 30s to run to completion [link].
Since February 5, the same testsuite needs around 1h 20m to finish [link].
Unfortunately, the actual logs are long gone, but you can still see the elapsed times (look for the `Test mpi4py (np=5)` line).
Is this a regression or some expected change in behavior?
Is there a new configuration option I'm missing that would allow me to go back to the previous behavior?