shared memory segments not cleaned up by vader btl after program aborted #6322
Comments
When you say spawn, do you mean...
I meant...
It's odd. I just tested v3.1.3 and this also happens...
Hello all, I've updated the post to include a short MWE to reproduce the issue, which I now believe is a bug. Please take a look. Thanks.
Update: an MWE written in C also reproduces the error.
Just wanna ring a bell and see if anyone could reproduce this. Thanks.
A fix for vader (shared memory) cleanup just recently went in on the v4.0.x branch (but didn't make v4.0.1). Can you test any recent nightly snapshot on the v4.0.x branch and see if the problem has been resolved for you?
@jsquyres Thanks for your reply and sorry for the long silence. I just tested both of the latest released versions, v3.1.4 and v4.0.1. I think the fix has appeared in v4.0.1 but not in v3 yet. Am I right that #6550 is the fix for this issue? Will it be backported to v3.1 at some point? Thanks, and please feel free to close this issue.
Background information
What version of Open MPI are you using?
v3.1.2, v3.1.3
Describe how Open MPI was installed
We have an internally managed Conda environment and we build our own Conda packages, including `openmpi`. (I think it was built from the tarball downloaded from the Open MPI website.) Then it was installed using `conda install openmpi`.
Please describe the system on which you are running
Details of the problem
If one spawns a few MPI processes, lets them do some work, but terminates them abnormally (`ctrl-C` and whatnot), it can be seen that in `/dev/shm/` there will be shared memory segments related to the `vader` component that are not unlinked by Open MPI during the cleanup phase. I know that we didn't have this issue with v3.1.1, and I'll test v3.1.3 later (UPDATE: 3.1.3 also has this problem). For now I just need to know if this is a known bug with v3.1.2 so that I can avoid this version in our Conda settings. Thanks!
UPDATE: a minimal working example in Python is provided below. Run it and terminate it as mentioned above during the 30-second sleep. Note that if `-n 1` is used, no residual segment would be left in `/dev/shm` even with an abnormal abort. I double-checked that in the `3.1.x` series this problem only happens for 3.1.2 and 3.1.3.
UPDATE 2: an identical MWE in C also reproduces the issue.
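For illustration only (this is not the exact MWE from the report), a minimal sketch of this kind of C reproducer, assuming nothing beyond standard MPI calls and the 30-second sleep described above, might look like:

```c
/* Illustrative sketch of a reproducer, not the original MWE:
 * start a few ranks, let them sleep long enough to be interrupted
 * with Ctrl-C, then inspect /dev/shm/ for leftover vader segments. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("rank %d of %d: sleeping for 30 seconds, interrupt now...\n",
           rank, size);
    fflush(stdout);
    sleep(30);  /* window in which to hit Ctrl-C */

    MPI_Finalize();
    return 0;
}
```

Building with `mpicc mwe.c -o mwe`, launching with more than one rank (e.g. `mpirun -n 2 ./mwe`, since the report notes that `-n 1` leaves nothing behind), interrupting during the sleep, and then listing `/dev/shm/` should show whether any vader-related segments remain.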