Intermittent failures possibly related to name conflicts in leftover Vader shared memory segment files #7308
Comments
I'm wondering if there is a way to avoid the file name conflicts as long as issue #7220 is not resolved?
We are also seeing similar issues (leftover memory segments created by another user are causing permission errors). Would one potential workaround be to include the user's UID in the filename?
@simonbyrne, we found that switching to Open MPI version 4 solved the issue for us, and seemed like the most future-proof solution anyway.
I'm struggling with this error on two different machines, and I'm seeing it with both 3.1.5 and 4.0.3. Is there any kind of workaround? Can I use a custom path instead of `/dev/shm`? I normally see these errors overnight, when a shared-resource compute server is running a lot of regression tests and the machine is potentially under heavy load.
AFAIK the problem isn't fixed on 3.x, so you can still have leftover files which will still cause problems with 4.x. If you want to support 3.x, the only solution is to clean up manually, either after each job or with a cron script. We ended up asking our admins to set up a script to clean up the leftover files: https://docs.hpc.udel.edu/technical/whitepaper/automated_devshm_cleanup#cleaning-up
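For illustration, the core of such a cleanup job can be sketched as a small standalone C tool (hypothetical, not the script behind the link above; the `vader_segment.` prefix match and the 24-hour age threshold are assumptions, and a production version should also check that no live job still owns a segment):

```c
/* Hypothetical sketch of a cron-driven cleaner for leftover vader
 * segment files in /dev/shm; not the script linked above. */
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    const char *shm = "/dev/shm";
    DIR *dir = opendir(shm);
    if (dir == NULL) {
        perror("opendir");
        return 1;
    }

    time_t now = time(NULL);
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        /* Only touch files that follow the vader naming scheme. */
        if (strncmp(entry->d_name, "vader_segment.", 14) != 0)
            continue;

        char path[PATH_MAX];
        snprintf(path, sizeof(path), "%s/%s", shm, entry->d_name);

        struct stat st;
        /* Assumed policy: a segment untouched for 24h has no live owner. */
        if (stat(path, &st) == 0 && now - st.st_mtime > 24 * 3600) {
            if (unlink(path) == 0)
                printf("removed %s\n", path);
            else
                perror(path);
        }
    }
    closedir(dir);
    return 0;
}
```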
Is this still happening in v4.0.5 and/or v4.1.0?
We are still seeing it, as there are some users on the cluster who use Open MPI v3, which leaves the files around, and these cause conflicts with v4.0.5. One solution, as I mentioned above, would be to add the user's UID to the filename.
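As a sketch of that suggestion (hypothetical naming scheme, not Open MPI's actual code): embedding the UID in the segment name would make cross-user collisions impossible even when two launchers happen to compute the same `jobid`.

```c
/* Hypothetical sketch of a UID-qualified segment name, as suggested
 * above; the name layout only mimics the vader scheme. */
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Build a segment path that cannot collide across users, even if two
 * jobs end up with the same jobid and local rank. */
static void segment_path(char *buf, size_t len, uint32_t jobid, int rank)
{
    snprintf(buf, len, "/dev/shm/vader_segment.%u.%x.%d",
             (unsigned) getuid(), jobid, rank);
}

int main(void)
{
    char path[PATH_MAX];
    segment_path(path, sizeof(path), 0x1234u, 0); /* example values */
    puts(path);
    return 0;
}
```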
Fixes open-mpi#7308
Signed-off-by: Ralph Castain <[email protected]>
(cherry picked from commit 88be263)
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v3.1.3
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
From a source tarball
Please describe the system on which you are running
Details of the problem
Hi,
We have seen intermittent failures of Open MPI with a "Permission denied" error when opening a Vader shared memory segment.
When rerunning the same command line, the run usually succeeds. There are known issues about Vader shared memory segments not getting cleaned up (#6322, #6547), which I think might be partly related to this issue.
I think the failure occurs because there is a `/dev/shm/vader_segment.*` file left over from another user which conflicts with the name of the `/dev/shm/vader_segment.*` file to be created/opened by the current user. From here, it seems like the only part of the shared segment file name which could differentiate different users and runs on the same host is the `OPAL_PROC_MY_NAME.jobid`. I couldn't find the place in the code where the `jobid` gets computed. Is it possible that there are conflicting `jobid`s between different users/runs, causing conflicting segment files and the observed "Permission denied" error? How can we avoid such failures on a multi-user system? I don't think that the fixes related to Vader cleanup will be sufficient, as there is still a way for Vader to leave files behind, namely when a job terminates abnormally, right?

Thanks,
Moritz
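To make the suspected failure mode concrete, here is a minimal C sketch (the path and flags only approximate what the vader BTL does): if the backing file already exists and is owned by another user with mode 0600, `open()` fails with `EACCES`, which surfaces as the "Permission denied" error described above.

```c
/* Sketch of the failure mode: a leftover segment file created by
 * another user with mode 0600 makes this open() fail with EACCES,
 * i.e. "Permission denied".  The path is illustrative. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/dev/shm/vader_segment.example"; /* assumed name */

    /* When the file already exists, O_CREAT is a no-op and the kernel
     * checks the existing file's permissions instead: another user's
     * 0600 file yields errno == EACCES. */
    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0) {
        fprintf(stderr, "open(%s) failed: %s\n", path, strerror(errno));
        return 1;
    }
    close(fd);
    unlink(path); /* remove our own test file */
    return 0;
}
```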