Intermittent failures possibly related to name conflicts in leftover Vader shared memory segment files #7308

Closed
mkre opened this issue Jan 15, 2020 · 8 comments

@mkre

mkre commented Jan 15, 2020

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v3.1.3

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From a source tarball

Please describe the system on which you are running

  • Operating system/version: CentOS 7.5
  • Computer hardware: Intel Xeon
  • Network type: N/A

Details of the problem

Hi,

We have seen intermittent failures of Open MPI with the following error message:

--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have.  It is likely that your MPI job will now either abort or
experience performance degradation.

  Local host:  capri02
  System call: open(2) 
  Error:       Permission denied (errno 13)
--------------------------------------------------------------------------

When rerunning the same command line, the run usually succeeds. There are known issues about Vader shared memory segments not getting cleaned up (#6322, #6547), which I think might be partly related to this issue.

I think the failure occurs because a /dev/shm/vader_segment.* file left over from another user conflicts with the name of a /dev/shm/vader_segment.* file to be created/opened by this user. From here, it seems that the only part of the shared segment file name which could differentiate different users and runs on the same host is the OPAL_PROC_MY_NAME.jobid. I couldn't find the place in the code where the jobid gets computed.

Is it possible that different users/runs end up with conflicting jobids, causing conflicting segment file names and the observed "Permission denied" error? How can we avoid such failures on a multi-user system? I don't think the fixes related to Vader cleanup will be sufficient, as there is still a way for Vader to leave files behind, namely when a job terminates abnormally, right?
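
A minimal sketch of the suspected failure mode (the segment file name and the exact open(2) flags below are illustrative assumptions, not copied from the Open MPI sources): if a previous run by a different user left a vader_segment file behind with mode 0600, a later open(2) of the same name by this user fails with EACCES, which matches the "Permission denied (errno 13)" message above.

    import errno
    import os

    # Hypothetical leftover segment name; the real name encodes the jobid.
    segment = "/dev/shm/vader_segment.capri02.1234abcd.0"

    try:
        # Roughly what happens when the BTL creates/attaches its backing file:
        # O_CREAT without O_EXCL opens the existing file, and if that file is
        # owned by another user with mode 0600, the open fails with EACCES.
        fd = os.open(segment, os.O_CREAT | os.O_RDWR, 0o600)
        os.close(fd)
    except OSError as e:
        if e.errno == errno.EACCES:
            print("Permission denied (errno 13): name collision with a leftover segment")
        else:
            raise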

Thanks,
Moritz

@mkre
Author

mkre commented Feb 17, 2020

I'm wondering if there is a way to avoid the file name conflicts as long as issue #7220 is not resolved?

@simonbyrne
Contributor

We are also seeing similar issues (leftover memory segments created by another user are causing permission errors).

One potential workaround would be to include the user's UID in the filename?
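
For illustration, a rough sketch of that naming idea (not Open MPI's actual scheme; the format string is made up): including the calling user's UID in the segment name makes a collision with another user's leftover segment impossible, even if a jobid is reused.

    import os
    import socket

    def segment_name(jobid: int, rank: int) -> str:
        # Hypothetical name format: host, uid, jobid and local rank.
        return "/dev/shm/vader_segment.{host}.{uid}.{jobid:x}.{rank}".format(
            host=socket.gethostname(), uid=os.getuid(), jobid=jobid, rank=rank)

    print(segment_name(0x1234abcd, 0))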

@mkre
Author

mkre commented Jul 28, 2020

@simonbyrne, we found that switching to Open MPI version 4 solved the issue for us, and it seemed like the most future-proof solution anyway.

@KineticTheory

I'm struggling with this error on two different machines, and I'm seeing it with both 3.1.5 and 4.0.3. Is there any kind of workaround? Can I use a custom path instead of /dev/shm for shared memory?

I normally see these errors overnight, when a shared-resource compute server is running a lot of regression tests and the machine is potentially under heavy load.
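
One hedged possibility, worth verifying against your build with ompi_info --all before relying on it: the vader BTL appears to expose a btl_vader_backing_directory MCA parameter that relocates the backing files away from /dev/shm. A sketch of launching a job with it set through the standard OMPI_MCA_* environment-variable convention; the target directory and application name are placeholders:

    import os
    import subprocess

    env = dict(os.environ)
    # Hypothetical per-user directory for the vader backing files.
    env["OMPI_MCA_btl_vader_backing_directory"] = "/tmp/vader-{}".format(os.getuid())

    # Placeholder command line; replace with your actual job.
    subprocess.run(["mpirun", "-np", "4", "./my_app"], env=env, check=True)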

@simonbyrne
Contributor

simonbyrne commented Oct 27, 2020

AFAIK the problem isn't fixed on 3.x, so you can still have leftover files, which will still cause problems with 4.x. If you want to support 3.x, the only solution is to clean up manually, either after each job or with a cron script. We ended up asking our admins to set up a script to clean up the leftover files: https://docs.hpc.udel.edu/technical/whitepaper/automated_devshm_cleanup#cleaning-up
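
For reference, a minimal sketch in the spirit of such a cron cleanup (assumptions: it runs as root, and any /dev/shm/vader_segment.* file untouched for longer than AGE_LIMIT belongs to a job that is no longer running; both need tuning per site, and a production script should also skip files still held open by running processes):

    import os
    import time

    SHM_DIR = "/dev/shm"
    AGE_LIMIT = 24 * 3600  # one day; arbitrary, site-specific threshold

    now = time.time()
    for name in os.listdir(SHM_DIR):
        if not name.startswith("vader_segment."):
            continue
        path = os.path.join(SHM_DIR, name)
        try:
            if now - os.stat(path).st_mtime > AGE_LIMIT:
                os.unlink(path)
                print("removed stale segment:", path)
        except FileNotFoundError:
            pass  # removed concurrently by the owning job or another cleanup pass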

@jsquyres
Member

Is this still happening in v4.0.5 and/or v4.1.0?

@simonbyrne
Contributor

We are still seeing it, as some users on the cluster use Open MPI v3, which leaves the files around, and these cause conflicts with v4.0.5. One solution, as I mentioned above, would be to add the user's UID to the filename.

artpol84 pushed a commit to artpol84/ompi that referenced this issue Apr 20, 2021
Fixes open-mpi#7308

Signed-off-by: Ralph Castain <[email protected]>
(cherry picked from commit 88be263)
@hppritcha
Member

Closed via #8802 and #8804. This is unlikely to be fixed on the old 3.1.x and 3.0.x release branches.
