System call failure: unlink during MPI_Finalize() #9905
@hjelmn @bwbarrett @hppritcha Is there any way that this is related to #9868 / #9880? I know that those are about XPMEM and unrelated to Solaris, but is there a larger issue in v4.1.x:vader/master:sm cleanup? |
I do not think so as no signals (SIGBUS or whatever) were observed in any of the processes. |
@afborchert could you add the |
Sure:
I've added a run with “-d” as well:
|
I'm having difficulties reproducing this. I do observe the "session dir does not exist" messages from mpirun when using the -d option. These appear to be harmless. Does this problem appear to be specific to your Solaris system? |
I'm getting nearly identical results on macOS 12.2, with Open MPI 4.1.2 installed via Homebrew. I'm using software called Bertini.
The program still produces its results, but I'm getting these errors from MPI. Two additional pieces of information:
|
The issue with OS X is likely a different (and well known) one: the default temporary directory on macOS has a very long path name that Open MPI's session directory handling does not cope with well, so the usual workaround is to set TMPDIR to a short path (e.g. export TMPDIR=/tmp) before invoking mpirun. |
Indeed, on MacOS, this worked. Please pardon if my comments read as noise on this issue. Thank you very much! |
Was this solved for @afborchert? |
Oops -- I didn't notice that the "That fixed it!" notice wasn't from the original poster. My bad! @afborchert Can you chime in as to whether that fixed it for you? |
@ofloveandhate No, the close by @jsquyres was a surprise to me. However, I am not sure right now how I can help @hppritcha and possibly others to analyze this problem. Right now we have Open MPI 4.1.2 just on Solaris. The messages do indeed appear to be harmless but are, of course, annoying. The effect is intermittent, and attempts to reproduce it while running under truss (a system call tracer comparable to strace) have so far never succeeded. Please also note that the problem under Solaris is unrelated to the macOS problem. |
This problem is still present in v4.1.4. |
Might this be related to #11123? See the second item in the issue, "Issue 2, error unlinking mmap backing file". There I observed the backing file getting deleted before its time somewhere else in the code, leading to the intended unlink call failing. In my scenario I LD_PRELOADed remove/unlink wrappers to find where this was happening. If you can reproduce it, you could use something like this to find it. Snippet of what I had:
#define _GNU_SOURCE
#include <stdio.h>
#include <dlfcn.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
// gcc -shared -fPIC -o ul.so ul.c
int unlink(const char *path) {
int (*original)(const char *)
= dlsym(RTLD_NEXT, "unlink");
static int cnt = 0;
printf("==Unlink @ %s (%d)\n", path, ++cnt);
return original(path);
}
int remove(const char *path) {
int (*original)(const char *)
= dlsym(RTLD_NEXT, "remove");
static int cnt = 0;
printf("==Remove @ %s (%d)\n", path, ++cnt);
return original(path);
}
(And when I captured the phenomenon using these functions, I triggered a deliberate segfault to generate a backtrace.) |
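As a variation on the same interposition idea, one could print a backtrace right from the wrapped unlink() instead of forcing a segfault - a minimal sketch, assuming glibc's <execinfo.h> is available (on Solaris, printstack() from <ucontext.h> would be the closer equivalent); file names like ul_bt.c are just placeholders:
#define _GNU_SOURCE
#include <dlfcn.h>
#include <execinfo.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

// gcc -shared -fPIC -o ul_bt.so ul_bt.c
int unlink(const char *path) {
    int (*original)(const char *) = dlsym(RTLD_NEXT, "unlink");
    if (strstr(path, "vader")) {
        void *frames[32];
        int n = backtrace(frames, 32);                  /* collect return addresses */
        fprintf(stderr, "==Unlink @ %s called from:\n", path);
        backtrace_symbols_fd(frames, n, STDERR_FILENO); /* symbolize to stderr */
    }
    return original(path);
}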
This is a case of OMPI registering an "epilog" call with PMIx - in this case, to remove the shmem backing file - and then duplicating the action in OMPI. It doesn't matter to PMIx who does the operation - we're just doing it because OMPI asked us to do so. Easy solution is to either have OMPI not register the "epilog", or to remove the "unlink" operation from OMPI. |
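To see why such a duplicated cleanup surfaces as a mostly harmless error, here is a small standalone sketch (not Open MPI code; the path is hypothetical): whoever unlinks the backing file second just gets ENOENT.
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    const char *path = "/tmp/epilog_demo_segment";   /* hypothetical backing file */
    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd >= 0) {
        close(fd);
    }
    printf("first unlink  -> %d\n", unlink(path));   /* 0: removes the file */
    int rc = unlink(path);                           /* -1: the file is already gone */
    printf("second unlink -> %d (errno=%d: %s)\n", rc, errno, strerror(errno));
    return 0;
}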
I've struggled a little bit to reproduce the problem with a preloaded unlink wrapper as suggested by @gkatev, but was successful when I adapted it to postpone the output to an atexit handler:
The first number in each line is the pid, the second the counter. As it appears, the file /tmp/ompi.theon.120/pid.984/1/vader_segment.theon.120.4b930001.0 is unlinked by two processes, 984 and 985. In cases where the run completes without errors, each process appears to unlink just one vader_segment file, like here:
Here is the updated unlink wrapper:
#include <dlfcn.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
// gcc -shared -fPIC -o ul.so ul.c
static int initialized = 0;
enum { max_entries = 32, max_length = 128 };
static char paths[max_entries][max_length];
static int path_index = 0;
static void finish(void) {
if (path_index > 0) {
int pid = getpid();
fprintf(stderr, "unlinked files:\n");
for (int index = 0; index < path_index; ++index) {
fprintf(stderr, "[%5d, %2d] %s\n", pid, index, paths[index]);
}
}
}
static void record(const char* path) {
if (!initialized) {
atexit(finish);
initialized = true;
}
if (path_index < max_entries) {
strncpy(paths[path_index++], path, max_length);
}
}
int unlink(const char *path) {
record(path);
int (*original)(const char *) = dlsym(RTLD_NEXT, "unlink");
return original(path);
}
int remove(const char *path) {
record(path);
int (*original)(const char *) = dlsym(RTLD_NEXT, "remove");
return original(path);
} |
Now I've tried to get an abort and core dump as soon as one of the vader files gets unlinked with an error. Here is the output of such a case:
This is the updated version of the preloaded unlink wrapper:
#include <dlfcn.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
// gcc -shared -fPIC -o ul.so ul.c
static int initialized = 0;
enum { max_entries = 32, max_length = 128 };
static char paths[max_entries][max_length];
static int path_index = 0;
static void finish(void) {
if (path_index > 0) {
int pid = getpid();
fprintf(stderr, "unlinked files:\n");
for (int index = 0; index < path_index; ++index) {
fprintf(stderr, "[%5d, %2d] %s\n", pid, index, paths[index]);
}
}
}
static void record(const char* path) {
if (!initialized) {
atexit(finish);
initialized = true;
}
if (path_index < max_entries) {
strncpy(paths[path_index++], path, max_length);
}
}
int unlink(const char *path) {
int (*original)(const char *) = dlsym(RTLD_NEXT, "unlink");
int rval = original(path);
if (rval < 0) {
if (strstr(path, "vader")) {
fprintf(stderr, "[%d] unlink(\"%s\") -> %d (errno = %d)\n",
(int) getpid(), path, rval, errno);
fprintf(stderr, "what happened before:\n");
finish();
abort();
}
}
record(path);
return rval;
}
int remove(const char *path) {
record(path);
int (*original)(const char *) = dlsym(RTLD_NEXT, "remove");
return original(path);
} |
So (assuming I'm interpreting the output correctly) the file
The trace of 1524 looks about right? (right = where we expect unlink to be triggered). I'm assuming the path it takes is
So I would be curious where the other unlink, by 1522, took place. Could you trigger an abort when the specific vader segment gets unlinked, even if the call was successful? If you have a debug build you might also use |
Ok, I adapted the preloaded wrapper to abort whenever a vader segment file is unlinked by the parent process. Here are the results:
And here is the backtrace without line numbers, as Open MPI has been compiled without the "-g" flag:
And for documentation purposes, the updated preloaded wrapper:
#include <dlfcn.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
// gcc -shared -fPIC -o ul.so ul.c
static int initialized = 0;
enum { max_entries = 32, max_length = 128 };
static char paths[max_entries][max_length];
static int path_index = 0;
static int forked = 0;
static void finish(void) {
if (path_index > 0) {
int ppid = getppid();
int pid = getpid();
fprintf(stderr, "unlinked files:\n");
for (int index = 0; index < path_index; ++index) {
fprintf(stderr, "[%5d, %5d, %2d] %s\n", ppid, pid, index, paths[index]);
}
}
}
static void record(const char* path) {
if (!initialized) {
atexit(finish);
initialized = true;
}
if (path_index < max_entries) {
strncpy(paths[path_index++], path, max_length);
}
}
pid_t fork(void) {
pid_t (*original)(void) = dlsym(RTLD_NEXT, "fork");
++forked; /* remember that this process has called fork() */
return original();
}
int unlink(const char *path) {
int (*original)(const char *) = dlsym(RTLD_NEXT, "unlink");
int rval = original(path);
if (forked && strstr(path, "vader")) {
fprintf(stderr, "[%d, %d] unlink(\"%s\") -> %d (errno = %d)\n",
(int) getppid(), (int) getpid(), path, rval, errno);
if (path_index > 0) {
fprintf(stderr, "what happened before:\n");
finish();
}
abort();
}
record(path);
return rval;
}
int remove(const char *path) {
record(path);
int (*original)(const char *) = dlsym(RTLD_NEXT, "remove");
return original(path);
} |
So here is the logic behind the epilog registration. The "vader" component usually puts its backing file under the scratch session directory as specified by its local mpirun daemon, or by the resource manager. The session directory gets cleaned up by the local daemon at the end of the job.

I forget the root cause, but sometimes the "vader" component puts its backing file under /dev/shm instead. So we added a "hook" that allows "vader" to register its backing file for cleanup with the local daemon. The problem is that OMPI still attempts to unlink the backing file itself during finalize.

I believe OMPI retained the cleanup as an insurance policy - if the daemon fails, then the app will finalize and terminate itself. Without OMPI cleanup, the backing file won't be removed.

I don't know the correct answer - I suspect there is a decent argument that some annoyance over verbose unlink warnings is a reasonable price to pay for ensuring OMPI doesn't pollute system directories. However, I leave that for others to debate. I don't know if there is a way to silence the unlink warning. |
Thanks, this all makes sense. Perhaps this could be made into a soft error that does not show an error message to the user when unlink fails. Or not show an error message when it happens during finalization. Or not show an error message when it happens during finalization and errno is ENOENT.

Or, if the pmix-safety-net unlink happened strictly after the explicit fine-grained unlinks, to catch whatever they left behind, all of this wouldn't be a problem - but I imagine there are bigger reasons that dictate the finalization order (or lack thereof).

I believe the unlink messages come from here: ompi/opal/mca/shmem/mmap/shmem_mmap_module.c, lines 590 to 597 in ffb0adc
|
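A minimal sketch of the soft-error idea (a hypothetical helper, not the actual OMPI code): treat ENOENT as success, since it only means some other cleanup got there first, and keep reporting everything else.
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical cleanup helper: returns 0 if the file is gone, -1 on a real error. */
static int cleanup_backing_file(const char *seg_name) {
    if (0 == unlink(seg_name) || ENOENT == errno) {
        return 0;   /* removed now, or already removed elsewhere */
    }
    fprintf(stderr, "unlink(%s) failed: %s\n", seg_name, strerror(errno));
    return -1;
}

int main(void) {
    /* a path that does not exist: demonstrates the silent ENOENT case */
    return cleanup_backing_file("/tmp/hypothetical_vader_segment") ? 1 : 0;
}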
Precautions in regard to cleaning up files in /dev can at least be skipped for Solaris, as non-root users cannot create anything below /dev, and /dev/shm does not exist there. As far as I know, /dev/shm is a Linux-only feature. |
I suspect that is getting too targeted - imagine doing something like that for every edge case. |
I suspect it could be made into an optional warning message - perhaps only when verbosity has been turned up. All you would need to do is add a little wrapper around it:
if (-1 == unlink(ds_buf->seg_name)) {
if (1 < opal_output_get_verbosity(opal_shmem_base_framework.framework_output)) {
int err = errno;
char hn[OPAL_MAXHOSTNAMELEN];
gethostname(hn, sizeof(hn));
opal_show_help("help-opal-shmem-mmap.txt", "sys call fail", 1, hn,
"unlink(2)", ds_buf->seg_name, strerror(err), err);
return OPAL_ERROR;
}
return OPAL_SUCCESS;
} |
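With a guard like that, the message would presumably only appear once the shmem framework's verbosity is raised - e.g. via an MCA parameter along the lines of --mca shmem_base_verbose 100, assuming the usual <framework>_base_verbose naming convention - while normal runs would silently treat the failed unlink as success.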
I now have a backtrace with line numbers included:
|
Try wrapping the warning message as shown above. If that helps resolve the issue, then please file a PR with that change. It needs to go into the OMPI "main" branch first, and then get cherry-picked back to the v5 and v4.1 release branches. |
We've been running into this problem with certain systems during our nightly testing. The common thread seems to be systems running Ubuntu 22.04 LTS. I applied the patch suggested by @rhc54 to our Open MPI 4 builds, and this seems to have alleviated most of the issue. However, we are wondering if there could be a secondary effect in MPI applications that also use shared memory apart from Open MPI's use of shared memory. While testing the NWCHEM app with our Open MPI 4.1.5 build on an Ubuntu 22.04 LTS system, we see the following error messages when NWCHEM tries to clean up its shm segments:
I believe this uses the vader shm mechanism on a single node with multiple GPUs. I'm wondering if this could be tied to the shm cleanup behavior in Open MPI, where Open MPI is cleaning up the shm segments without the app being aware of it. Thanks in advance. |
This obviously isn't coming from OMPI, so I'm not sure I understand the title of this issue, since it wouldn't be coming from MPI_Finalize(). Are those shm files located in OMPI's session directory tree?

If so, then yes, you would get such error messages, as we clean up everything in that tree during "finalize". If you want to continue using a file in there, then you probably need to delay calling MPI_Finalize() until you are done with it. |
Sorry for the confusion - you are correct that the error messages are coming from NWCHEM and not OMPI. My question pertained to whether there could be an interaction between OMPI and NWCHEM involving shared memory, and it sounds like there is. This question came to me from one of our applications engineers, so I will pass this along back to them and see what they say. Thanks! |
Yeah, it sounds like they probably put their shmem files in our session directory tree - which is totally fine; that is somewhat the point of the session directory. However, the constraint on doing so is that you cannot "finalize" OMPI until you are done with those files, as we will clean them up. |
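A minimal sketch of that ordering constraint (hypothetical file name; the file is placed in /tmp purely for illustration, but the same ordering applies to anything an application keeps under OMPI's session directory tree): release and remove your own backing file before calling MPI_Finalize().
#include <mpi.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* hypothetical per-rank backing file */
    char path[64];
    snprintf(path, sizeof(path), "/tmp/app_shm_demo.%d", rank);
    int fd = open(path, O_CREAT | O_RDWR, 0600);
    /* ... mmap() the file, use the region, munmap() ... */

    if (fd >= 0) {
        close(fd);
        unlink(path);   /* clean up our own file first ... */
    }

    MPI_Finalize();     /* ... and only finalize afterwards */
    return 0;
}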
Point me to the text of the MPI standard that says NWChem is not allowed to create its own shm files in any directory it wants. OMPI should not delete shm files it didn't create. This is a bug and it needs to be fixed. |
Sure they can - nobody is restricting it. But we created the session directory tree and are therefore obligated to clean it up. We get harpooned any time we fail to fully remove all the directories. In the current version of You cannot please everyone - all you can do is try and explain the constraints 🤷♂️ |
Based on the source code of ComEx, it is using either |
No, and we don't do that - so those unlink problems are something in their code. I know nothing about what they are doing. If the files are not in our session directory tree, then we won't touch them. |
Background information
What version of Open MPI are you using?
4.1.2
Describe how Open MPI was installed
Downloaded from https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.2.tar.bz2, unpacked, and built with the following script:
Please describe the system on which you are running
Details of the problem
Intermittently, even the most simple MPI applications that are run locally with shared memory fail at MPI_Finalize() with errors like the following:
This happens even for the most trivial test programs, like the following:
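A stand-in for such a trivial test - nothing more than an init/finalize pair - would be:
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    MPI_Finalize();   /* the unlink warnings show up intermittently here */
    return 0;
}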
Just run mpirun multiple times and it will eventually fail:
We had the very same problem with Open MPI 4.1.1. We had no such problems with Open MPI 2.1.6.