munmap is not being intercepted for cache refresh #299
Comments
The rationale for dropping the direct dependency on open-rte and open-pal is here: https://svn.open-mpi.org/trac/ompi/ticket/2092 E.
@emmanuelthome Excellent spelunking! Many thanks for reminding me that Past Jeff wrote out the whole reason why we no longer explicitly @markalle @gpaulsen This might be an excellent argument to bring in the memory hook replacements we discussed this past week (but we'll need to solve the UCX-has-the-same-memory-hooks issue).
@jsquyres You're welcome ;-) Speaking of this, I got bitten again this week by a bug that smells like a similar "hook does not get called" issue (not 100% sure yet, but at least it is an openib-specific segfault that disappears with mpi_leave_pinned 0...). I briefly considered trying out ummunotify, but it seems to be dead and buried, unfortunately. #429 hints in that direction, at least. Can you confirm?
@emmanuelthome #429 has been on my to-do list for quite a while, and I haven't gotten to it. :-( Mellanox saw some failures that looked like the Open MPI ummunotify code paths were broken, but they didn't investigate deeply -- that's what #429 is about. There was a conversation at the Open MPI dev meeting this past week (https://github.com/open-mpi/ompi/wiki/Meeting-2016-02) about using the Platform MPI method of memory hooks. That might end up moving forward, which could solve this issue and (maybe?) obviate the need for ummunotify support...? That being said, the Platform MPI method is apparently identical to the UCX method, and therefore the two can conflict with each other in userspace. Hence, ummunotify may well still be the One True Answer (i.e., kernel-level support). If you have a little time, it would be most helpful if you could verify whether #429 is actually due to the ummunotify code paths in OMPI being broken.
It SHOULD be possible, while we're setting up the memory hooks at the symbol level, to determine whether a hook is ALREADY installed and to chain the hooks. If ALL of the hook providers do this (i.e., UCX and us), then this could work (analogous to how signal handlers chain together). Of course, the signal handler approach is well documented by the OS, whereas this is undocumented and pretty hacky. I feel like this is edging further out on a thin tree limb that should have broken long ago.
Looking at the code it's "probably" possible to detect a prior usage of this trick, and even save the function pointer that the previous product had registered. Then inside our interception we could either
I will ask Mark to contribute the Platform MPI solution for this fix ASAP, possibly with a hook that also allows other hooks to fire during any hook event (similar to signal handlers).
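For readers following along, the chaining idea discussed above can be sketched roughly as follows. This is only an illustration of the pattern, not the Platform MPI, UCX, or Open MPI code: the slot `current_hook`, the functions `install_release_hook` and `fire_release_hooks`, and the two "providers" are all hypothetical names made up for this sketch. Each provider saves whatever hook was installed before it and calls it from its own hook, much like a signal handler that forwards to the handler returned by `sigaction()`.

```c
#include <stddef.h>

/* Hypothetical hook type: invoked when [addr, addr + len) is released. */
typedef void (*release_hook_fn)(void *addr, size_t len);

/* The single slot that every hook provider competes for (hypothetical). */
static release_hook_fn current_hook = NULL;

/* Install `hook` and return the previously installed one (NULL if none),
 * in the spirit of sigaction() handing back the old handler. */
static release_hook_fn install_release_hook(release_hook_fn hook)
{
    release_hook_fn previous = current_hook;
    current_hook = hook;
    return previous;
}

/* What an intercepted munmap() would call before unmapping. */
static void fire_release_hooks(void *addr, size_t len)
{
    if (current_hook != NULL) {
        current_hook(addr, len);
    }
}

/* Provider A (think "one library"): does its work, then chains. */
static release_hook_fn provider_a_previous = NULL;
static int provider_a_calls = 0;

static void provider_a_hook(void *addr, size_t len)
{
    provider_a_calls++;                 /* e.g. drop cached registrations */
    if (provider_a_previous != NULL) {
        provider_a_previous(addr, len); /* let the earlier hook run too */
    }
}

static void provider_a_init(void)
{
    provider_a_previous = install_release_hook(provider_a_hook);
}

/* Provider B (think "another library"): same pattern, installed later. */
static release_hook_fn provider_b_previous = NULL;
static int provider_b_calls = 0;

static void provider_b_hook(void *addr, size_t len)
{
    provider_b_calls++;
    if (provider_b_previous != NULL) {
        provider_b_previous(addr, len);
    }
}

static void provider_b_init(void)
{
    provider_b_previous = install_release_hook(provider_b_hook);
}
```

After calling `provider_a_init()` and then `provider_b_init()`, a single `fire_release_hooks(addr, len)` reaches B's hook first and A's hook through the saved pointer. The scheme only works if every provider cooperates, which is exactly the concern raised above: a provider that overwrites the slot without forwarding silently cuts off everyone installed before it.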
This was closed via #1495.
Per the thread starting here:
http://www.open-mpi.org/community/lists/users/2014/11/25730.php
munmap is not being intercepted properly. Late in the thread, it appears that this is happening because the wrapper compiler only passes -lmpi and does not explicitly bring in libopen-pal (where the munmap intercept lives). This means that the user's application picks up munmap from libc and never sees the OMPI munmap.
Hence, we're not intercepting munmap, and Badness occurs.
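To make the mechanism concrete, here is a minimal sketch of a symbol-level munmap intercept. It is not the Open MPI code: it assumes Linux with glibc, and the counter `intercepted_unmaps` and the helper `map_and_unmap_once` are hypothetical names used only for illustration. The point is that the intercept is just another definition of `munmap`, so it only takes effect if the dynamic linker finds it before the one in libc.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical counter standing in for "invalidate the registration cache". */
static int intercepted_unmaps = 0;

/* Same name and signature as libc's munmap(). Because this definition lives
 * in the program itself, it is found before the one in libc. */
int munmap(void *addr, size_t len)
{
    intercepted_unmaps++;
    /* Go straight to the kernel so the intercept does not call itself. */
    return (int) syscall(SYS_munmap, addr, len);
}

/* Map one anonymous page, unmap it, and report how many unmaps the
 * intercept saw for that call (-1 if mapping or unmapping failed). */
static int map_and_unmap_once(void)
{
    long page = sysconf(_SC_PAGESIZE);
    int before = intercepted_unmaps;
    void *p = mmap(NULL, (size_t) page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        return -1;
    }
    if (munmap(p, (size_t) page) != 0) {
        return -1;
    }
    return intercepted_unmaps - before;
}
```

Here the intercept wins because it is defined in the executable. When the same definition lives in a shared library that the application only reaches indirectly (through libmpi, in the case described above), whether it wins depends on where that library falls in the symbol search order relative to libc, which is why explicitly linking the library that holds the intercept can change the outcome.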
One thought on how to fix this is to re-introduce linking to libopen-rte and libopen-pal in the wrapper compilers. We distinctly took this behavior out at one point (and deliberately just linked against libmpi), but I confess to not remembering the exact reason why. It may have been taken out just because "it's the right thing -- we can have implicit dependencies pull in the rest", or it may have been so that we could support external ORTE and OPAL installations. Not sure. Someone will need to spelunk into the history to find out why. This may give insight into whether we can put the -lopen-rte -lopen-pal back in the wrappers.