Skip to content

Failure to stop at MPIR_Breakpoint with OpenMPI v3.1.x and v4.0.x #7757

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
louisespellacy-arm opened this issue May 21, 2020 · 31 comments · Fixed by #8422
Closed

Failure to stop at MPIR_Breakpoint with OpenMPI v3.1.x and v4.0.x #7757

louisespellacy-arm opened this issue May 21, 2020 · 31 comments · Fixed by #8422
Labels
RTE Issue likely is in RTE or PMIx areas Target: v4.0.x
Milestone

Comments

@louisespellacy-arm
Copy link

louisespellacy-arm commented May 21, 2020

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v3.1.6 and v4.0.3

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

from source tarball with PGI 19.5 and 20.1

Please describe the system on which you are running

  • Operating system/version: RHEL7
  • Computer hardware: x86_64
  • Network type: self

Details of the problem

GDB unable to stop at MPIR_Breakpoint when debugging mpirun process with the MPIR interface with out-of-the-box OpenMPI installations. MPIR_Breakpoint is being optimized out.
#5501 relates to this issue and the reproducer can be re-used. This issue is not occurring with PGI 18.7.

However, providing CFLAGS=-O1, FCFLAGS=-O1 and CXXFLAGS=-O1 when building OpenMPI allows GDB to stop at MPIR_Breakpoint.

It does not work if you use -O2.

Upon looking at the building of orted/orted_submit.c, it is being built with CFLAGS when it shouldn't be.

The compile and link lines being generated by automake in orte/Makefile are:

liborted_mpir_la_LINK = $(LIBTOOL) $(AM_V_lt) --tag=CC \
      $(AM_LIBTOOLFLAGS) $(LIBTOOLFLAGS) --mode=link $(CCLD) \
      $(liborted_mpir_la_CFLAGS) $(CFLAGS) $(AM_LDFLAGS) $(LDFLAGS) \
      -o $@

$(AM_V_CC_no)$(LIBTOOL) $(AM_V_lt) --tag=CC $(AM_LIBTOOLFLAGS) $(LIBTOOLFLAGS) \
--mode=compile $(CC) $(DEFS) $(DEFAULT_INCLUDES) $(INCLUDES) $(AM_CPPFLAGS) \
$(CPPFLAGS) $(liborted_mpir_la_CFLAGS) $(CFLAGS) -c -o orted/liborted_mpir_la-orted_submit.lo \
`test -f 'orted/orted_submit.c' || echo '$(srcdir)/'`orted/orted_submit.c

In orte/orted/Makefile.am, the CFLAGS are defined to remove the optimzations, but $(CFLAGS) is being added to the compile line, adding the optimzations anyway.

liborted_mpir_la_CFLAGS = $(CFLAGS_WITHOUT_OPTFLAGS) $(DEBUGGER_CFLAGS)

By manually removing CFLAGS from the compile and link line for orted_submit.c, there would be no optimization when building orted/orted_submit.c, as described in the comments.

@jjhursey jjhursey added the RTE Issue likely is in RTE or PMIx areas label May 22, 2020
@awlauria
Copy link
Contributor

What version of automake are you using?

It was found that older versions of automake are adding the -O3, and newer versions are doing the correct thing in not adding it.

See this comment/discussion:

#6828 (comment)

@awlauria
Copy link
Contributor

it's also documented in the code of orted_submit.c, for reference:

https://github.com/awlauria/ompi/blob/00106f5ac96a3d9e6288ec07dc47e325897cd5f8/orte/orted/orted_submit.c#L181

@jsquyres
Copy link
Member

I think @awlauria is asking: are you running autogen.pl during the build process? Or just ./configure / make?

@louisespellacy-arm
Copy link
Author

I was building out-of-the-box openmpi-3.1.6.tar.gz and openmpi-4.0.3.tar.gz when first testing. Using
the configure script that it ships with.
./configure / make
Unless modified by the user, the resulting CFLAGS is:

$ ompi_info -a | grep CFLAGS
Build CFLAGS: -O3 -DNDEBUG -finline-functions -fno-strict-aliasing

By just running ./configure / make you get the following line in the orte Makefile:

$(AM_V_CC_no)$(LIBTOOL) $(AM_V_lt) --tag=CC $(AM_LIBTOOLFLAGS) $(LIBTOOLFLAGS) --mode=compile $(CC) $(DEFS) $(DEFAULT_INCLUDES) $(INCLUDES) $(AM_CPPFLAGS) $(CPPFLAGS) $(liborted_mpir_la_CFLAGS) $(CFLAGS) -c -o orted/liborted_mpir_la-orted_submit.lo `test -f 'orted/orted_submit.c' || echo '$(srcdir)/'`orted/orted_submit.c

Later I pulled from the most recent branch of v4.0.x and used automake/configure/make etc. This results in the makeline for orted_submit.c as seen in my comment.

 $ automake --version
automake (GNU automake) 1.15.1

To ensure that $(CFLAGS) was not added, I needed to make changes similar to those found in https://github.com/awlauria/ompi/blob/00106f5ac96a3d9e6288ec07dc47e325897cd5f8/ompi/debuggers/Makefile.am#L28

@louisespellacy-arm
Copy link
Author

Is there any update on this issue? This issue persists with OpenMPI 4.1.0 recently released and PGI 20.1.

orted_submit.c is still being compiled with -O3.

@awlauria
Copy link
Contributor

@louisespellacy-arm sorry, this slipped through the cracks. I'll take a look.

@louisespellacy-arm
Copy link
Author

louisespellacy-arm commented Jan 22, 2021

Attempted to fix the above issue with the following changes but unsure if its the best approach:

diff --git a/orte/orted/Makefile.am b/orte/orted/Makefile.am
index 1235e51e69..7523cc5336 100644
--- a/orte/orted/Makefile.am
+++ b/orte/orted/Makefile.am
@@ -19,6 +19,8 @@
 #
 # $HEADER$
 #
+CFLAGS = $(CFLAGS_WITHOUT_OPTFLAGS) $(DEBUGGER_CFLAGS)
+#
 
 # This makefile.am does not stand on its own - it is included from orte/Makefile.am
 
@@ -38,7 +40,7 @@ lib@ORTE_LIB_PREFIX@open_rte_la_SOURCES += \
 noinst_LTLIBRARIES += liborted_mpir.la
 liborted_mpir_la_SOURCES = \
        orted/orted_submit.c
-liborted_mpir_la_CFLAGS = $(CFLAGS_WITHOUT_OPTFLAGS) $(DEBUGGER_CFLAGS)
+#liborted_mpir_la_CFLAGS = $(CFLAGS_WITHOUT_OPTFLAGS) $(DEBUGGER_CFLAGS)
 
 lib@ORTE_LIB_PREFIX@open_rte_la_LIBADD += liborted_mpir.la

@gpaulsen gpaulsen added this to the v4.0.6 milestone Jan 25, 2021
@awlauria
Copy link
Contributor

@louisespellacy-arm while that will work in a pinch, I don't think that is the best solution. Doing that will propagate the debugger CFLAGS for everything below the orte tree (losing -O3) for basically all of orte as far as I can tell.

Unfortunately, it seems that libtool is blasting every file with -O3 again. The intent was (as far as I remember) to not compile this file with -O3, but I don't see an easy/clean way of doing that for this one library.

One way that I can think of is to have something like an OPAL_C/CPPFLAGS variable that propagates down to every Makefile.am. It's a tedious change, but will give more control to every library component on what CFLAGS to build with.

@awlauria
Copy link
Contributor

awlauria commented Jan 26, 2021

I'm no configure expert, so maybe there's a better way. I have a branch where this is partially implemented, and it seems to as well as before. Basically I just replaced CFLAGS/CPPFLAGS with OPAL_CFLAGS/OPAL_CPPFLAGS, and set CFLAGS/CPPFLAGS to be empty, preventing libtool from tacking them on. Then I've gone across a subset of makefiles and set them for each library. They will also have to be passed down to the likes of hwloc/prrte and other 3rd party packages, but that should be just as easy.

This approach is in line with what gnu suggests in the automake docs:

Using variables like this gives you full control over the ordering of the flags. For instance, 
if there is a flag in $(WARNINGCFLAGS) that you want to negate for a particular target,
you can use something like 'prog1_CFLAGS = $(AM_CFLAGS) -no-flag'. If all of 
 these flags had been forcefully appended to CFLAGS, there would be no way to disable one flag.
 Yet another reason to leave user variables to users. C

https://www.gnu.org/software/automake/manual/html_node/Flag-Variables-Ordering.html

I can continue this work and post a patch if the community is interested.

@jjhursey
Copy link
Member

The OPAL_{CFLAGS|CPPFLAGS} method looks like it would work, but is pretty time-intensive to implement. Before going too much further down that road the community (at least those familiar with the build system - @jsquyres @bwbarrett @ggouaillardet mayeb others) should weigh in so Austen doesn't waste effort here if there is another way.

The only other way I thought of was to inject a script at the end of configure that would modify the generated Makefile (or at the end of autogen.pl if we can find a way) to strip out the CFLAGS for the orte_submit.c compilation step. It's pretty hacky, but should do the trick.

@gpaulsen if we don't reach a decision beforehand, can you add this to next week's agenda?

@jsquyres
Copy link
Member

It looks like we did this differently in https://github.com/open-mpi/ompi/blob/master/ompi/debuggers/Makefile.am -- we overrode CFLAGS for the entire Makefile.am. It looks like we did that because the _LINK line for the .lo files are like this (from the rendered orte/orted/Makefile -- showing line numbers and line breaks from my text editor, sorry):

2461 orted/liborted_mpir_la-orted_submit.lo: orted/orted_submit.c
2462         $(AM_V_CC)$(LIBTOOL) $(AM_V_lt) --tag=CC $(AM_LIBTOOLFLAGS) $(LIBTOOLFLAGS)\
      --mode=compile $(CC) $(DEFS) $(DEFAULT_INCLUDES) $(INCLUDES) $(AM_CPPFLAGS) $(CPPF\
     LAGS) $(liborted_mpir_la_CFLAGS) $(CFLAGS) -MT orted/liborted_mpir_la-orted_submit.\
     lo -MD -MP -MF orted/$(DEPDIR)/liborted_mpir_la-orted_submit.Tpo -c -o orted/libort\
     ed_mpir_la-orted_submit.lo `test -f 'orted/orted_submit.c' || echo '$(srcdir)/'`ort\
     ed/orted_submit.c

Notice how it still has $(CFLAGS) in there (which contains things like -O3). ☹️

We can override CFLAGS for the entire ompi/debuggers/Makefile.am because that's a small Makefile.am.

This is not the case for orte/orted/Makefile.am -- that file is included in the upper-level orte/Makefile.am rather than traversing down into orte/orted and invoking make. We can't override CFLAGS for all of ORTE, because that will have larger side effects (e.g., remove opt flags like -O3 from all ORTE objects).

It may well be necessary to make orte/orted/Makefile.am be a standalone Makefile.am (where you can override CFLAGS for the whole file)...?

@awlauria
Copy link
Contributor

awlauria commented Jan 26, 2021

It may well be necessary to make orte/orted/Makefile.am be a standalone Makefile.am (where you can override CFLAGS for the whole file)...?

That would work to limit the scope some, but all files beneath orte/orted would also be affected (there's a pmix dir there), right?

The global CFLAGS hammer is...burdensome to work-around. If we were to remove it, the added flexibility it will give developers/users to mix and match flags between components is a nice bonus as well.

@jsquyres
Copy link
Member

The end goal here is to compile the one file that is necessary without the regular CFLAGS. You may even want to move that one file into its own subdirectory so that it can have its own Makefile.am. We more-or-less did that in ompi/debuggers, and we've definitely done similar things in the Fortran MPI bindings.

This is unfortunately just how Automake rolls...

@jsquyres
Copy link
Member

Well, actually, it just occurs to me that there could be another method that could work, but it may be a bit crazy / not worth it.

You could just override the rule for orted/liborted_mpir_la-orted_submit.lo in orte/orted/Makefile.am.

I say that this is a little crazy because, honestly, this scheme is a little dicey: if Libtool ever changes the rules that they generate, we'll be out of sync with them, and that could be problematic.

@awlauria
Copy link
Contributor

The end goal here is to compile the one file that is necessary without the regular CFLAGS. You may even want to move that one file into its own subdirectory so that it can have its own Makefile.am. We more-or-less did that in ompi/debuggers, and we've definitely done similar things in the Fortran MPI bindings.

This is unfortunately just how Automake rolls...

True this is only for one file. But the way we're doing it now is not recommended by the automake docs as I read them - IE using CFLAGS as a global entity 'that everyone shall have'. And while right now we are running into this issue for mpir, which admittedly is going away, I'm sure we will eventually run into this issue again. And when that happens, we'll have to reshuffle things around or tinker with make files instead of adjusting one flag for that specific library.

You may even want to move that one file into its own subdirectory so that it can have its own Makefile.am. We more-or-less did >that in ompi/debuggers, and we've definitely done similar things in the Fortran MPI bindings.

I tried that approach to no success. I gave it it's own Makefile.am, but the resetting of CFLAGS still wound up in the generated orte/Makefile. I probably am missing something here, will tinker with it. I agree in theory that this will work as a fix for this file if I can get it working, but it feels like kicking the can until we have to go through this again.

awlauria added a commit to awlauria/ompi that referenced this issue Jan 27, 2021
In optimized builds, CFLAGS contains various optimizations such as -O3,
and is propogated by automake to all files. To work-around this,
isolate MPIR_Breakpoint() and other MPIR_* symbols into its own library
built with debugger specific CFLAGS.

To prevent CFLAGS from being polluted elsewhere in the make tree, build
this in its own tiny stand-alone makefile.

Fixes open-mpi#7757
awlauria added a commit to awlauria/ompi that referenced this issue Jan 27, 2021
In optimized builds, CFLAGS contains various optimizations such as -O3,
and is propogated by automake to all files. To work-around this,
isolate MPIR_Breakpoint() and other MPIR_* symbols into its own library
built with debugger specific CFLAGS.

To prevent CFLAGS from being polluted elsewhere in the make tree, build
this in its own tiny stand-alone makefile.

Fixes open-mpi#7757
awlauria added a commit to awlauria/ompi that referenced this issue Jan 27, 2021
In optimized builds, CFLAGS contains various optimizations such as -O3,
and is propogated by automake to all files. To work-around this,
isolate MPIR_Breakpoint() and other MPIR_* symbols into its own library
built with debugger specific CFLAGS.

To prevent CFLAGS from being polluted elsewhere in the make tree, build
this in its own tiny stand-alone makefile.

Fixes open-mpi#7757

Signed-off-by: Austen Lauria <[email protected]>
awlauria added a commit to awlauria/ompi that referenced this issue Jan 27, 2021
In optimized builds, CFLAGS contains various optimizations such as -O3,
and is propogated by automake to all files. To work-around this,
isolate MPIR_Breakpoint() and other MPIR_* symbols into its own library
built with debugger specific CFLAGS.

To prevent CFLAGS from being polluted elsewhere in the make tree, build
this in its own tiny stand-alone makefile.

Fixes open-mpi#7757

Signed-off-by: Austen Lauria <[email protected]>
awlauria added a commit to awlauria/ompi that referenced this issue Jan 27, 2021
In optimized builds, CFLAGS contains various optimizations such as -O3,
and is propogated by automake to all files. To work-around this,
isolate MPIR_Breakpoint() and other MPIR_* symbols into its own library
built with debugger specific CFLAGS.

To prevent CFLAGS from being polluted elsewhere in the make tree, build
this in its own tiny stand-alone makefile.

Fixes open-mpi#7757

Signed-off-by: Austen Lauria <[email protected]>
awlauria added a commit to awlauria/ompi that referenced this issue Jan 27, 2021
In optimized builds, CFLAGS contains various optimizations such as -O3,
and is propogated by automake to all files. To work-around this,
isolate MPIR_Breakpoint() and other MPIR_* symbols into its own library
built with debugger specific CFLAGS.

To prevent CFLAGS from being polluted elsewhere in the make tree, build
this in its own tiny stand-alone makefile.

Fixes open-mpi#7757

Signed-off-by: Austen Lauria <[email protected]>
awlauria added a commit to awlauria/ompi that referenced this issue Jan 28, 2021
In optimized builds, CFLAGS contains various optimizations such as -O3,
and is propogated by automake to all files. To work-around this,
isolate MPIR_Breakpoint() and other MPIR_* symbols into its own library
built with debugger specific CFLAGS.

To prevent CFLAGS from being polluted elsewhere in the make tree, build
this in its own tiny stand-alone makefile.

Fixes open-mpi#7757

Signed-off-by: Austen Lauria <[email protected]>
(cherry picked from commit 6d82003)
@awlauria
Copy link
Contributor

@louisespellacy-arm See #8428 for the v4.1.x fix, if you have time to verify. Thanks for your patience!

@louisespellacy-arm
Copy link
Author

Built using the pull requested linked and built with PGI 20.1. Checking breaks at MPIR_Breakpoint seems to give desired output - at the location of new file.

  $ gdb --quiet --args mpirun -np 2 ./hello_c
Reading symbols from mpirun...done.
(gdb) start
Temporary breakpoint 1 at 0x401180: file ../../../../orte/tools/orterun/main.c, line 12.
Starting program: /home/louspe01/install/openmpi-v4.1.x_pgi_20.1_pulled_fix/bin/mpirun -np 2 ./hello_c
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Temporary breakpoint 1, main () at ../../../../orte/tools/orterun/main.c:12
12	{
(gdb) set MPIR_being_debugged=1
(gdb) break MPIR_Breakpoint
Breakpoint 2 at 0x7ffff4d20c50: file ../../../../orte/orted/orted-mpir/orted_mpir_breakpoint.c, line 63.
(gdb) c
Continuing.
[New Thread 0x7ffff0102700 (LWP 25236)]
[New Thread 0x7fffef4f7700 (LWP 25237)]
[New Thread 0x7fffee4d5700 (LWP 25238)]
--------------------------------------------------------------------------
Open MPI has detected that you have attached a debugger to this MPI
job, and that debugger is using the legacy "MPIR" method of
attachment.

Please note that Open MPI has deprecated the "MPIR" debugger
attachment method in favor of the new "PMIx" debugger attchment
mechanisms.

*** This means that future versions of Open MPI may not support the
*** "MPIR" debugger attachment method at all.  Specifically: the
*** debugger you just attached may not work with future versions of
*** Open MPI.

You may wish to contact your debugger vendor to inquire about support
for PMIx-based debugger attachment mechanisms. Meantime, you can
disable this warning by setting the OMPI_MPIR_DO_NOT_WARN envar to 1.
--------------------------------------------------------------------------

Thread 1 "mpirun" hit Breakpoint 2, MPIR_Breakpoint ()
    at ../../../../orte/orted/orted-mpir/orted_mpir_breakpoint.c:63
63	    orte_noop_mpir_breakpoint_ptr = (volatile void *) 0x42;

@gpaulsen
Copy link
Member

This should be resolved in v4.0.6 by #8422.

@louisespellacy-arm
Copy link
Author

Excellent thanks for all the help!

@jsquyres
Copy link
Member

@gpaulsen @awlauria Is this not needed on master?

@awlauria
Copy link
Contributor

It's my understanding that MPIR is not on master - the standard is shifting to PMix based standards:

https://github.com/openpmix/mpir-to-pmix-guide

That said there may be some adjustments needed in PMix to make sure the same thing doesn't happen with the new implementation.

@jsquyres
Copy link
Member

Umm... right. Duh. Got it.

@rhc54
Copy link
Contributor

rhc54 commented Jan 29, 2021

No memory mapping is done with PMIx tools, so we shouldn't have to worry about this particular problem 😄

@louisespellacy-arm
Copy link
Author

Are changes still being made to the v3.1.x branch? PGI/NVHPC is still shipping a pre-built openmpi-3.1.x and openmpi-4.0.x - would it be possible to apply the pull request to v3.1.x also?

@awlauria
Copy link
Contributor

awlauria commented Feb 1, 2021

It's my understanding that v3.1 is closed, or very limited. @jsquyres @bwbarrett would you consider taking it - if only for the nightly builds.

@jsquyres
Copy link
Member

jsquyres commented Feb 1, 2021

We talked about this today (i.e., merging to v3.1). @awlauria is going to make a v3.1.x PR. If it's on the same order of magnitude as the master / v4.0.x / v4.1.x PRs (i.e., self-contained and low risk), we're open to merging it on v3.1.x. We will almost certainly not do a new release, though -- but a snapshot build tarball from the v3.1.x branch will be available within 24 hours of merging (see https://www.open-mpi.org/nightly/v3.1.x/).

awlauria added a commit to awlauria/ompi that referenced this issue Feb 1, 2021
In optimized builds, CFLAGS contains various optimizations such as -O3,
and is propogated by automake to all files. To work-around this,
isolate MPIR_Breakpoint() and other MPIR_* symbols into its own library
built with debugger specific CFLAGS.

To prevent CFLAGS from being polluted elsewhere in the make tree, build
this in its own tiny stand-alone makefile.

Fixes open-mpi#7757

Signed-off-by: Austen Lauria <[email protected]>
(cherry picked from commit 6d82003)
@awlauria
Copy link
Contributor

awlauria commented Feb 2, 2021

Closing this, as all PR's have been merged. @louisespellacy-arm if you encounter any other issues, feel free to re-open or create a new issue.

Thanks!

@awlauria awlauria closed this as completed Feb 2, 2021
@louisespellacy-arm
Copy link
Author

We talked about this today (i.e., merging to v3.1). @awlauria is going to make a v3.1.x PR. If it's on the same order of magnitude as the master / v4.0.x / v4.1.x PRs (i.e., self-contained and low risk), we're open to merging it on v3.1.x. We will almost certainly not do a new release, though -- but a snapshot build tarball from the v3.1.x branch will be available within 24 hours of merging (see https://www.open-mpi.org/nightly/v3.1.x/).

Thanks! Useful so we can indicate to customers to pull latest if they encounter issues!

@nmorey
Copy link
Contributor

nmorey commented Apr 30, 2021

Is there a reason why the new library was not versioned ?

@awlauria
Copy link
Contributor

@nmorey this seems like an oversight on my part.

There is a PR for v4.1.x here: #8892

It should also be brought over to v4.0.x, and possibly the v3.* branches.

@nmorey
Copy link
Contributor

nmorey commented Apr 30, 2021

@awlauria Thanks. I had written something similar but I'll pick this patch instead for the openmpi4 4.1.1 package on SUSE

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
RTE Issue likely is in RTE or PMIx areas Target: v4.0.x
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants