
noinline to avoid compiler reading TOC before PATCHER_BEGIN #7839


Merged

Conversation

markalle
Contributor

This bug was first seen in a different product that's using the same
interception code as OMPI.  But I think it's potentially in OMPI too.

In my vanilla build of OMPI master on RH8 if I "gdb libopen-pal.so" and
"disassemble intercept_brk", I'm seeing a suspicious extra instruction
in front of PATCHER_BEGIN:
   0x00000000000d6778 <+40>:    std     r2,24(r1) // something gcc put in front
   0x00000000000d677c <+44>:    std     r2,96(r1) // PATCHER_BEGIN's toc_save
   0x00000000000d6780 <+48>:    nop               // NOPs from PATCHER_BEGIN
   0x00000000000d6784 <+52>:    nop               // that get replaced
   0x00000000000d6788 <+56>:    nop               // by instructions that
   0x00000000000d678c <+60>:    nop               // change r2
   0x00000000000d6790 <+64>:    nop               //

Later there are loads from that location like
   0x000000000019e0e4 <+132>:   ld      r2,24(r1)
that make me nervous since that's the pre-updated value.
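
For context, the ppc64 patcher macros essentially save and restore the TOC register (r2) around the window of NOPs that the patcher later overwrites. Roughly, paraphrasing the idea rather than quoting the literal OMPI definitions:

    /* Paraphrased sketch of the ppc64 patcher macros: stash r2 (the TOC
     * pointer) in a local, leave a window of NOPs that the patcher later
     * replaces with instructions that clobber r2, then restore r2 after
     * the patched region. */
    #define PATCHER_BEGIN                                   \
        unsigned long toc_save;                             \
        asm volatile ("std 2, %0" : "=m" (toc_save));       \
        asm volatile ("nop; nop; nop; nop; nop");
    #define PATCHER_END                                     \
        asm volatile ("ld 2, %0" : : "m" (toc_save));

The suspicious std r2,24(r1) above is a TOC save gcc emitted on its own before toc_save, and the later ld r2,24(r1) reloads that stale slot.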

I believe this is the same thing Nathan is describing way back in a9bc692
and his solution was to put a second call around each interception, where
the outer call is just
    intercept_brk():
        PATCHER_BEGIN
        _intercept_brk()
        PATCHER_END
and the inner call _intercept_brk() is where the bulk of the code goes.

What I'm seeing is that _intercept_brk() is being inlined and probably
negating Nathan's fix.  So I want to add __opal_attribute_noinline__ to
restore the fix.
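
A minimal sketch of the intended shape after the fix (signature simplified; __opal_attribute_noinline__ is assumed to expand to __attribute__((__noinline__)) on compilers that support it):

    /* Sketch only: marking the inner function noinline keeps the outer
     * wrapper down to PATCHER_BEGIN / call / PATCHER_END, so the compiler
     * has no inlined body from which to hoist a TOC save above
     * PATCHER_BEGIN. */
    static int __opal_attribute_noinline__ _intercept_brk (void *addr)
    {
        /* ... bulk of the interception logic goes here ... */
        return 0;
    }

    static int intercept_brk (void *addr)
    {
        int result;
        PATCHER_BEGIN;
        result = _intercept_brk (addr);
        PATCHER_END;
        return result;
    }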

With this commit in place, the disassembly of intercept_brk becomes tiny
because it's no longer inlining _intercept_brk() and the suspicious
early save of r2 is gone.  I made the same fix to all the intercept_*
functions, although intercept_brk was the only one that had a suspicious
save of r2.

As far as empirical failures go, though, we only have them from the non-OMPI
product that's using the same patcher code.  I haven't actually been able to make
OMPI fail from the above suspicious value being saved at r1+24.

Signed-off-by: Mark Allen <[email protected]>
(cherry picked from commit ddd1f57)
@ibm-ompi

The IBM CI (GNU/Scale) build failed! Please review the log, linked below.

Gist: https://gist.github.com/c9790547bff3fd7fecd537d0ec073e05

@gpaulsen self-requested a review June 17, 2020 23:27
@gpaulsen added this to the v3.1.7 milestone Jun 17, 2020
@jjhursey
Member

The crash from CI (seen here) does not seem related, as it is a double-free crash in orted (so not impacted by the patcher).

Let's try to retest: bot:ibm:gnu:retest

@markalle
Contributor Author

I agree it shouldn't be related. Should we try to track down whether something separate is broken in v3.1.x orted in the IBM CI environment?

@bwbarrett merged commit fbc8bca into open-mpi:v3.1.x on Oct 1, 2020