
noinline to avoid compiler reading TOC before PATCHER_BEGIN #7839


Merged

Conversation

markalle
Contributor

This bug was first seen in a different product that's using the same
interception code as OMPI.  But I think it's potentially in OMPI too.

In my vanilla build of OMPI master on RH8 if I "gdb libopen-pal.so" and
"disassemble intercept_brk", I'm seeing a suspicious extra instruction
in front of PATCHER_BEGIN:
   0x00000000000d6778 <+40>:    std     r2,24(r1) // something gcc put in front
   0x00000000000d677c <+44>:    std     r2,96(r1) // PATCHER_BEGIN's toc_save
   0x00000000000d6780 <+48>:    nop               // NOPs from PATCHER_BEGIN
   0x00000000000d6784 <+52>:    nop               // that get replaced
   0x00000000000d6788 <+56>:    nop               // by instructions that
   0x00000000000d678c <+60>:    nop               // change r2
   0x00000000000d6790 <+64>:    nop               //

Later there are loads from that location like
   0x000000000019e0e4 <+132>:   ld      r2,24(r1)
that make me nervous since that's the pre-updated value.
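
For context, the ppc64 patcher macros essentially save and restore the TOC register (r2) around the window of NOPs that the patcher later overwrites. Roughly, paraphrasing the idea rather than quoting the literal OMPI definitions:

    /* Paraphrased sketch of the ppc64 patcher macros: stash r2 (the TOC
     * pointer) in a local, leave a window of NOPs that the patcher later
     * replaces with instructions that clobber r2, then restore r2 after
     * the patched region. */
    #define PATCHER_BEGIN                                   \
        unsigned long toc_save;                             \
        asm volatile ("std 2, %0" : "=m" (toc_save));       \
        asm volatile ("nop; nop; nop; nop; nop");
    #define PATCHER_END                                     \
        asm volatile ("ld 2, %0" : : "m" (toc_save));

The suspicious std r2,24(r1) above is a TOC save gcc emitted on its own before toc_save, and the later ld r2,24(r1) reloads that stale slot.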

I believe this is the same thing Nathan is describing way back in a9bc692
and his solution was to put a second call around each interception, where
the outer call is just
    intercept_brk():
        PATCHER_BEGIN
        _intercept_brk()
        PATCHER_END
and the inner call _intercept_brk() is where the bulk of the code goes.

What I'm seeing is that _intercept_brk() is being inlined and probably
negating Nathan's fix.  So I want to add __opal_attribute_noinline__ to
restore the fix.
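
A minimal sketch of the intended shape after the fix (signature simplified; __opal_attribute_noinline__ is assumed to expand to __attribute__((__noinline__)) on compilers that support it):

    /* Sketch only: marking the inner function noinline keeps the outer
     * wrapper down to PATCHER_BEGIN / call / PATCHER_END, so the compiler
     * has no inlined body from which to hoist a TOC save above
     * PATCHER_BEGIN. */
    static int __opal_attribute_noinline__ _intercept_brk (void *addr)
    {
        /* ... bulk of the interception logic goes here ... */
        return 0;
    }

    static int intercept_brk (void *addr)
    {
        int result;
        PATCHER_BEGIN;
        result = _intercept_brk (addr);
        PATCHER_END;
        return result;
    }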

With this commit in place, the disassembly of intercept_brk becomes tiny
because it's no longer inlining _intercept_brk() and the suspicious
early save of r2 is gone.  I made the same fix to all the intercept_*
functions, although intercept_brk was the only one that had a suspicious
save of r2.

As far as empirical failures go, though, we only have them from the non-OMPI
product that's using the same patcher code.  I haven't actually been able to make
OMPI fail from the above suspicious value being saved at r1+24.

Signed-off-by: Mark Allen <[email protected]>
(cherry picked from commit ddd1f57)
@ibm-ompi

The IBM CI (GNU/Scale) build failed! Please review the log, linked below.

Gist: https://gist.github.com/c9790547bff3fd7fecd537d0ec073e05

@gpaulsen self-requested a review June 17, 2020 23:27
@gpaulsen added this to the v3.1.7 milestone Jun 17, 2020
@jjhursey
Member

The crash from CI (seen here) does not seem related, as it is a double-free crash in orted (so not impacted by the patcher).

Let's try to retest: bot:ibm:gnu:retest

@markalle
Contributor Author

I agree it shouldn't be related. Should we try to track down whether something separate is broken in v3.1.x orted in the IBM CI environment?

@bwbarrett merged commit fbc8bca into open-mpi:v3.1.x on Oct 1, 2020