-
Notifications
You must be signed in to change notification settings - Fork 900
opal/asm: change ll/sc atomics to macros #5209
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
FYI @shamisp, @kawashima-fj |
:bot:retest:ompi |
opal/class/opal_fifo.h
Outdated
opal_list_item_t *tail_item; | ||
|
||
item->opal_list_next = &fifo->opal_fifo_ghost; | ||
assert (NULL == item->opal_list_next); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not require the next to be NULL in the other atomic_push functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some debugging that crept in. Will clean that out now.
@@ -194,7 +196,7 @@ static inline opal_list_item_t *opal_fifo_push_atomic (opal_fifo_t *fifo, | |||
|
|||
opal_atomic_wmb (); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we need this wmb ? there is nothing to flush, except the atomic swap right above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact I am not sure I understand how this code can be correct in the case where 2 threads are doing conflicting operations on a FIFO that contains only one element. Imagine a thread that pop while the second one push. The operation at line 202 (tail_item->opal_list_next = item
) conflicts with the opal_atomic_sc_ptr
in the pop, and can lead to a case where the head is NULL (it contains what the previous element assumed was his next) but the tail is not (it contains the result of the previous push).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This scenario is tested by test/class/opal_fifo.c. The exhaustive test causes the threads to empty the fifo.
The code that keeps things consistent is at line 276. If the pop detected that the fifo was empty but something was being pushed (next == ghost && tail != item) it waits until the next pointer has been updated then sets the head to the new item. This is the same strategy used by vader and nemesis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, looking at it there is no way it can get to one of the following lines without the read and write completing anyway. That means the wmb does nothing. It shouldn't hurt though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is not critical path, it is fine ...
opal/class/opal_fifo.h
Outdated
@@ -258,11 +271,13 @@ static inline opal_list_item_t *opal_fifo_pop_atomic (opal_fifo_t *fifo) | |||
if (ghost == next) { | |||
void *tmp = item; | |||
|
|||
if (!opal_atomic_compare_exchange_strong_ptr (&fifo->opal_fifo_tail.data.item, &tmp, ghost)) { | |||
if (!opal_atomic_compare_exchange_strong_ptr (&fifo->opal_fifo_tail.data.item, &tmp, (void *) ghost)) { | |||
while (ghost == item->opal_list_next) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do {
opal_atomic_rmb ();
} while (ghost != item->opal_list_next);
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, cleaner that way. Changing it now.
: "=&r" (_ret) \ | ||
: "r" (_addr)); \ | ||
\ | ||
ret = (typeof(ret)) _ret; \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this conversion work for any signed type ?
This commit fixes a hang that occurs with debug builds of Open MPI on aarch64 and power/powerpc systems. When the ll/sc atomics are inline functions the compiler emits load/store instructions for the function arguments with -O0. These extra load/store arguments can cause the ll reservation to be cancelled causing live-lock. Note that we did attempt to fix this with always_inline but the extra instructions are stil emitted by the compiler (gcc). There may be another fix but this has been tested and is working well. References open-mpi#3697. Close when applied to v3.0.x and v3.1.x. Signed-off-by: Nathan Hjelm <[email protected]>
Signed-off-by: Nathan Hjelm <[email protected]>
/* NTH: the LL/SC support is done through macros due to issues with non-optimized builds. The reason | ||
* is that even with an always_inline attribute the compiler may still emit instructions to store then | ||
* load the arguments to/from the stack. This sequence may cause the ll reservation to be cancelled. */ | ||
#define opal_atomic_ll_32(addr, ret) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. This is useful comment that should be present in some upper layer. It true for Power and Arm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. I plan to add some additional documentation detailing what live-lock is with LL/SC.
👍 |
This commit fixes a hang that occurs with debug builds of Open MPI on
aarch64 and power/powerpc systems. When the ll/sc atomics are inline
functions the compiler emits load/store instructions for the function
arguments with -O0. These extra load/store arguments can cause the ll
reservation to be cancelled causing live-lock.
Note that we did attempt to fix this with always_inline but the extra
instructions are stil emitted by the compiler (gcc). There may be
another fix but this has been tested and is working well.
References #3697. Close when applied to v3.0.x and v3.1.x.
Signed-off-by: Nathan Hjelm [email protected]