
opal/sync: fix race condition #1816


Merged
merged 4 commits into from
Jun 28, 2016

Conversation

hjelmn
Copy link
Member

@hjelmn hjelmn commented Jun 24, 2016

This commit fixes a race condition discovered by @artpol84. The race
happens when a signalling thread decrements the sync count to 0 then
goes to sleep. If the waiting thread runs and detects the count == 0
before going to sleep on the condition variable it will destroy the
condition variable while the signalling thread is potentially still
processing the completion. The fix is to add a non-atomic member to
the sync structure that indicates another thread is handling
completion. Since the member will only be set to false by the
initiating thread and the completing thread, the variable does not
need to be protected. When destroying a condition variable the
waiting thread needs to wait until the signalling thread is finished.

Thanks to @artpol84 for tracking this down.

Fixes #1813

Signed-off-by: Nathan Hjelm [email protected]
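The pattern described in the commit message can be sketched in portable C11 with pthreads. This is a hypothetical mirror of the idea, not the actual opal code (`wait_sync_t`, `wait_sync_update`, and `wait_sync_release` are illustrative names): the signaling thread raises a flag before touching the count, and the waiter refuses to destroy the condition variable until that flag drops.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical mirror of ompi_wait_sync_t: the extra "signaling" flag
 * tells the waiter that a signaling thread is still inside the
 * completion path, so the condition variable must not be destroyed yet. */
typedef struct {
    atomic_int      count;
    volatile bool   signaling;
    pthread_mutex_t lock;
    pthread_cond_t  condition;
} wait_sync_t;

/* Waiter side: even if count already reached 0, spin until the
 * signaling thread has left the completion path before destroying. */
static void wait_sync_release(wait_sync_t *sync)
{
    while (sync->signaling) { /* busy-wait; real code may yield */ }
    pthread_mutex_destroy(&sync->lock);
    pthread_cond_destroy(&sync->condition);
}

/* Signaler side: mark ourselves busy before the decrement and clear
 * the flag only once the cond_signal is done. */
static void wait_sync_update(wait_sync_t *sync, int updates)
{
    sync->signaling = true;
    /* order the flag store before the decrement becomes visible */
    atomic_thread_fence(memory_order_release);
    if (atomic_fetch_sub(&sync->count, updates) - updates > 0) {
        sync->signaling = false;
        return;
    }
    pthread_mutex_lock(&sync->lock);
    pthread_cond_signal(&sync->condition);
    pthread_mutex_unlock(&sync->lock);
    sync->signaling = false;
}
```

With this shape, the interleaving from the bug report (waiter sees count == 0 and frees the sync while the signaler is still signaling) is closed: the waiter cannot pass `wait_sync_release` while `signaling` is still true.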

@hjelmn
Copy link
Member Author

hjelmn commented Jun 24, 2016

@artpol84 Getting this running through Jenkins. Will try to get it retested multiple times to see if it fails. In theory it shouldn't.

@jladd-mlnx
Copy link
Member

bot:retest

@hjelmn
Copy link
Member Author

hjelmn commented Jun 24, 2016

:bot:retest:

1 similar comment
@hjelmn
Copy link
Member Author

hjelmn commented Jun 24, 2016

:bot:retest:

@jladd-mlnx
Copy link
Member

Retest after adding the original Jenkins command line with ob1 (allowing UCX to load and unload).

bot:retest

@hjelmn
Copy link
Member Author

hjelmn commented Jun 25, 2016

I can make this even faster as the lock/unlock is not really needed for the condition signal here. Will leave that for later though.

@hjelmn
Copy link
Member Author

hjelmn commented Jun 25, 2016

Looking good. Sending through again.

:bot:retest:

@artpol84
Copy link
Contributor

artpol84 commented Jun 25, 2016

@hjelmn I guess the race condition is still possible here:

1. (sync)->signalling is initialized to false, and this is correct because you may wait late (after all the wait_sync_update calls have happened).
2. Assume instruction reordering in wait_sync_update (correct me if I'm wrong, but it's possible):

count | PML progress thread                 | MPI_Wait thread
------+-------------------------------------+-----------------------------------------------
  1   | OPAL_THREAD_ADD32(&sync->count, -1) |
  0   |                                     | if( sync->count > 0 ) - false => exit the loop
  0   |                                     | runs till the end and frees sync
  0   | sync->signalling = true             |
  0   | WAIT_SYNC_SIGNAL(sync); - error     |

Same error here. I also assume there will be problems with several independently running PML progress threads.

So it seems that you need a memory fence here:

static inline void wait_sync_update(ompi_wait_sync_t *sync, int updates, int status)
{
    sync->signalling = true;

    mem_fence(); // <---------

    if( OPAL_LIKELY(OPAL_SUCCESS == status) ) {
        if( 0 != (OPAL_THREAD_ADD32(&sync->count, -updates)) ) {
            return;
        }
    } else {
        /* this is an error path so just use the atomic */
        opal_atomic_swap_32 (&sync->count, 0);
        sync->status = OPAL_ERROR;
    }
    WAIT_SYNC_SIGNAL(sync);
}

This race will be much rarer than the original one, and harder to track.
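The hazard above can equivalently be closed with a release-ordered decrement instead of a standalone fence. A minimal C11 sketch of the idea (hypothetical names, not the proposed opal patch):

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int    count;
    volatile bool signaling;   /* plain store, ordered by the release below */
} sync_sketch_t;

/* The release ordering on the decrement guarantees that any thread
 * observing the new count (with acquire semantics) also observes
 * signaling == true, so it cannot free the structure too early.
 * Returns the count remaining after the update. */
static int sync_update_release(sync_sketch_t *s, int updates)
{
    s->signaling = true;
    return atomic_fetch_sub_explicit(&s->count, updates,
                                     memory_order_release) - updates;
}
```

Both the fence and the release-ordered RMW constrain the compiler as well as the CPU, which is what the reordering scenario above requires.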

@hjelmn
Copy link
Member Author

hjelmn commented Jun 25, 2016

Instruction reordering will not happen. pthread calls are by definition effectively a memory barrier. We could add an isync just to be safe if you think it is warranted. We have isync for both ppc and arm, which are the supported platforms that do instruction reordering.

@hjelmn
Copy link
Member Author

hjelmn commented Jun 25, 2016

Oh wait, before the atomic. Sorry, I didn't catch that part when looking at the email version. You are absolutely correct.

@hjelmn
Copy link
Member Author

hjelmn commented Jun 25, 2016

An atomic wmb would be fine there. An isync as I suggested might be too weak.

@artpol84
Copy link
Contributor

The setting of the signal and the changing of the count don't go through pthread calls; those come later.


@hjelmn
Copy link
Member Author

hjelmn commented Jun 25, 2016

@artpol84 Yup. Just caught that after hitting comment :)

@hjelmn
Copy link
Member Author

hjelmn commented Jun 25, 2016

Will add the barrier now.

@hjelmn
Copy link
Member Author

hjelmn commented Jun 25, 2016

And correct to the more common signaling spelling :D

@hjelmn hjelmn force-pushed the request_perfm_regression branch from 7c30bff to d204f67 Compare June 25, 2016 03:42
@hjelmn
Copy link
Member Author

hjelmn commented Jun 25, 2016

@artpol84 Thanks for catching that. Should be fixed now.

@hjelmn
Copy link
Member Author

hjelmn commented Jun 25, 2016

Won't commit this until I get a +1 from @artpol84.

Looks like we had a finalize hang. Looks unrelated. @artpol84 Any clue as to what happened? The hang wasn't even in a threading test. (edit: Looks like an rdmacm bug. I don't own that code but I will try to take a look.)

@hjelmn
Copy link
Member Author

hjelmn commented Jun 25, 2016

:bot:retest:

@hjelmn
Copy link
Member Author

hjelmn commented Jun 25, 2016

@artpol84 Feel free to open a PR vs my branch and add the Mellanox copyright. I prefer that the commit adding a copyright line come from a member of the copyrighting organization.

@@ -75,6 +89,9 @@ static inline int sync_wait_st (ompi_wait_sync_t *sync)
*/
static inline void wait_sync_update(ompi_wait_sync_t *sync, int updates, int status)
{
sync->signaling = true;
/* ensure the signalling value is commited before updating the count */
opal_atomic_wmb ();
Copy link
Contributor


@hjelmn I may be wrong, but it seems that wmb addresses the possibility of CPU-level reordering. Do we also address compiler reordering?

Copy link
Contributor


I mean that adding something like asm volatile("" ::: "memory"); may be needed.

Copy link
Contributor


Ok, this is already done in wmb()
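For reference, this is the kind of compiler barrier being discussed (gcc-style inline asm; a minimal sketch with made-up names, not the opal implementation — opal_atomic_wmb() additionally orders stores at the CPU level on weakly ordered architectures):

```c
/* A pure compiler barrier: no CPU fence instruction is emitted, but
 * the compiler may not move memory accesses across it.  Hardware write
 * barriers typically include this clobber as well. */
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

static int data, flag;

/* Publish data, then raise the flag; the barrier keeps the compiler
 * from reordering the two stores. */
static void publish(int value)
{
    data = value;
    COMPILER_BARRIER();
    flag = 1;
}
```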

@hjelmn hjelmn force-pushed the request_perfm_regression branch from d204f67 to fb455f0 Compare June 27, 2016 02:14
@artpol84
Copy link
Contributor

artpol84 commented Jun 27, 2016

Looks like waitany and waitsome are broken. I filed a PR against @hjelmn's branch:
hjelmn@42e6251

This can be easily reproduced with the following modifications of the overlap test:
overlap_waitany.txt
overlap_waitsome.txt

@hjelmn @bosilca please, check.

@artpol84
Copy link
Contributor

@jladd-mlnx @miked-mellanox I think we (Mellanox) will need to include these tests in our jenkins/MTT suite.
They exercise both corner cases: late completion and early completion of all requests.
We can also add versions without delay to exercise other possible paths.
Since these are race conditions, multiple executions are probably needed.

@artpol84
Copy link
Contributor

@jsquyres, sure, we can do that as well.

@hjelmn
Copy link
Member Author

hjelmn commented Jun 27, 2016

@bosilca, @artpol84 Are we all OK with this change the way it is? @artpol84 You can close your PR. Due to a rebase it would no longer merge, so I cherry-picked the commit.

@artpol84
Copy link
Contributor

@hjelmn, what do you mean? I don't see the waitsome/waitany fix in this PR.

@bosilca
Copy link
Member

bosilca commented Jun 27, 2016

I think this patch fixes the original issue, and as such it is ready to go. The performance issues discovered meanwhile should be fixed by a subsequent ticket, once we have the correct fix.

@artpol84
Copy link
Contributor

There were a) a bug and b) a performance issue.
I PRed a bugfix for waitsome and waitany here.
The perf issue is pending in another request.


@artpol84
Copy link
Contributor

hjelmn@42e6251
This is a bug fix

@artpol84
Copy link
Contributor

Without this bugfix the following tests (already mentioned above) hang 100% of the time:
overlap_waitany.txt
overlap_waitsome.txt

I think the PR must fix the original issue without introducing a new hang.

@bosilca
Copy link
Member

bosilca commented Jun 28, 2016

42e6251 introduces its own issues (I commented on it on the other issue about the possible race condition). A possible fix is to generate a sync_update (which will take care of the signaling field) in the wait* operation if there is nothing to wait for.
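A rough illustration of that suggestion, with hypothetical names (a sketch of the idea, not the actual opal code): when a wait* operation discovers there is nothing outstanding, it drives the same completion path a signaling thread would, so the signaling field is raised and lowered exactly as in the normal case and a later destroy cannot race.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int    count;
    volatile bool signaling;
} sync_sketch_t;

/* Hypothetical: called by MPI_Wait*-style code when it finds no
 * incomplete requests.  Running the update path locally keeps the
 * signaling protocol consistent instead of skipping it. */
static void sync_complete_locally(sync_sketch_t *s, int pending)
{
    if (0 == pending) {
        s->signaling = true;
        atomic_store(&s->count, 0);   /* nothing left to wait for */
        s->signaling = false;
    }
}
```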

@artpol84
Copy link
Contributor

@bosilca I think that your comment #1820 (comment) is reasonable for MPI_Waitany only. In my understanding it is related to this PR because I want commit 42e6251 to go along with this PR.
I will fix that now. For MPI_Waitsome I believe we are fine.

@artpol84
Copy link
Contributor

@hjelmn I rebased my PR hjelmn#9 and added the fix for the problem that @bosilca pointed out; please don't rebase until we agree here.

@bosilca please have a look.

@artpol84
Copy link
Contributor

@bosilca btw, I solved the problem with Waitany with sync_sets/unsets instead of sync_update. I can do the sync_update, but with sync_sets/unsets we avoid one atomic. What do you think?

@artpol84
Copy link
Contributor

artpol84 commented Jun 28, 2016

ohh, and according to the backtrace I had when I was debugging:
open-mpi/ompi-release#1240 (comment)

sync_wait will then try to use the condition variable, which takes an internal lock (see the backtrace of the second thread).
So I guess we want to avoid the sync_wait call when possible.

@artpol84
Copy link
Contributor

@bosilca @hjelmn I'll be on today's OMPI call available for discussion.

@bosilca
Copy link
Member

bosilca commented Jun 28, 2016

I won't be able to join the call today.

@artpol84
Copy link
Contributor

I think I've got all your points now. Will update the PRs later today. Thank you!


@artpol84
Copy link
Contributor

Again updated the PR hjelmn#9 with slightly improved MPI_Waitsome performance, according to @bosilca's suggestion.

@jladd-mlnx
Copy link
Member

@hjelmn @artpol84 @bosilca Do we have any idea why it's consistently hanging in init when using RDMACM? This is new behavior.

@artpol84
Copy link
Contributor

I don't think it's related. The problem is in MPI_Init; it's unlikely to be request-related.


@hjelmn
Copy link
Member Author

hjelmn commented Jun 28, 2016

Well, it shouldn't be crashing. It should be exiting. This is the problem:

06:44:23 --------------------------------------------------------------------------
06:44:23 No OpenFabrics connection schemes reported that they were able to be
06:44:23 used on a specific port.  As such, the openib BTL (OpenFabrics
06:44:23 support) will be disabled for this port.
06:44:23 
06:44:23   Local host:           jenkins01
06:44:23   Local device:         mlx5_0
06:44:23   Local port:           1
06:44:23   CPCs attempted:       rdmacm
06:44:23 --------------------------------------------------------------------------

rdmacm is working for me. Not sure why it is failing to load on Mellanox Jenkins.

@hjelmn
Copy link
Member Author

hjelmn commented Jun 28, 2016

Unless that is the other port again. Maybe the openib btl is not correctly disqualifying that port?

@hjelmn
Copy link
Member Author

hjelmn commented Jun 28, 2016

BTW, if you need udcm to work with a router it wouldn't take much work. No need to use rdmacm for that :).

@artpol84
Copy link
Contributor

I will check tomorrow.

(request handling related)
@artpol84
Copy link
Contributor

@hjelmn could you accept my PR?
@bosilca approved it.

Request handling: fix MPI_Waitany and MPI_Waitsome
@hjelmn
Copy link
Member Author

hjelmn commented Jun 28, 2016

@artpol84 Will merge after jenkins finishes.

@jsquyres
Copy link
Member

@hjelmn @artpol84 We're less than 45 minutes from the call -- do you want to hold off on merging whichever PRs are going to get merged until we can do final coordination on the phone?

@artpol84
Copy link
Contributor

Sure.
I was talking about my PR to @hjelmn branch. Let's talk.

@hjelmn hjelmn merged commit 955269b into open-mpi:master Jun 28, 2016
Successfully merging this pull request may close these issues.

Overlap test sometimes failing in Mellanox Jenkins with Yalla
5 participants