
opal/sync: fix race condition #1816


Merged
merged 4 commits into from
Jun 28, 2016

Conversation

hjelmn
Copy link
Member

@hjelmn hjelmn commented Jun 24, 2016

This commit fixes a race condition discovered by @artpol84. The race
happens when a signalling thread decrements the sync count to 0 then
goes to sleep. If the waiting thread runs and detects the count == 0
before going to sleep on the condition variable it will destroy the
condition variable while the signalling thread is potentially still
processing the completion. The fix is to add a non-atomic member to
the sync structure that indicates another thread is handling
completion. Since the member will only be set to false by the
initiating thread and the completing thread, the variable does not
need to be protected. When destroying a condition variable the
waiting thread needs to wait until the signalling thread is finished.

Thanks to @artpol84 for tracking this down.

Fixes #1813

Signed-off-by: Nathan Hjelm [email protected]
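The pattern described in the commit message can be sketched in portable C11 with pthreads. This is a hypothetical mirror of the idea, not the actual opal code (`wait_sync_t`, `wait_sync_update`, and `wait_sync_release` are illustrative names): the signaling thread raises a flag before touching the count, and the waiter refuses to destroy the condition variable until that flag drops.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical mirror of ompi_wait_sync_t: the extra "signaling" flag
 * tells the waiter that a signaling thread is still inside the
 * completion path, so the condition variable must not be destroyed yet. */
typedef struct {
    atomic_int      count;
    volatile bool   signaling;
    pthread_mutex_t lock;
    pthread_cond_t  condition;
} wait_sync_t;

/* Waiter side: even if count already reached 0, spin until the
 * signaling thread has left the completion path before destroying. */
static void wait_sync_release(wait_sync_t *sync)
{
    while (sync->signaling) { /* busy-wait; real code may yield */ }
    pthread_mutex_destroy(&sync->lock);
    pthread_cond_destroy(&sync->condition);
}

/* Signaler side: mark ourselves busy before the decrement and clear
 * the flag only once the cond_signal is done. */
static void wait_sync_update(wait_sync_t *sync, int updates)
{
    sync->signaling = true;
    /* order the flag store before the decrement becomes visible */
    atomic_thread_fence(memory_order_release);
    if (atomic_fetch_sub(&sync->count, updates) - updates > 0) {
        sync->signaling = false;
        return;
    }
    pthread_mutex_lock(&sync->lock);
    pthread_cond_signal(&sync->condition);
    pthread_mutex_unlock(&sync->lock);
    sync->signaling = false;
}
```

With this shape, the interleaving from the bug report (waiter sees count == 0 and frees the sync while the signaler is still signaling) is closed: the waiter cannot pass `wait_sync_release` while `signaling` is still true.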

@hjelmn
Copy link
Member Author

hjelmn commented Jun 24, 2016

@artpol84 Getting this running through Jenkins. Will try to get it retested multiple times to see if it fails. In theory it shouldn't.

@jladd-mlnx
Copy link
Member

bot:retest

@hjelmn
Copy link
Member Author

hjelmn commented Jun 24, 2016

:bot:retest:

1 similar comment
@hjelmn
Copy link
Member Author

hjelmn commented Jun 24, 2016

:bot:retest:

@jladd-mlnx
Copy link
Member

Retest after adding the original Jenkins command line with ob1 (allowing UCX to load and unload).

bot:retest

@hjelmn
Copy link
Member Author

hjelmn commented Jun 25, 2016

I can make this even faster as the lock/unlock is not really needed for the condition signal here. Will leave that for later though.

@hjelmn
Copy link
Member Author

hjelmn commented Jun 25, 2016

Looking good. Sending through again.

:bot:retest:

@artpol84
Copy link
Contributor

artpol84 commented Jun 25, 2016

@hjelmn I guess the race condition is still possible here:

1. (sync)->signalling is initialized to false, and this is correct because you may wait late (after all the wait_sync_update calls have happened).
2. Assume instruction reordering in wait_sync_update (correct me if I'm wrong, but it's possible):

count | PML progress thread                 | MPI_Wait thread
------+-------------------------------------+-----------------------------------------------
  1   | OPAL_THREAD_ADD32(&sync->count, -1) |
  0   |                                     | if( sync->count > 0 ) - false => exit the loop
  0   |                                     | runs till the end and frees sync
  0   | sync->signalling = true             |
  0   | WAIT_SYNC_SIGNAL(sync); - error     |

Same error here. I also assume there will be problems with several independently running PML progress threads.

So it seems that you need a memory fence here:

static inline void wait_sync_update(ompi_wait_sync_t *sync, int updates, int status)
{
    sync->signalling = true;

    mem_fence(); // <---------

    if( OPAL_LIKELY(OPAL_SUCCESS == status) ) {
        if( 0 != (OPAL_THREAD_ADD32(&sync->count, -updates)) ) {
            return;
        }
    } else {
        /* this is an error path so just use the atomic */
        opal_atomic_swap_32 (&sync->count, 0);
        sync->status = OPAL_ERROR;
    }
    WAIT_SYNC_SIGNAL(sync);
}

This race will be much rarer than the original one, and harder to track.
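The hazard above can equivalently be closed with a release-ordered decrement instead of a standalone fence. A minimal C11 sketch of the idea (hypothetical names, not the proposed opal patch):

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int    count;
    volatile bool signaling;   /* plain store, ordered by the release below */
} sync_sketch_t;

/* The release ordering on the decrement guarantees that any thread
 * observing the new count (with acquire semantics) also observes
 * signaling == true, so it cannot free the structure too early.
 * Returns the count remaining after the update. */
static int sync_update_release(sync_sketch_t *s, int updates)
{
    s->signaling = true;
    return atomic_fetch_sub_explicit(&s->count, updates,
                                     memory_order_release) - updates;
}
```

Both the fence and the release-ordered RMW constrain the compiler as well as the CPU, which is what the reordering scenario above requires.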

@hjelmn
Copy link
Member Author

hjelmn commented Jun 25, 2016

Instruction reordering will not happen. pthread calls are by definition effectively a memory barrier. We could add an isync just to be safe if you think it is warranted. We have isync for both ppc and arm, which are the supported platforms that do instruction reordering.

@hjelmn
Copy link
Member Author

hjelmn commented Jun 25, 2016

Oh wait, before the atomic. Sorry, I didn't catch that part when looking at the email version. You are absolutely correct.

@hjelmn
Copy link
Member Author

hjelmn commented Jun 25, 2016

An atomic wmb would be fine there. An isync as I suggested might be too weak.

@artpol84
Copy link
Contributor

The setting of the signal and the changing of the count don't go through pthread calls; those come later.


@hjelmn
Copy link
Member Author

hjelmn commented Jun 25, 2016

@artpol84 Yup. Just caught that after hitting comment :)

@hjelmn
Copy link
Member Author

hjelmn commented Jun 25, 2016

Will add the barrier now.

@hjelmn
Copy link
Member Author

hjelmn commented Jun 25, 2016

And correct to the more common signaling spelling :D

@hjelmn hjelmn force-pushed the request_perfm_regression branch from 7c30bff to d204f67 Compare June 25, 2016 03:42
@hjelmn
Copy link
Member Author

hjelmn commented Jun 25, 2016

@artpol84 Thanks for catching that. Should be fixed now.

@hjelmn
Copy link
Member Author

hjelmn commented Jun 25, 2016

Won't commit this until I get a +1 from @artpol84.

Looks like we had a finalize hang. Looks unrelated. @artpol84 Any clue as to what happened? The hang wasn't even in a threading test. (edit: Looks like an rdmacm bug. I don't own that code but I will try to take a look.)

@hjelmn
Copy link
Member Author

hjelmn commented Jun 25, 2016

:bot:retest:

@hjelmn
Copy link
Member Author

hjelmn commented Jun 25, 2016

@artpol84 Feel free to open a PR vs my branch and add the Mellanox copyright. I prefer that the commit adding a copyright line come from a member of the copyrighting organization.

@@ -75,6 +89,9 @@ static inline int sync_wait_st (ompi_wait_sync_t *sync)
*/
static inline void wait_sync_update(ompi_wait_sync_t *sync, int updates, int status)
{
sync->signaling = true;
/* ensure the signalling value is commited before updating the count */
opal_atomic_wmb ();
Copy link
Contributor


@hjelmn I may be wrong, but it seems that wmb addresses the possibility of CPU-level reordering. Do we also address compiler reordering?

Copy link
Contributor


I mean that adding something like asm volatile("" ::: "memory"); may be needed.

Copy link
Contributor


Ok, this is already done in wmb()
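For reference, this is the kind of compiler barrier being discussed (gcc-style inline asm; a minimal sketch with made-up names, not the opal implementation — opal_atomic_wmb() additionally orders stores at the CPU level on weakly ordered architectures):

```c
/* A pure compiler barrier: no CPU fence instruction is emitted, but
 * the compiler may not move memory accesses across it.  Hardware write
 * barriers typically include this clobber as well. */
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

static int data, flag;

/* Publish data, then raise the flag; the barrier keeps the compiler
 * from reordering the two stores. */
static void publish(int value)
{
    data = value;
    COMPILER_BARRIER();
    flag = 1;
}
```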

@hjelmn hjelmn force-pushed the request_perfm_regression branch from d204f67 to fb455f0 Compare June 27, 2016 02:14
@artpol84
Copy link
Contributor

artpol84 commented Jun 27, 2016

Looks like waitany and waitsome are broken. I filed a PR against @hjelmn's branch:
hjelmn@42e6251

This can be easily reproduced with the following modifications of the overlap test:
overlap_waitany.txt
overlap_waitsome.txt

@hjelmn @bosilca please, check.

@artpol84
Copy link
Contributor

@jladd-mlnx @miked-mellanox I think we (Mellanox) will need to include these tests in our jenkins/MTT suite.
They exercise both corner cases: late completion and early completion of all requests.
We can also add versions without delay to exercise other possible paths.
Since these are race conditions, multiple executions are probably needed.

@artpol84
Copy link
Contributor

@jsquyres, sure, we can do that as well.

@hjelmn
Copy link
Member Author

hjelmn commented Jun 27, 2016

@bosilca, @artpol84 Are we all OK with this change the way it is? @artpol84 You can close your PR. Due to a rebase it would no longer merge, so I cherry-picked the commit.

@artpol84
Copy link
Contributor

@hjelmn, what do you mean? I don't see the waitsome/waitany fix in this PR.

@bosilca
Copy link
Member

bosilca commented Jun 27, 2016

I think this patch fixes the original issue, and as such it is ready to go. The performance issues discovered meanwhile should be fixed by a subsequent ticket, once we have the correct fix.

@artpol84
Copy link
Contributor

There were a) a bug and b) a performance issue.
I PRed a bugfix for waitsome and waitany here.
The perf issue is pending in another request.


@artpol84
Copy link
Contributor

hjelmn@42e6251
This is a bug fix

@artpol84
Copy link
Contributor

Without this bugfix the following tests (already mentioned above) hang 100% of the time:
overlap_waitany.txt
overlap_waitsome.txt

I think the PR must fix the original issue without introducing a new hang.

@bosilca
Copy link
Member

bosilca commented Jun 28, 2016

42e6251 introduces its own issues (I commented on it on the other issue about the possible race condition). A possible fix is to generate a sync_update (which will take care of the signaling field) in the wait* operation if there is nothing to wait for.
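A rough illustration of that suggestion, with hypothetical names (a sketch of the idea, not the actual opal code): when a wait* operation discovers there is nothing outstanding, it drives the same completion path a signaling thread would, so the signaling field is raised and lowered exactly as in the normal case and a later destroy cannot race.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int    count;
    volatile bool signaling;
} sync_sketch_t;

/* Hypothetical: called by MPI_Wait*-style code when it finds no
 * incomplete requests.  Running the update path locally keeps the
 * signaling protocol consistent instead of skipping it. */
static void sync_complete_locally(sync_sketch_t *s, int pending)
{
    if (0 == pending) {
        s->signaling = true;
        atomic_store(&s->count, 0);   /* nothing left to wait for */
        s->signaling = false;
    }
}
```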

@artpol84
Copy link
Contributor

@bosilca I think that your comment #1820 (comment) is reasonable for MPI_Waitany only. In my understanding it is related to this PR because I want commit 42e6251 to go along with this PR.
I will fix that now. For MPI_Waitsome I believe we are fine.

@artpol84
Copy link
Contributor

@hjelmn I rebased my PR hjelmn#9 and added the fix for the problem that @bosilca pointed out; please don't rebase until we agree here.

@bosilca please have a look.

@artpol84
Copy link
Contributor

@bosilca btw, I solved the problem with Waitany with sync_sets/unsets instead of sync_update. I can do the sync_update, but with sync_sets/unsets we avoid one atomic. What do you think?

@artpol84
Copy link
Contributor

artpol84 commented Jun 28, 2016

ohh, and according to the backtrace I had when I was debugging:
open-mpi/ompi-release#1240 (comment)

sync_wait will then try to use the condition variable, which takes an internal lock (see the backtrace of the second thread).
So I guess we want to avoid the sync_wait call when possible.

@artpol84
Copy link
Contributor

@bosilca @hjelmn I'll be on today's OMPI call available for discussion.

@bosilca
Copy link
Member

bosilca commented Jun 28, 2016

I won't be able to join the call today.

@artpol84
Copy link
Contributor

I think I've got all your points now. Will update the PRs later today. Thank you!


@artpol84
Copy link
Contributor

Again updated the PR hjelmn#9 with slightly improved MPI_Waitsome performance, according to @bosilca's suggestion.

@jladd-mlnx
Copy link
Member

@hjelmn @artpol84 @bosilca Do we have any idea why it's consistently hanging in init when using RDMACM? This is new behavior.

@artpol84
Copy link
Contributor

I don't think it's related. The problem is in MPI_Init; it's unlikely to be request-related.


@hjelmn
Copy link
Member Author

hjelmn commented Jun 28, 2016

Well, it shouldn't be crashing. It should be exiting. This is the problem:

06:44:23 --------------------------------------------------------------------------
06:44:23 No OpenFabrics connection schemes reported that they were able to be
06:44:23 used on a specific port.  As such, the openib BTL (OpenFabrics
06:44:23 support) will be disabled for this port.
06:44:23 
06:44:23   Local host:           jenkins01
06:44:23   Local device:         mlx5_0
06:44:23   Local port:           1
06:44:23   CPCs attempted:       rdmacm
06:44:23 --------------------------------------------------------------------------

rdmacm is working for me. Not sure why it is failing to load on Mellanox Jenkins.

@hjelmn
Copy link
Member Author

hjelmn commented Jun 28, 2016

Unless that is the other port again. Maybe the openib btl is not correctly disqualifying that port?

@hjelmn
Copy link
Member Author

hjelmn commented Jun 28, 2016

BTW, if you need udcm to work with a router it wouldn't take much work. No need to use rdmacm for that :).

@artpol84
Copy link
Contributor

I will check tomorrow.

(request handling related)
@artpol84
Copy link
Contributor

@hjelmn could you accept my PR?
@bosilca approved it.

Request handling: fix MPI_Waitany and MPI_Waitsome
@hjelmn
Copy link
Member Author

hjelmn commented Jun 28, 2016

@artpol84 Will merge after jenkins finishes.

@jsquyres
Copy link
Member

@hjelmn @artpol84 We're less than 45 minutes from the call -- do you want to hold off on merging whichever PRs are going to get merged until we can do final coordination on the phone?

@artpol84
Copy link
Contributor

Sure.
I was talking about my PR to @hjelmn branch. Let's talk.

@hjelmn hjelmn merged commit 955269b into open-mpi:master Jun 28, 2016
Successfully merging this pull request may close these issues.

Overlap test sometimes failing in Mellanox Jenkins with Yalla
5 participants