ompi request handling race condition fix (MT-case) #1815


Closed
artpol84 wants to merge 1 commit from the fix_req branch

Conversation

@artpol84 (Contributor)

Described in #1813

@artpol84 (Contributor, Author)

@hjelmn @bosilca @jladd-mlnx @jsquyres @hppritcha

This code still needs to be verified, but it would be good if @hjelmn and @bosilca could review it in advance.

@hjelmn (Member) commented Jun 24, 2016

I think this needs to be rethought. We do not want a lock in the critical path. Will take a look in a bit.

@artpol84 (Contributor, Author) commented Jun 24, 2016

@hjelmn, sounds reasonable.
Right now the lock is taken in all cases except when sync->count is greater than updates:
https://github.com/open-mpi/ompi/blob/master/opal/threads/wait_sync.h#L87
If we want to preserve this behavior, I can rewrite the code like this (just to provide an idea):

static inline void wait_sync_update(ompi_wait_sync_t *sync, int updates, int status)
{
    /* Fast path: lock-free cmpset while the count stays above zero.
     * (OPAL_ATOMIC_CMPSET_X is a placeholder for the width-appropriate
     * cmpset; this is a sketch, not final code.) */
    if( (sync->count - updates > 0) && OPAL_LIKELY(OPAL_SUCCESS == status) ){
        if( OPAL_ATOMIC_CMPSET_X(&sync->count, sync->count, sync->count - updates) ){
            return;
        }
    }

    /* Slow path: only the last updater (or an error) takes the lock */
    WAIT_SYNC_LOCK(sync);
    if( OPAL_LIKELY(OPAL_SUCCESS == status) ) {
        if( 0 != (OPAL_THREAD_ADD32(&sync->count, -updates)) ) {
            goto unlock;
        }
    } else {
        /* this is an error path so just use the atomic */
        opal_atomic_swap_32 (&sync->count, 0);
        sync->status = OPAL_ERROR;
    }
    WAIT_SYNC_SIGNAL_UNLOCK(sync);
    return;
unlock:
    WAIT_SYNC_UNLOCK(sync);
}

@rhc54 (Contributor) commented Jun 24, 2016

If this is purely a yalla requirement, then perhaps it would be better to put the lock in the yalla path instead of in all paths?

@hjelmn (Member) commented Jun 24, 2016

Not sure. I don't see how this race is possible but I am looking into it.

@artpol84 (Contributor, Author) commented Jun 24, 2016

This race is possible for ALL PMLs, and I was able to reproduce it only on the Jenkins server, not on other hosts, so it is quite rare. That means you definitely don't want to be debugging this on the customer side.

@hjelmn (Member) commented Jun 24, 2016

But we certainly cannot have a lock in this function if there is any way to avoid it.

@artpol84 (Contributor, Author)

@hjelmn, the lock was there originally; I just proposed a comparable solution.

@artpol84 (Contributor, Author)

@hjelmn, I described in detail how this race arises in #1813. Please have a look.
Right now I have 2500 successful runs of the overlap test (still running), compared to 5-10 before the fix.

@artpol84 (Contributor, Author)

Ha! Jenkins likes my patch!

@artpol84 (Contributor, Author)

I will update the PR with the fast path over the weekend.

@hjelmn (Member) commented Jun 24, 2016

@artpol84 The locking is fundamentally different with your change. If an MPI_Wait* call is waiting on multiple requests, the lock is now obtained on each and every request completion instead of just on the last one. This will reduce multi-threaded message rates.
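To make that cost concrete, here is a minimal standalone pthreads sketch (hypothetical names, not OMPI code; the lock_every_time switch is invented for illustration) contrasting the two shapes for a waitall-style completion counter: one takes the mutex on every one of the N completions, the other only in the thread whose decrement reaches zero.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NUM_REQUESTS 8

static atomic_int count = NUM_REQUESTS;   /* completions still outstanding */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

static void *complete_one(void *arg)
{
    int lock_every_time = *(int *)arg;

    if (lock_every_time) {
        /* Shape of the PR as posted: every completer takes the mutex. */
        pthread_mutex_lock(&lock);
        if (0 == atomic_fetch_sub(&count, 1) - 1)
            pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    } else if (0 == atomic_fetch_sub(&count, 1) - 1) {
        /* Original shape: only the thread whose decrement reaches zero
         * touches the mutex; the other N-1 completers stay lock-free. */
        pthread_mutex_lock(&lock);
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_REQUESTS];
    int lock_every_time = 0;   /* flip to 1 to emulate the patched shape */

    for (int i = 0; i < NUM_REQUESTS; i++)
        pthread_create(&threads[i], NULL, complete_one, &lock_every_time);

    /* Waiter, analogous to sync_wait_mt(): predicate re-checked under lock. */
    pthread_mutex_lock(&lock);
    while (atomic_load(&count) > 0)
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);

    for (int i = 0; i < NUM_REQUESTS; i++)
        pthread_join(threads[i], NULL);
    puts("all requests completed");
    return 0;
}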

@artpol84 (Contributor, Author)

No, with this solution the lock is acquired only when sync->count is about to reach 0, which means that only the last thread is affected by the lock.
At least this is what I intend to implement.

@artpol84 (Contributor, Author)

We take this fast path:

    // Fast path:
    if( (sync->count - updates > 0) && OPAL_LIKELY(OPAL_SUCCESS == status) ){
        if( OPAL_ATOMIC_CMPSET_X(&sync->count, sync->count, sync->count - updates) ){
            return;
        }
    }

only when the decrement will not bring the count to 0, and we fall through to the slow path when this update is the last one for sync->count.

artpol84 added the bug label on Jun 24, 2016
artpol84 added this to the v2.0.0 milestone on Jun 24, 2016
@hjelmn (Member) commented Jun 24, 2016

Having a cmpset will not help if there is contention on the update: it will fail and become yet another locking atomic slowing the critical path. Not sure we can do better, but I want to give it some thought.

@jladd-mlnx (Member)

I vote we merge.

@artpol84 (Contributor, Author)

We can iterate there; that should cover most cases.

@hjelmn (Member) commented Jun 24, 2016

@jladd-mlnx Please back off.

@jladd-mlnx (Member)

@hppritcha What's your vote?

@jladd-mlnx (Member)

@jsquyres what's your vote? This fixes the issue. We spent a lot of time debugging this at the community's request.

@artpol84 (Contributor, Author)

@hjelmn, I think this will address your concerns:

static inline void wait_sync_update(ompi_wait_sync_t *sync, int updates, int status)
{
    /* Fast path: retry the cmpset under contention instead of
     * immediately falling back to the lock */
    while( (sync->count - updates > 0) && OPAL_LIKELY(OPAL_SUCCESS == status) ){
        if( OPAL_ATOMIC_CMPSET_X(&sync->count, sync->count, sync->count - updates) ){
            return;
        }
    }

    /* Slow path: only the last updater (or an error) takes the lock */
    WAIT_SYNC_LOCK(sync);
    if( OPAL_LIKELY(OPAL_SUCCESS == status) ) {
        if( 0 != (OPAL_THREAD_ADD32(&sync->count, -updates)) ) {
            goto unlock;
        }
    } else {
        /* this is an error path so just use the atomic */
        opal_atomic_swap_32 (&sync->count, 0);
        sync->status = OPAL_ERROR;
    }
    WAIT_SYNC_SIGNAL_UNLOCK(sync);
    return;
unlock:
    WAIT_SYNC_UNLOCK(sync);
}

@artpol84 (Contributor, Author)

I spent two days debugging your code, @hjelmn. I want to own this fix.

@hjelmn (Member) commented Jun 24, 2016

@artpol84 Not my code.

@jladd-mlnx (Member)

@hjelmn Then why the resistance to merge?

@artpol84 (Contributor, Author)

$ git bisect good
b4ff061bd45705eb39a91976fad855315ba990cc is the first bad commit
commit b4ff061bd45705eb39a91976fad855315ba990cc
Author: Nathan Hjelm <[email protected]>
Date:   Wed May 25 15:42:53 2016 -0600

    pml/yalla: update for request changes

    This commit brings the pml/yalla component up to date with the request
    rework changes.

    Signed-off-by: Nathan Hjelm <[email protected]>

    (cherry picked from open-mpi/ompi@9d439664f074a690ff8c550f9ef489bd70520a0f)

    Signed-off-by: Nathan Hjelm <[email protected]>

:040000 040000 c8047a13774043cf31793730b23f9fcc3cfeb1d6 38e5a2e275229f11f92612f7f1bc541cd9147017 M      ompi

@gpaulsen (Member)

@nysal, can you please review and comment as well?

@artpol84 (Contributor, Author)

@gpaulsen Considering the updates I mentioned in the comments. Any change needs to go through my Jenkins anyway (the only host that reproduces the issue).

@artpol84 (Contributor, Author)

@rhc54 sure!

@artpol84 (Contributor, Author)

@jsquyres, agreed.

@jladd-mlnx (Member)

Once again, thanks, Mellanox Jenkins.

@artpol84 (Contributor, Author)

Let me update and test the code, and let's discuss afterwards.
This needs to go into the nearest release. What are the deadlines?

@hjelmn (Member) commented Jun 24, 2016

This is sufficient to defeat the race:

diff --git a/opal/threads/wait_sync.h b/opal/threads/wait_sync.h
index b29f01c..1e9b0e9 100644
--- a/opal/threads/wait_sync.h
+++ b/opal/threads/wait_sync.h
@@ -24,6 +24,7 @@ typedef struct ompi_wait_sync_t {
     pthread_mutex_t lock;
     struct ompi_wait_sync_t *next;
     struct ompi_wait_sync_t *prev;
+    bool signalling;
 } ompi_wait_sync_t;

 #define REQUEST_PENDING        (void*)0L
@@ -33,14 +34,16 @@ typedef struct ompi_wait_sync_t {

 #define WAIT_SYNC_RELEASE(sync)                       \
     if (opal_using_threads()) {                       \
-       pthread_cond_destroy(&(sync)->condition);      \
-       pthread_mutex_destroy(&(sync)->lock);          \
+        while ((sync)->signalling);                   \
+        pthread_cond_destroy(&(sync)->condition);     \
+        pthread_mutex_destroy(&(sync)->lock);         \
     }

 #define WAIT_SYNC_SIGNAL(sync)                        \
     if (opal_using_threads()) {                       \
         pthread_mutex_lock(&(sync->lock));            \
         pthread_cond_signal(&sync->condition);        \
+        sync->signalling = false;                     \
         pthread_mutex_unlock(&(sync->lock));          \
     }

@@ -61,6 +64,7 @@ static inline int sync_wait_st (ompi_wait_sync_t *sync)
         (sync)->next = NULL;                                    \
         (sync)->prev = NULL;                                    \
         (sync)->status = 0;                                     \
+        (sync)->signalling = false;                             \
         if (opal_using_threads()) {                             \
             pthread_cond_init (&(sync)->condition, NULL);       \
             pthread_mutex_init (&(sync)->lock, NULL);           \
@@ -75,6 +79,7 @@ static inline int sync_wait_st (ompi_wait_sync_t *sync)
  */
 static inline void wait_sync_update(ompi_wait_sync_t *sync, int updates, int status)
 {
+    sync->signalling = true;
     if( OPAL_LIKELY(OPAL_SUCCESS == status) ) {
         if( 0 != (OPAL_THREAD_ADD32(&sync->count, -updates)) ) {
             return;

Fixed a couple typos.
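Reading the diff as three cooperating pieces may help: wait_sync_update() raises signalling before it touches the count; WAIT_SYNC_SIGNAL drops the flag under the lock, only after pthread_cond_signal has been issued; and WAIT_SYNC_RELEASE spins until the flag is clear before destroying the mutex and condition variable. The waiter therefore cannot free the synchronization objects while a completer is still inside the signal sequence, and nothing is added to the lock-free fast path.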

@artpol84 (Contributor, Author)

@hjelmn, have you verified that?

@hjelmn (Member) commented Jun 24, 2016

@artpol84 There is no way it wouldn't. The race is the sync getting destroyed before the signaling has finished. By marking the sync as being in the process of being signaled, the race should be closed without using a single extra atomic. Extra atomics and locks were my resistance to the fix as presented. Otherwise it was fantastic work.
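As a minimal sketch of the interleaving being described (hypothetical names, not OMPI code, assuming the waiter mirrors sync_wait_mt() plus WAIT_SYNC_RELEASE): because the counter is decremented outside the lock, the waiter can observe count == 0, skip the condition wait entirely, and destroy the sync while the completer is still on its way into pthread_mutex_lock(). The signalling flag closes exactly that window.

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {              /* stripped-down stand-in for ompi_wait_sync_t */
    atomic_int      count;
    atomic_bool     signalling;
    pthread_mutex_t lock;
    pthread_cond_t  condition;
} toy_sync_t;

/* Completion side, shaped like wait_sync_update() with the fix applied. */
static void *toy_update(void *arg)
{
    toy_sync_t *sync = arg;

    atomic_store(&sync->signalling, true);   /* raise flag BEFORE the decrement */
    if (0 != atomic_fetch_sub(&sync->count, 1) - 1)
        return NULL;                         /* not the last completion */
    /* From this point the waiter may already see count == 0 and try to
     * destroy the sync; the flag keeps it spinning until we are done. */
    pthread_mutex_lock(&sync->lock);
    pthread_cond_signal(&sync->condition);
    atomic_store(&sync->signalling, false);  /* let the waiter proceed, as in the diff above */
    pthread_mutex_unlock(&sync->lock);
    return NULL;
}

/* Waiter side, shaped like sync_wait_mt() followed by WAIT_SYNC_RELEASE. */
static void toy_wait_and_release(toy_sync_t *sync)
{
    pthread_mutex_lock(&sync->lock);
    while (atomic_load(&sync->count) > 0)
        pthread_cond_wait(&sync->condition, &sync->lock);
    pthread_mutex_unlock(&sync->lock);

    /* Without this spin, the destroy calls below can run while a completer
     * is still between its decrement and its signal -- the race at hand. */
    while (atomic_load(&sync->signalling))
        ;
    pthread_cond_destroy(&sync->condition);
    pthread_mutex_destroy(&sync->lock);
}

int main(void)
{
    toy_sync_t sync = { .lock = PTHREAD_MUTEX_INITIALIZER,
                        .condition = PTHREAD_COND_INITIALIZER };
    pthread_t completer;

    atomic_init(&sync.count, 1);             /* one outstanding request */
    atomic_init(&sync.signalling, false);

    pthread_create(&completer, NULL, toy_update, &sync);
    toy_wait_and_release(&sync);
    pthread_join(completer, NULL);
    return 0;
}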

@jsquyres (Member)

Once again, thanks to the overall community for developing a great code base with contributions from many parties. None of it would be possible without _all_ of us: those who write the code, those who review the code, those who test the code, those who track down and fix bugs in the code, and so on.

I think the technical discussions today have been excellent; I appreciate @artpol84 and @hjelmn working to figure out the root cause and come up with a good solution that works for everyone.

@hppritcha (Member)

@artpol84 Could you go ahead with the update you propose? Then have @bosilca review it.

@artpol84 (Contributor, Author)

@hppritcha OK, I will do that tomorrow; it is too late for that now.

@jsquyres (Member)

Cross-reference for the GitHub connection: #1816 is where @hjelmn made a PR with his suggestion, to run it through the Mellanox Jenkins.

artpol84 force-pushed the fix_req branch 2 times, most recently from 127f701 to 689199c on June 25, 2016 07:21
@artpol84 (Contributor, Author)

@hppritcha I have updated the PR as we agreed. It is under stress testing now; I will report back later.

artpol84 force-pushed the fix_req branch 2 times, most recently from 6e60cf7 to 65f222b on June 25, 2016 07:29
@artpol84 (Contributor, Author)

Sounds like the rdmacm problem @hjelmn was mentioning. However, I have the fix for it here.

@artpol84 (Contributor, Author)

bot:retest

@artpol84 (Contributor, Author)

@jladd-mlnx @bosilca @hjelmn
After investigating my original approach, it turns out that @hjelmn's approach is indeed better under concurrent access.
But I can suggest a new version that uses the existing sync->count instead of introducing a new variable. #1815 has been updated.
Advantages:

@hjelmn (Member) commented Jun 26, 2016

@artpol84 Nice. That will do it.

@hjelmn (Member) commented Jun 26, 2016

BTW, can you bring over the continue statement and the comment above WAIT_SYNC_RELEASE from my commit? My copyright is already up to date on this file.

@artpol84 (Contributor, Author)

Ok, will do tomorrow.


@bosilca (Member) commented Jun 26, 2016

@artpol84 I have a strong doubt about this patch: the count can be negative for waitany and waitsome.

@hjelmn (Member) commented Jun 27, 2016

@bosilca is right. The count can go negative in waitany and waitsome. There may be no way to do this without an extra member.
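A minimal sketch of the objection (assuming, as in the waitany/waitsome paths, that one sync with count = 1 is attached to many requests; the two decrements are sequential here for determinism, while in reality they race):

#include <stdatomic.h>
#include <stdio.h>

int main(void)
{
    atomic_int count;
    atomic_init(&count, 1);   /* waitany: a single completion suffices */

    /* Two attached requests complete back to back, each doing the
     * equivalent of OPAL_THREAD_ADD32(&sync->count, -1): */
    int after_first  = atomic_fetch_sub(&count, 1) - 1;
    int after_second = atomic_fetch_sub(&count, 1) - 1;

    printf("after first completion:  %d\n", after_first);   /* 0  */
    printf("after second completion: %d\n", after_second);  /* -1 */

    /* Any scheme that treats "count reached 0" as "I am the unique last
     * updater", or that spins a cmpset while count - updates > 0, breaks
     * once the counter can legitimately overshoot zero like this. */
    return 0;
}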

@artpol84 (Contributor, Author)

@bosilca thank you. I was mostly considering the wait_all case, since that is what I was debugging, and I suspected this might not work for the others.
I am closing this PR; let's stick with #1816.

artpol84 closed this on Jun 27, 2016