ompi request handling race condition fix (MT-case) #1815


Closed
artpol84 wants to merge 1 commit from the fix_req branch

Conversation

@artpol84 (Contributor)

Described in #1813

@artpol84 (Contributor, Author)

@hjelmn @bosilca @jladd-mlnx @jsquyres @hppritcha

This code still needs to be verified, but it would be good if @hjelmn and @bosilca could review it in advance.

@hjelmn (Member) commented Jun 24, 2016

I think this needs to be rethought. We do not want a lock in the critical path. Will take a look in a bit.

@artpol84 (Contributor, Author) commented Jun 24, 2016

@hjelmn, sounds reasonable.
Right now the lock is taken in all cases except when sync->count is greater than updates:
https://github.com/open-mpi/ompi/blob/master/opal/threads/wait_sync.h#L87
If we want to preserve this behavior, I can rewrite the code like this (just to provide an idea):

static inline void wait_sync_update(ompi_wait_sync_t *sync, int updates, int status)
{
    /* Fast path: lock-free cmpset while the count stays above zero.
     * (OPAL_ATOMIC_CMPSET_X is a placeholder for the width-appropriate
     * cmpset; this is a sketch, not final code.) */
    if( (sync->count - updates > 0) && OPAL_LIKELY(OPAL_SUCCESS == status) ){
        if( OPAL_ATOMIC_CMPSET_X(&sync->count, sync->count, sync->count - updates) ){
            return;
        }
    }

    /* Slow path: only the last updater (or an error) takes the lock */
    WAIT_SYNC_LOCK(sync);
    if( OPAL_LIKELY(OPAL_SUCCESS == status) ) {
        if( 0 != (OPAL_THREAD_ADD32(&sync->count, -updates)) ) {
            goto unlock;
        }
    } else {
        /* this is an error path so just use the atomic */
        opal_atomic_swap_32 (&sync->count, 0);
        sync->status = OPAL_ERROR;
    }
    WAIT_SYNC_SIGNAL_UNLOCK(sync);
    return;
unlock:
    WAIT_SYNC_UNLOCK(sync);
}

@rhc54 (Contributor) commented Jun 24, 2016

If this is purely a yalla requirement, then perhaps it would be better to put the lock in the yalla path instead of in all paths?

@hjelmn (Member) commented Jun 24, 2016

Not sure. I don't see how this race is possible but I am looking into it.

@artpol84 (Contributor, Author) commented Jun 24, 2016

This race is possible for ALL PMLs, and I was able to reproduce it only on the Jenkins server, not on other hosts, so it is quite rare. That means you definitely don't want to be debugging this on the customer side.

@hjelmn (Member) commented Jun 24, 2016

But we certainly cannot have a lock in this function if there is any way to avoid it.

@artpol84 (Contributor, Author)

@hjelmn, the lock was there originally; I just proposed a comparable solution.

@artpol84 (Contributor, Author)

@hjelmn, I described in detail how this race arises in #1813. Please have a look.
Right now I have 2500 successful runs of the overlap test (still running), compared to 5-10 before the fix.

@artpol84 (Contributor, Author)

Ha! Jenkins likes my patch!

@artpol84 (Contributor, Author)

I will update the PR with the fast path over the weekend.

@hjelmn (Member) commented Jun 24, 2016

@artpol84 The locking is fundamentally different with your change. If an MPI_Wait* call is waiting on multiple requests, the lock is now obtained on each and every request completion instead of just on the last one. This will reduce multi-threaded message rates.
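To make that cost concrete, here is a minimal standalone pthreads sketch (hypothetical names, not OMPI code; the lock_every_time switch is invented for illustration) contrasting the two shapes for a waitall-style completion counter: one takes the mutex on every one of the N completions, the other only in the thread whose decrement reaches zero.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NUM_REQUESTS 8

static atomic_int count = NUM_REQUESTS;   /* completions still outstanding */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

static void *complete_one(void *arg)
{
    int lock_every_time = *(int *)arg;

    if (lock_every_time) {
        /* Shape of the PR as posted: every completer takes the mutex. */
        pthread_mutex_lock(&lock);
        if (0 == atomic_fetch_sub(&count, 1) - 1)
            pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    } else if (0 == atomic_fetch_sub(&count, 1) - 1) {
        /* Original shape: only the thread whose decrement reaches zero
         * touches the mutex; the other N-1 completers stay lock-free. */
        pthread_mutex_lock(&lock);
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_REQUESTS];
    int lock_every_time = 0;   /* flip to 1 to emulate the patched shape */

    for (int i = 0; i < NUM_REQUESTS; i++)
        pthread_create(&threads[i], NULL, complete_one, &lock_every_time);

    /* Waiter, analogous to sync_wait_mt(): predicate re-checked under lock. */
    pthread_mutex_lock(&lock);
    while (atomic_load(&count) > 0)
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);

    for (int i = 0; i < NUM_REQUESTS; i++)
        pthread_join(threads[i], NULL);
    puts("all requests completed");
    return 0;
}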

@artpol84 (Contributor, Author)

No, with this solution the lock is acquired only when sync->count is about to reach 0, which means that only the last thread is affected by the lock.
At least this is what I intend to implement.

@artpol84 (Contributor, Author)

We take this fast path:

    // Fast path:
    if( (sync->count - updates > 0) && OPAL_LIKELY(OPAL_SUCCESS == status) ){
        if( OPAL_ATOMIC_CMPSET_X(&sync->count, sync->count, sync->count - updates) ){
            return;
        }
    }

only when the decrement will not bring the count to 0, and we fall through to the slow path when this update is the last one for sync->count.

artpol84 added the bug label on Jun 24, 2016
artpol84 added this to the v2.0.0 milestone on Jun 24, 2016
@hjelmn (Member) commented Jun 24, 2016

Having a cmpset will not help if there is contention on the update: it will fail and become yet another locking atomic slowing the critical path. Not sure we can do better, but I want to give it some thought.

@jladd-mlnx (Member)

I vote we merge.

@artpol84 (Contributor, Author)

We can iterate there; that should cover most cases.

@hjelmn (Member) commented Jun 24, 2016

@jladd-mlnx Please back off.

@jladd-mlnx (Member)

@hppritcha What's your vote?

@jladd-mlnx (Member)

@jsquyres what's your vote? This fixes the issue. We spent a lot of time debugging this at the community's request.

@artpol84 (Contributor, Author)

@hjelmn, I think this will address your concerns:

static inline void wait_sync_update(ompi_wait_sync_t *sync, int updates, int status)
{
    /* Fast path: retry the cmpset under contention instead of
     * immediately falling back to the lock */
    while( (sync->count - updates > 0) && OPAL_LIKELY(OPAL_SUCCESS == status) ){
        if( OPAL_ATOMIC_CMPSET_X(&sync->count, sync->count, sync->count - updates) ){
            return;
        }
    }

    /* Slow path: only the last updater (or an error) takes the lock */
    WAIT_SYNC_LOCK(sync);
    if( OPAL_LIKELY(OPAL_SUCCESS == status) ) {
        if( 0 != (OPAL_THREAD_ADD32(&sync->count, -updates)) ) {
            goto unlock;
        }
    } else {
        /* this is an error path so just use the atomic */
        opal_atomic_swap_32 (&sync->count, 0);
        sync->status = OPAL_ERROR;
    }
    WAIT_SYNC_SIGNAL_UNLOCK(sync);
    return;
unlock:
    WAIT_SYNC_UNLOCK(sync);
}

@artpol84 (Contributor, Author)

I spent two days debugging your code, @hjelmn. I want to own this fix.

@hjelmn (Member) commented Jun 24, 2016

@artpol84 Not my code.

@jladd-mlnx (Member)

@hjelmn Then why the resistance to merge?

@artpol84 (Contributor, Author)

$ git bisect good
b4ff061bd45705eb39a91976fad855315ba990cc is the first bad commit
commit b4ff061bd45705eb39a91976fad855315ba990cc
Author: Nathan Hjelm <[email protected]>
Date:   Wed May 25 15:42:53 2016 -0600

    pml/yalla: update for request changes

    This commit brings the pml/yalla component up to date with the request
    rework changes.

    Signed-off-by: Nathan Hjelm <[email protected]>

    (cherry picked from open-mpi/ompi@9d439664f074a690ff8c550f9ef489bd70520a0f)

    Signed-off-by: Nathan Hjelm <[email protected]>

:040000 040000 c8047a13774043cf31793730b23f9fcc3cfeb1d6 38e5a2e275229f11f92612f7f1bc541cd9147017 M      ompi

@gpaulsen (Member)

@nysal, can you please review and comment as well?

@artpol84 (Contributor, Author)

@gpaulsen Considering the updates I mentioned in the comments. Any change needs to go through my Jenkins anyway (the only host that reproduces the issue).

@artpol84 (Contributor, Author)

@rhc54 sure!

@artpol84 (Contributor, Author)

@jsquyres, agreed.

@jladd-mlnx (Member)

Once again, thanks, Mellanox Jenkins.

@artpol84 (Contributor, Author)

Let me update and test the code, and let's discuss afterwards.
This needs to go into the nearest release. What are the deadlines?

@hjelmn (Member) commented Jun 24, 2016

This is sufficient to defeat the race:

diff --git a/opal/threads/wait_sync.h b/opal/threads/wait_sync.h
index b29f01c..1e9b0e9 100644
--- a/opal/threads/wait_sync.h
+++ b/opal/threads/wait_sync.h
@@ -24,6 +24,7 @@ typedef struct ompi_wait_sync_t {
     pthread_mutex_t lock;
     struct ompi_wait_sync_t *next;
     struct ompi_wait_sync_t *prev;
+    bool signalling;
 } ompi_wait_sync_t;

 #define REQUEST_PENDING        (void*)0L
@@ -33,14 +34,16 @@ typedef struct ompi_wait_sync_t {

 #define WAIT_SYNC_RELEASE(sync)                       \
     if (opal_using_threads()) {                       \
-       pthread_cond_destroy(&(sync)->condition);      \
-       pthread_mutex_destroy(&(sync)->lock);          \
+        while ((sync)->signalling);                   \
+        pthread_cond_destroy(&(sync)->condition);     \
+        pthread_mutex_destroy(&(sync)->lock);         \
     }

 #define WAIT_SYNC_SIGNAL(sync)                        \
     if (opal_using_threads()) {                       \
         pthread_mutex_lock(&(sync->lock));            \
         pthread_cond_signal(&sync->condition);        \
+        sync->signalling = false;                     \
         pthread_mutex_unlock(&(sync->lock));          \
     }

@@ -61,6 +64,7 @@ static inline int sync_wait_st (ompi_wait_sync_t *sync)
         (sync)->next = NULL;                                    \
         (sync)->prev = NULL;                                    \
         (sync)->status = 0;                                     \
+        (sync)->signalling = false;                             \
         if (opal_using_threads()) {                             \
             pthread_cond_init (&(sync)->condition, NULL);       \
             pthread_mutex_init (&(sync)->lock, NULL);           \
@@ -75,6 +79,7 @@ static inline int sync_wait_st (ompi_wait_sync_t *sync)
  */
 static inline void wait_sync_update(ompi_wait_sync_t *sync, int updates, int status)
 {
+    sync->signalling = true;
     if( OPAL_LIKELY(OPAL_SUCCESS == status) ) {
         if( 0 != (OPAL_THREAD_ADD32(&sync->count, -updates)) ) {
             return;

Fixed a couple typos.
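Reading the diff as three cooperating pieces may help: wait_sync_update() raises signalling before it touches the count; WAIT_SYNC_SIGNAL drops the flag under the lock, only after pthread_cond_signal has been issued; and WAIT_SYNC_RELEASE spins until the flag is clear before destroying the mutex and condition variable. The waiter therefore cannot free the synchronization objects while a completer is still inside the signal sequence, and nothing is added to the lock-free fast path.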

@artpol84 (Contributor, Author)

@hjelmn, have you verified that?

@hjelmn (Member) commented Jun 24, 2016

@artpol84 There is no way it wouldn't. The race is the sync getting destroyed before the signaling has finished. By marking the sync as being in the process of being signaled, the race should be closed without using a single extra atomic. Extra atomics and locks were my resistance to the fix as presented. Otherwise it was fantastic work.
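As a minimal sketch of the interleaving being described (hypothetical names, not OMPI code, assuming the waiter mirrors sync_wait_mt() plus WAIT_SYNC_RELEASE): because the counter is decremented outside the lock, the waiter can observe count == 0, skip the condition wait entirely, and destroy the sync while the completer is still on its way into pthread_mutex_lock(). The signalling flag closes exactly that window.

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {              /* stripped-down stand-in for ompi_wait_sync_t */
    atomic_int      count;
    atomic_bool     signalling;
    pthread_mutex_t lock;
    pthread_cond_t  condition;
} toy_sync_t;

/* Completion side, shaped like wait_sync_update() with the fix applied. */
static void *toy_update(void *arg)
{
    toy_sync_t *sync = arg;

    atomic_store(&sync->signalling, true);   /* raise flag BEFORE the decrement */
    if (0 != atomic_fetch_sub(&sync->count, 1) - 1)
        return NULL;                         /* not the last completion */
    /* From this point the waiter may already see count == 0 and try to
     * destroy the sync; the flag keeps it spinning until we are done. */
    pthread_mutex_lock(&sync->lock);
    pthread_cond_signal(&sync->condition);
    atomic_store(&sync->signalling, false);  /* let the waiter proceed, as in the diff above */
    pthread_mutex_unlock(&sync->lock);
    return NULL;
}

/* Waiter side, shaped like sync_wait_mt() followed by WAIT_SYNC_RELEASE. */
static void toy_wait_and_release(toy_sync_t *sync)
{
    pthread_mutex_lock(&sync->lock);
    while (atomic_load(&sync->count) > 0)
        pthread_cond_wait(&sync->condition, &sync->lock);
    pthread_mutex_unlock(&sync->lock);

    /* Without this spin, the destroy calls below can run while a completer
     * is still between its decrement and its signal -- the race at hand. */
    while (atomic_load(&sync->signalling))
        ;
    pthread_cond_destroy(&sync->condition);
    pthread_mutex_destroy(&sync->lock);
}

int main(void)
{
    toy_sync_t sync = { .lock = PTHREAD_MUTEX_INITIALIZER,
                        .condition = PTHREAD_COND_INITIALIZER };
    pthread_t completer;

    atomic_init(&sync.count, 1);             /* one outstanding request */
    atomic_init(&sync.signalling, false);

    pthread_create(&completer, NULL, toy_update, &sync);
    toy_wait_and_release(&sync);
    pthread_join(completer, NULL);
    return 0;
}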

@jsquyres (Member)

Once again, thanks to the overall community for developing a great code base with contributions from many parties. None of it would be possible without _all_ of us: those who write the code, those who review the code, those who test the code, those who track down and fix bugs in the code, and so on.

I think the technical discussions today have been excellent; I appreciate @artpol84 and @hjelmn working to figure out the root cause and come up with a good solution that works for everyone.

@hppritcha (Member)

@artpol84 Could you go ahead with the update you propose? Then have @bosilca review it.

@artpol84 (Contributor, Author)

@hppritcha OK, I will do that tomorrow; it is too late for that now.

@jsquyres (Member)

Cross-reference for the GitHub connection: #1816 is where @hjelmn made a PR with his suggestion, to run it through the Mellanox Jenkins.

artpol84 force-pushed the fix_req branch 2 times, most recently from 127f701 to 689199c on June 25, 2016 07:21
@artpol84 (Contributor, Author)

@hppritcha I have updated the PR as we agreed. It is under stress testing now; I will report back later.

artpol84 force-pushed the fix_req branch 2 times, most recently from 6e60cf7 to 65f222b on June 25, 2016 07:29
@artpol84 (Contributor, Author)

Sounds like the rdmacm problem @hjelmn was mentioning. However, I have the fix for it here.

@artpol84 (Contributor, Author)

bot:retest

@artpol84 (Contributor, Author)

@jladd-mlnx @bosilca @hjelmn
After investigating my original approach, it turns out that @hjelmn's approach is indeed better under concurrent access.
But I can suggest a new version that uses the existing sync->count instead of introducing a new variable. #1815 has been updated.
Advantages:

@hjelmn (Member) commented Jun 26, 2016

@artpol84 Nice. That will do it.

@hjelmn (Member) commented Jun 26, 2016

BTW, can you bring over the continue statement and the comment above WAIT_SYNC_RELEASE from my commit? My copyright is already up to date on this file.

@artpol84 (Contributor, Author)

Ok, will do tomorrow.


@bosilca (Member) commented Jun 26, 2016

@artpol84 I have a strong doubt about this patch: the count can be negative for waitany and waitsome.

@hjelmn (Member) commented Jun 27, 2016

@bosilca is right. The count can go negative in waitany and waitsome. There may be no way to do this without an extra member.
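A minimal sketch of the objection (assuming, as in the waitany/waitsome paths, that one sync with count = 1 is attached to many requests; the two decrements are sequential here for determinism, while in reality they race):

#include <stdatomic.h>
#include <stdio.h>

int main(void)
{
    atomic_int count;
    atomic_init(&count, 1);   /* waitany: a single completion suffices */

    /* Two attached requests complete back to back, each doing the
     * equivalent of OPAL_THREAD_ADD32(&sync->count, -1): */
    int after_first  = atomic_fetch_sub(&count, 1) - 1;
    int after_second = atomic_fetch_sub(&count, 1) - 1;

    printf("after first completion:  %d\n", after_first);   /* 0  */
    printf("after second completion: %d\n", after_second);  /* -1 */

    /* Any scheme that treats "count reached 0" as "I am the unique last
     * updater", or that spins a cmpset while count - updates > 0, breaks
     * once the counter can legitimately overshoot zero like this. */
    return 0;
}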

@artpol84 (Contributor, Author)

@bosilca thank you. I was mostly considering the wait_all case, since that is what I was debugging, and I suspected this might not work for the others.
I am closing this PR; let's stick with #1816.

artpol84 closed this on Jun 27, 2016