MPI_Waitsome performance improvement #1820
Conversation
This commit fixes a race condition discovered by @artpol84. The race happens when a signalling thread decrements the sync count to 0 and then goes to sleep. If the waiting thread runs and detects count == 0 before going to sleep on the condition variable, it will destroy the condition variable while the signalling thread is potentially still processing the completion. The fix is to add a non-atomic member to the sync structure that indicates another thread is handling the completion. Since the member will only be set to false by the initiating thread and the completing thread, the variable does not need to be protected. When destroying a condition variable the waiting thread needs to wait until the signalling thread is finished. Thanks to @artpol84 for tracking this down. Fixes open-mpi#1813 Signed-off-by: Nathan Hjelm <[email protected]>
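To make the ordering concrete, here is a minimal sketch of the pattern the commit message describes, written with plain pthreads and C11 atomics. The names wait_sync_t, wait_sync_update() and wait_sync_release(), and the field layout, are simplified stand-ins for illustration, not the actual Open MPI symbols.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int      count;      /* completions still outstanding */
    pthread_mutex_t lock;
    pthread_cond_t  condition;
    volatile bool   signaling;  /* deliberately non-atomic: only the waiter and
                                 * the last completer ever touch it */
} wait_sync_t;

/* At init: count = number of outstanding requests, signaling = (count > 0). */

/* Completing (signalling) thread. */
static void wait_sync_update(wait_sync_t *sync)
{
    /* atomic_fetch_sub returns the previous value */
    if ((atomic_fetch_sub(&sync->count, 1) - 1) != 0) {
        return;                          /* other completions still pending */
    }
    pthread_mutex_lock(&sync->lock);
    pthread_cond_signal(&sync->condition);
    pthread_mutex_unlock(&sync->lock);
    sync->signaling = false;             /* completer has let go of the sync */
}

/* Waiting thread, right before destroying the sync object. */
static void wait_sync_release(wait_sync_t *sync)
{
    while (sync->signaling) {
        /* busy-wait for the last completer to leave wait_sync_update() */
    }
    pthread_cond_destroy(&sync->condition);
    pthread_mutex_destroy(&sync->lock);
}
```

The key point is the spin on signaling in wait_sync_release(): the waiter may never sleep on the condition variable at all, so this spin is what keeps it from destroying the mutex and condition variable while the last completer is still between its atomic decrement and its unlock.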
Running a stress test against this now. Will update later.
Force-pushed from 432a849 to a85bd84
I have a few comments about this patch. Introducing WAIT_SYNC_SIGNALLED has the drawback of bringing back the original issue. Now the sync primitive might have a few requests associated with it, but it will be released by the creator thread even if one of the progress threads is still trying to activate the synchronization primitive (because signaling no longer protects the path where a thread still has a hand on the synchronization primitive). I think a simpler solution would be to simply call sync_update to remove the initial value and, if necessary, set the signalling field to false.

Isn't sync_sets equal to count - num_requests_null_inactive - num_requests_done in the first part of wait_some, and then to count - num_requests_null_inactive - num_requests_done in the second loop?

The patch improving MPI_Waitsome saves one atomic operation for each request that was already completed, in exchange for increasing the size of the request structure. Why not use the user-provided array (indices) instead of altering the request structure? I don't think there is anything in the standard that prevents us from using the array as a temporary buffer, as long as it contains the expected values upon return, and we know it must have the right size (or the application is incorrect).
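For illustration, a minimal sketch of the indices suggestion; the toy_request_t type, its fields, and waitsome_first_pass() are invented for the example and do not reflect the real ompi_request_t or req_wait.c.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Invented, simplified request type for the sketch. */
typedef struct {
    bool        active;    /* false for MPI_REQUEST_NULL / inactive requests */
    atomic_bool complete;  /* set by the progress/PML threads */
} toy_request_t;

/* First pass of a waitsome-like routine: record the positions of already
 * completed requests directly into the caller-supplied indices[] array,
 * so no extra field on the request structure is needed. */
static int waitsome_first_pass(int count, toy_request_t *requests[],
                               int indices[])
{
    int num_done = 0;
    for (int i = 0; i < count; ++i) {
        toy_request_t *req = requests[i];
        if (NULL == req || !req->active) {
            continue;                   /* skip null / inactive requests */
        }
        if (atomic_load(&req->complete)) {
            indices[num_done++] = i;    /* remember the slot for the second pass */
        }
    }
    return num_done;   /* 0 => the caller sets up a sync object and blocks */
}
```

Since MPI_Waitsome has to fill indices[] with the positions of the completed requests anyway, using it as scratch space costs nothing extra at return time.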
(request handling related)
@bosilca Ok, let me think more about that and try to consider your suggestions.
by avoiding extra atomic exchanges. The fix is based on the MPI spec, section 12.4.2 (Multiple threads completing the same request): "A program in which two threads block, waiting on the same request, is erroneous. Similarly, the same request cannot appear in the array of requests of two concurrent MPI_{WAIT|TEST}{ANY|SOME|ALL} calls. In MPI, a request can only be completed once. Any combination of wait or test that violates this rule is erroneous." We add a marked flag to the request structure. Only the MPI_Waitsome thread will use/access it by any means; PML threads will not see/touch it. So, given that any particular request can be used in at most one MPI_Waitsome, we are safe to make this change.
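A rough sketch of how such a marked flag might look, assuming a heavily simplified request type (the real ompi_request_t and the surrounding req_wait.c logic are more involved); only the thread calling MPI_Waitsome ever reads or writes marked, which is exactly the property MPI 12.4.2 guarantees.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    bool        active;
    atomic_bool complete;   /* set by the PML / progress threads */
    bool        marked;     /* touched only by the MPI_Waitsome caller: per MPI
                             * 12.4.2 a request appears in at most one concurrent
                             * WAIT/TEST call, so no atomic protection is needed */
} toy_request_t;

/* Pass 1: count completed requests, remembering which ones were already
 * done so pass 2 does not repeat the atomic exchange for them. */
static int mark_completed(int count, toy_request_t *requests[])
{
    int num_done = 0;
    for (int i = 0; i < count; ++i) {
        toy_request_t *req = requests[i];
        if (NULL == req || !req->active) continue;
        req->marked = atomic_load(&req->complete);
        if (req->marked) num_done++;
    }
    return num_done;
}

/* Pass 2: finalize the marked requests without touching their completion
 * state again; unmarked requests go through the normal completion path. */
static void collect_marked(int count, toy_request_t *requests[],
                           int indices[], int *outcount)
{
    int n = 0;
    for (int i = 0; i < count; ++i) {
        toy_request_t *req = requests[i];
        if (NULL != req && req->active && req->marked) {
            req->marked = false;     /* reset for a later waitsome call */
            indices[n++] = i;
        }
    }
    *outcount = n;
}
```

Unmarked active requests would still go through the regular atomic completion path in the second loop; the marked ones skip it, which is where the saved atomic exchanges come from.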
Closing this in favor of PR #1821
This PR makes sense only if
Will need to rebase afterwards.
@bosilca @hjelmn please check.
I'm addressing the following drawback:
https://github.com/hjelmn/ompi/blob/request_perfm_regression/ompi/request/req_wait.c#L412:L417
Still checking, but the waitsome test seems to pass.