Skip to content

Topic/osc pt2pt fixes #2581

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 11 commits into from
Closed

Conversation

jjhursey
Copy link
Member

@jjhursey jjhursey commented Dec 14, 2016

Fixes related to Issue #2505

  • Note that this PR is against v2.0.x. That's just to make it easier on our testing. We will file a real PR against master once we have more of this working.
  • With these patches the single threaded (1sided.c) case seems to be working.
  • The multithreaded case is not working. These patches are progress towards that, but there are still outstanding hangs and wrong answers being generated by the multithreaded version of the test (mt_1sided.c).
  • We will keep this PR updated with any progress that is made. We wanted to get this posted to have more eyes on the problem from the community.

I'm going to turn off the CI for a little while:

  • bot:notest (IBM's version)
  • [skip ci] (Default for GHPRB)

 * `ompi_osc_pt2pt_sync_pscw_peer` needs to be protected when called, and
   the `sync` structure needs to be protected in the `start` function.
   - This is to prevent the `ompi_osc_pt2pt_sync_pscw_peer` from processing
     while `start` is also in progress on the same window in two different
     threads - namely the 'progress' thread and the 'main' thread.
   - This seems to happen when the 'main' thread is part way through the
     `start` function then the progress thread starts processing the `post`
     message received from another peer for this window. Both functions try
     to access the `peer_list` portion of the structure and a NULL is stepped
     on in the `ompi_osc_pt2pt_sync_pscw_peer` function.
   - This patch locks the `sync` structure for the duration of the `start`
     and ensures that whenever `ompi_osc_pt2pt_sync_pscw_peer` is called the
     caller is holding the `sync->lock`. This provides exclusivity to this
     structure ensuring that the latter function sees a fully updated structure.
…ing.

 * Otherwise one thread could be in `opal_list_append` while the second
   is in `opal_list_remove_first` in `ompi_osc_pt2pt_progress_pending_acc`
   - The former was holding the `module` lock, while the latter was holding
     the `accumulate` lock so neither was properly protected.
   - This resulted in a segv when there was concurrent access.
 * Mark's patch
 * This condition was introduced in:
   - open-mpi@9444df1
   - Would like to review with Nathan on this patch.
 * put is waiting for `eager_send_active` to become `true`, but (for some reason) when
   incoming post()s or ACKs of lock requests trigger this function it declines to
   transition `eager_send_active` to `true` in the case of Win_lock_all.
 * It is possible for us to get to this selection statement with `tmp_sync`
   set to `REQUEST_COMPLETED` which causes a segv when we try to dereference
   that value inside `wait_sync_update`. So check against both `REQUEST_PENDING`
   and `REQUEST_COMPLETED` before attempting that function.
@jjhursey jjhursey force-pushed the topic/osc-pt2pt-fixes branch from d47dd92 to b2e80a9 Compare December 14, 2016 22:13
 - From Mark
 - This replaces the previous commit, making it more focused:
   - "osc/pt2pt: Fix put hang when used with win_lock_all"
 - Change lor to land in conditional
 - Make sure to grab the lock before waiting in the conditoin
@jjhursey
Copy link
Member Author

We have the single threaded fixes in PR #2590 and PR #2593. We'll post the multi-threading fixes in a separate PR against master for review.

@jjhursey jjhursey closed this Dec 16, 2016
@jjhursey jjhursey deleted the topic/osc-pt2pt-fixes branch March 9, 2021 15:21
awlauria added a commit to awlauria/ompi that referenced this pull request Jun 21, 2022
OpenPMIx changes:

6c9d3dd - Fix pcompress/zlib implementation (open-mpi#2625)
ac0654a - Fix typos.
b5ab80c - gds/shmem: Support multiple local clients.
43d3345 - Checkpoint gds/shmem work.
3f7b1d0 - Add PMIx Standard version info to pmix_info
fa15e0d - Update the PMIx ABI Query attributes to match the standard
dac9573 - Checkpoint gds/shmem work.
92e20e5 - Fix typo in setting hwloc verbosity
dad0fb8 - Some initial valgrind cleanup
c00f97f - TMA: Integrate more custom memory allocator infrastructure.
1286709 - Require flex only when keyval_lex.c is not provided
af5dd49 - Optimize the file descriptor cleanup on OSX
137cc3d - Support colocation of processes
49a28d3 - Initialize pmix_info_t flags when loading
e11fe2a - Save a legacy URI to support older tools
2f13feb - Deduplicate key/value storage for qualified values.
62d675f - Revert "Rename mca.h to pmix_mca.h to avoid include-path confusion"
288eed9 - Add a few missing renames
4d2e2b5 - Rename mca.h to pmix_mca.h to avoid include-path confusion
7c72657 - pmix_fd: cap the max FD to try to close
594303d - configure.ac: update directory space check
5ccd945 - Checkpoint more gds/shmem work.
34a26e8 - configury: do look for sed
4782e9a - Complete support for group-level info
b54565f - Address pedantic compiler warnings (open-mpi#2581)
90c06ff - Checkpoint gds/shmem work. (open-mpi#2579)
3766868 - Silence pedantic compiler warnings. (open-mpi#2580)
958d6bb - Modify the key-index support
ec77b2b - Correctly handle qualified values
b2f14cf - Remove GDS envar disablement
4e664bd - Checkpoint data storage and retrieval work in gds/shmem.

PRRTe changes:

df7d17d0a3 - Support colocation of spawned procs
6a77daafb6 - Fix a few bugs/leaks in the OOB subsystem
551c927741 - Remove stale Java support
48e44f4203 - Avoid PMIx server release of HWLOC topology
9fe286019e - Use fileno_unlocked if available
802d4d4349 - Incremental valgrind improvements
3083bdbec4 - Revert "Optimize the file descriptor cleanup on OSX"
5ee9b5abba - prrte: fix core dump while printing stack-trace
87df79b900 - Optimize the file descriptor cleanup on OSX
f09d98a4f7 - odls: fix alps compilation problem
db194bb1ab - configury: do look for sed
414b4748c2 - configure.ac: update directory space check
4246494bbb - Set the min PMIx version to v4.1.3
8103bff788 - Remove setting of PRTE_MCA_prte_base_help_aggregate.
da12b9da13 - Restore noloop for logging.

Signed-off-by: Austen Lauria <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants