Commit 71f378d

btl/smcuda: Add atomic_wmb() before sm_fifo_write
This change fixes #12270.

Testing on the c7g instance type (arm64) confirms that this change eliminates the hangs and crashes previously observed in roughly 1 of every 30 runs of the IMB alltoall benchmark. It was tested with over 300 runs and no failures.

The write memory barrier prevents other CPUs from observing the fifo update before they observe the updated contents of the header itself. Without the barrier, uninitialized header contents caused the crashes and invalid data.

Signed-off-by: Luke Robison <[email protected]>
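As a hedged illustration of the ordering described above, the following C11 sketch publishes a header to a consumer thread through a single atomic slot. It is not Open MPI code: `struct demo_hdr`, `slot`, and `run_publish_demo()` are hypothetical names, and `atomic_thread_fence(memory_order_release)` is used only as an approximation of `opal_atomic_wmb()`, paired with an acquire fence on the reader side. Without the release fence the program has a data race under the C11 memory model, and on weakly ordered hardware such as arm64 the consumer could read stale header fields after seeing the slot.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical header type; the real smcuda fragment header differs. */
struct demo_hdr {
    int tag;
    size_t len;
};

static struct demo_hdr payload;           /* plain, non-atomic header    */
static _Atomic(struct demo_hdr *) slot;   /* stand-in for one fifo slot  */

static void *producer(void *arg)
{
    (void)arg;
    payload.tag = 42;                     /* 1. fill in the header       */
    payload.len = 128;
    /* 2. Write barrier (approximating opal_atomic_wmb()): the header
     *    stores above cannot become visible after the store below. */
    atomic_thread_fence(memory_order_release);
    /* 3. Publish the header pointer, playing the role of the fifo write. */
    atomic_store_explicit(&slot, &payload, memory_order_relaxed);
    return NULL;
}

static void *consumer(void *arg)
{
    struct demo_hdr *copy = arg;
    struct demo_hdr *h;
    /* Spin until the producer has published a pointer. */
    while ((h = atomic_load_explicit(&slot, memory_order_relaxed)) == NULL)
        ;
    /* Read-side barrier pairing with the producer's release fence. */
    atomic_thread_fence(memory_order_acquire);
    *copy = *h;                           /* sees tag 42 and len 128     */
    return NULL;
}

/* Runs one producer/consumer round and returns what the consumer read. */
struct demo_hdr run_publish_demo(void)
{
    pthread_t p, c;
    struct demo_hdr seen = {0, 0};
    atomic_store(&slot, NULL);            /* reset state between rounds  */
    payload.tag = 0;
    payload.len = 0;
    pthread_create(&c, NULL, consumer, &seen);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return seen;
}
```

Build with `-pthread`. A harness can call `run_publish_demo()` many times and check that every returned header has `tag == 42` and `len == 128`.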
1 parent 7fc4535 commit 71f378d

1 file changed, 2 lines changed (+2, -0)

opal/mca/btl/smcuda/btl_smcuda_fifo.h

Lines changed: 2 additions & 0 deletions
```diff
@@ -85,6 +85,8 @@ static void add_pending(struct mca_btl_base_endpoint_t *ep, void *data, bool res
 #define MCA_BTL_SMCUDA_FIFO_WRITE(endpoint_peer, my_smp_rank, peer_smp_rank, hdr, resend, \
                                   retry_pending_sends, rc) \
     do { \
+        /* memory barrier: ensure writes to the hdr have completed */ \
+        opal_atomic_wmb(); \
         sm_fifo_t *_fifo = &(mca_btl_smcuda_component.fifo[peer_smp_rank][FIFO_MAP(my_smp_rank)]);\
         \
         if (retry_pending_sends) { \
```
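The barrier sits at the very top of the write macro, so every store to `hdr` made by the caller is ordered before any fifo slot is touched. The sketch below shows that placement in a minimal single-producer/single-consumer ring of header pointers. It is hypothetical and is not Open MPI's `sm_fifo_t`: `struct demo_fifo`, `demo_fifo_init()`, `demo_fifo_write()`, `demo_fifo_read()`, and `DEMO_FIFO_SIZE` are illustrative names, and C11 release/acquire fences stand in for `opal_atomic_wmb()` and a matching read-side barrier.

```c
#include <stdatomic.h>
#include <stddef.h>

#define DEMO_FIFO_SIZE 4   /* hypothetical capacity; must be a power of two */

/* Hypothetical SPSC fifo of header pointers; a NULL slot means "empty". */
struct demo_fifo {
    _Atomic(void *) slots[DEMO_FIFO_SIZE];
    size_t head;   /* next slot to write; used only by the producer */
    size_t tail;   /* next slot to read; used only by the consumer  */
};

void demo_fifo_init(struct demo_fifo *f)
{
    for (size_t i = 0; i < DEMO_FIFO_SIZE; i++)
        atomic_init(&f->slots[i], NULL);
    f->head = f->tail = 0;
}

/* Producer: the caller has already filled in *hdr. Returns 0 on success,
 * or -1 if the target slot is still occupied (the caller must retry). */
int demo_fifo_write(struct demo_fifo *f, void *hdr)
{
    /* Fence first, mirroring the opal_atomic_wmb() added to the macro. */
    atomic_thread_fence(memory_order_release);
    _Atomic(void *) *s = &f->slots[f->head & (DEMO_FIFO_SIZE - 1)];
    /* Acquire pairs with the consumer's release store of NULL. */
    if (atomic_load_explicit(s, memory_order_acquire) != NULL)
        return -1;
    atomic_store_explicit(s, hdr, memory_order_relaxed);
    f->head++;
    return 0;
}

/* Consumer: returns the next header pointer, or NULL if the fifo is empty. */
void *demo_fifo_read(struct demo_fifo *f)
{
    _Atomic(void *) *s = &f->slots[f->tail & (DEMO_FIFO_SIZE - 1)];
    void *hdr = atomic_load_explicit(s, memory_order_relaxed);
    if (hdr == NULL)
        return NULL;
    /* Pairs with the producer's release fence: header contents are visible. */
    atomic_thread_fence(memory_order_acquire);
    /* Hand the slot back to the producer. */
    atomic_store_explicit(s, NULL, memory_order_release);
    f->tail++;
    return hdr;
}
```

Placing the fence before the full-slot check, rather than just before the slot store, costs a barrier even when the write fails, but it keeps the ordering rule simple: nothing the caller wrote to the header can be reordered past any fifo access.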