btl/smcuda: Add atomic_wmb() before sm_fifo_write #12338
This change fixes open-mpi#12270.

Testing on a c7g instance (arm64) confirms that this change eliminates the hangs and crashes previously observed in 1 in 30 runs of the IMB alltoall benchmark. Tested with over 300 runs and no failures.

The write memory barrier prevents other CPUs from observing the FIFO update before they observe the updated contents of the header itself. Without the barrier, a reader could see uninitialized header contents, which caused the crashes and invalid data.

Signed-off-by: Luke Robison <[email protected]>
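The ordering described above can be illustrated with a minimal sketch in portable C11. It is not the smcuda code: `demo_hdr_t`, `demo_slot`, `demo_fifo_write`, `demo_fifo_read`, and `demo_run` are hypothetical names, the FIFO is reduced to a single slot, and `atomic_thread_fence(memory_order_release)` is used as an assumed stand-in for `opal_atomic_wmb()`. The producer fills the header, issues the fence, and only then publishes the pointer; the consumer pairs this with an acquire fence after it observes the pointer.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a fragment header; not the smcuda layout. */
typedef struct {
    int tag;
    int len;
} demo_hdr_t;

#define DEMO_MSGS 1000

static demo_hdr_t demo_pool[DEMO_MSGS];

/* One-slot "fifo": NULL means the slot is empty. */
static _Atomic(demo_hdr_t *) demo_slot = NULL;

/* Publish a header. The barrier lives inside the write helper, so every
 * caller gets it without having to remember to add one. */
static void demo_fifo_write(demo_hdr_t *hdr)
{
    assert(hdr != NULL);
    /* Wait until the consumer has emptied the slot. */
    while (atomic_load_explicit(&demo_slot, memory_order_relaxed) != NULL) {
        sched_yield();
    }
    /* Assumed stand-in for opal_atomic_wmb(): make the header writes
     * visible before the pointer store below can be observed. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&demo_slot, hdr, memory_order_relaxed);
}

/* Wait for a published header and return it. */
static demo_hdr_t *demo_fifo_read(void)
{
    demo_hdr_t *hdr;
    while ((hdr = atomic_load_explicit(&demo_slot, memory_order_relaxed)) == NULL) {
        sched_yield();
    }
    /* Pairs with the release fence in demo_fifo_write(). */
    atomic_thread_fence(memory_order_acquire);
    return hdr;
}

static void *demo_producer(void *arg)
{
    (void)arg;
    for (int i = 0; i < DEMO_MSGS; i++) {
        /* Fill the header first ... */
        demo_pool[i].tag = i;
        demo_pool[i].len = 2 * i;
        /* ... then publish it. */
        demo_fifo_write(&demo_pool[i]);
    }
    return NULL;
}

/* Returns how many headers the consumer saw fully initialized,
 * or -1 if the producer thread could not be started. */
static int demo_run(void)
{
    pthread_t producer;
    int ok = 0;

    if (pthread_create(&producer, NULL, demo_producer, NULL) != 0) {
        return -1;
    }
    for (int i = 0; i < DEMO_MSGS; i++) {
        demo_hdr_t *hdr = demo_fifo_read();
        if (hdr->tag == i && hdr->len == 2 * i) {
            ok++;
        }
        /* Mark the slot empty so the producer can publish the next header. */
        atomic_store_explicit(&demo_slot, NULL, memory_order_relaxed);
    }
    pthread_join(producer, NULL);
    return ok;
}
```

With the fences in place, `demo_run()` should report every header as fully initialized. Removing the release fence makes the program formally racy; on a weakly ordered CPU such as arm64 the consumer may then read a header before its fields are written, which is the failure mode this PR describes.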
/* memory barrier: ensure writes to the hdr have completed */ \
opal_atomic_wmb(); \
Is there a difference between barrier before the write vs after?
Typically I see the barrier after the actual write.
Yes, I debated this myself as well.
MCA_BTL_SMCUDA_FIFO_WRITE is called from several places. Each of these would need an update to include a write barrier, and any new call would need to include that as well. For this reason I felt it best to embed the barrier into the fifo_write as a "make sure hdr is committed" sort of step rather than relying on all functions which fill the header.
The write in question here is not the integration of the header into the list but the writes related to the content of the header. This looks good to me.
Looking at the linked code, it appears that the current fifo_write also has a write barrier after the item integration. I can't figure out why we need that one.
It is worth noting that btl/sm has an existing write memory barrier in its fifo_write, which suggests btl/sm doesn't need its own patch. It also has a trailing wmb that is missing in smcuda, but I can't justify it, so I'll leave that difference alone. I was not able to reproduce the issue on main because I cannot trigger smcuda to be used on host memory; however, the code is similar enough that I believe this fix is applicable to main, v5.0.x and v4.1.x.
@lrbison Is this ready to merge?
Yes. I've been testing it continually in the background for the last 24 hours with 0 failures. It is ready.
I will create backports.