btl/smcuda: Add atomic_wmb() before sm_fifo_write #12338

Merged: 1 commit merged into open-mpi:main on Feb 15, 2024

Conversation

lrbison
Contributor

@lrbison lrbison commented Feb 14, 2024

This change fixes #12270

Testing on a c7g instance (arm64) confirms that this change eliminates the hangs and crashes previously observed in 1 in 30 runs of the IMB alltoall benchmark. With the change applied, over 300 runs completed with no failures.

The write memory barrier prevents other CPUs from observing the FIFO update before they observe the updated contents of the header itself. Without the barrier, a reader could pick up a header whose contents were not yet visible, and those uninitialized header contents caused the crashes and invalid data.
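
For readers less familiar with the pattern, here is a minimal sketch of the ordering described above. It is not the smcuda code: the `demo_*` names and the single-slot FIFO are hypothetical, and a C11 release fence stands in for `opal_atomic_wmb()` (an assumption; the release fence is at least as strong as a store-store barrier).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a fragment header; not the real smcuda type. */
typedef struct {
    int    tag;
    size_t len;
} demo_hdr_t;

/* Hypothetical single-slot FIFO: a NULL slot means "empty". */
typedef struct {
    _Atomic(demo_hdr_t *) slot;
} demo_fifo_t;

/* Producer: fill in the header, then publish its address.
 * Returns 0 on success, -1 if the slot is still occupied. */
int demo_fifo_write(demo_fifo_t *fifo, demo_hdr_t *hdr, int tag, size_t len)
{
    assert(fifo != NULL && hdr != NULL);
    if (atomic_load_explicit(&fifo->slot, memory_order_acquire) != NULL) {
        return -1;
    }
    hdr->tag = tag;
    hdr->len = len;
    /* Write barrier: the header stores above must become visible
     * before the slot store below (stand-in for opal_atomic_wmb()). */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&fifo->slot, hdr, memory_order_relaxed);
    return 0;
}

/* Consumer: take the published header, or NULL if the slot is empty. */
demo_hdr_t *demo_fifo_read(demo_fifo_t *fifo)
{
    assert(fifo != NULL);
    demo_hdr_t *hdr = atomic_load_explicit(&fifo->slot, memory_order_relaxed);
    if (hdr == NULL) {
        return NULL;
    }
    /* Read barrier pairing with the producer's fence: header loads
     * made after this point see the values stored before publication. */
    atomic_thread_fence(memory_order_acquire);
    atomic_store_explicit(&fifo->slot, NULL, memory_order_release);
    return hdr;
}
```

If the release fence in `demo_fifo_write` is dropped, a weakly ordered CPU such as arm64 is allowed to make the slot store visible before the `tag` and `len` stores, which is the failure mode described above.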

Signed-off-by: Luke Robison <[email protected]>
Comment on lines +88 to +89
/* memory barrier: ensure writes to the hdr have completed */ \
opal_atomic_wmb(); \
Contributor

Is there a difference between placing the barrier before the write versus after it?

Typically I see the barrier after the actual write.

Contributor Author

Yes, I debated this myself as well.

MCA_BTL_SMCUDA_FIFO_WRITE is called from several places. Each of these would need an update to include a write barrier, and any new call site would need to include one as well. For this reason I felt it best to embed the barrier into the fifo_write as a "make sure hdr is committed" sort of step, rather than relying on every function that fills the header.
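
A rough sketch of that design choice, assuming a simplified macro rather than the real MCA_BTL_SMCUDA_FIFO_WRITE: every `sketch_*` and `SKETCH_*` name below is hypothetical, and a C11 release fence again stands in for `opal_atomic_wmb()`. The point is only that the callers fill the header and never mention a barrier themselves.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical header and FIFO types, for illustration only. */
typedef struct {
    int    type;
    size_t len;
} sketch_hdr_t;

typedef struct {
    _Atomic(sketch_hdr_t *) head;
} sketch_fifo_t;

/* The barrier lives inside the write macro, so no call site can forget it. */
#define SKETCH_FIFO_WRITE(fifo, hdr)                                       \
    do {                                                                   \
        /* memory barrier: ensure writes to the hdr have completed */      \
        atomic_thread_fence(memory_order_release);                         \
        atomic_store_explicit(&(fifo)->head, (hdr), memory_order_relaxed); \
    } while (0)

/* Two hypothetical call sites: each fills the header, then publishes it. */
void sketch_send_eager(sketch_fifo_t *fifo, sketch_hdr_t *hdr, size_t len)
{
    assert(fifo != NULL && hdr != NULL);
    hdr->type = 1;
    hdr->len = len;
    SKETCH_FIFO_WRITE(fifo, hdr);
}

void sketch_send_ack(sketch_fifo_t *fifo, sketch_hdr_t *hdr)
{
    assert(fifo != NULL && hdr != NULL);
    hdr->type = 2;
    hdr->len = 0;
    SKETCH_FIFO_WRITE(fifo, hdr);
}
```

The alternative is a barrier at the end of each header-filling function, which works but has to be repeated and remembered at every current and future call site.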

Member

The write in question here is not the integration of the header into the list but the writes related to the content of the header. This looks good to me.

Looking at the linked code, it appears that the current fifo_write also has a write barrier after the item integration. I can't figure out why we need that one.

@lrbison
Contributor Author

lrbison commented Feb 14, 2024

It is worth noting that btl/sm has an existing write memory barrier in its fifo_write, which suggests btl/sm doesn't need its own patch. It also has a trailing wmb that is missing in smcuda, but I can't justify it, so I'll leave that difference alone.

I was not able to reproduce the issue on main because I cannot trigger smcuda to be used on host memory. However, the code is similar enough that I believe this fix is applicable to main, v5.0.x, and v4.1.x.

@wenduwan
Contributor

@lrbison Is this ready to merge?

@lrbison
Contributor Author

lrbison commented Feb 15, 2024

Yes. I've been testing it continually in the background for the last 24 hours with 0 failures. It is ready.

@wenduwan wenduwan merged commit a55e9b2 into open-mpi:main Feb 15, 2024
@lrbison
Contributor Author

lrbison commented Feb 15, 2024

I will create the backports.

Successfully merging this pull request may close these issues.

btl smcuda hang in v4.1.5
3 participants