Skip to content

Add simple opal atomic store/load? #9722

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
gkatev opened this issue Dec 3, 2021 · 14 comments
Open

Add simple opal atomic store/load? #9722

gkatev opened this issue Dec 3, 2021 · 14 comments
Assignees

Comments

@gkatev
Copy link
Contributor

gkatev commented Dec 3, 2021

Hi, I'm looking through OPAL's atomic utilities (opal/include/opal/sys/atomic.h), and I'm noticing that while there's a large collection of atomic operations, there exist no simple store/load functions (eg. opal_atomic_store or opal_atomic_load).

Would it be a good idea / is it possible to add these operations?

I can see some alternative solutions, like using another operation (eg. opal_atomic_add_fetch -- but what about performance?), or rolling my own version (gcc/asm), but it definitely sounds nice if the solution was integrated in opal's atomic library (?).

@devreal
Copy link
Contributor

devreal commented Dec 3, 2021

I think the assumption is that single-elements loads and stores are atomic, yet not synchronizing. OPAL provides opal_atomic_rmb and opal_atomic_wmb to order loads and stores, respectively. Would that work for your use-case?

@gkatev
Copy link
Contributor Author

gkatev commented Dec 3, 2021

I'm thinking of cases where one process is constantly updating (think sequence number style) a value and another is reading it. (not the case where I want to ensure ordering between two variables' acceses, where the barriers are useful). If the load/stores are not atomic, a partial value could theoretically be received.

I think writing 64-bit values on arm, or writing to non-64-bit-aligned ones on x86 is not atomic.

@devreal
Copy link
Contributor

devreal commented Dec 3, 2021

According to https://developer.arm.com/documentation/ddi0487/gb/, B2.2.1 64bit load/stores are atomic on 64-bit arm. Unaligned 64bit load/store is UB in C so I doubt the compiler will come to the rescue here (http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf page 74, 7)

Having said that, I am not opposed to introducing atomic load/store to opal. But what would be the memory ordering? Sequential consistency likely will perform worse than opal_atomic_wmb + store on some architectures. We could have acquire/release semantic for atomic load/store respectively, well documented of course, and replace some of the memory barriers. I'd definitely be interested in examples where that would be feasible in OMPI and in seeing the potential performance impact.

@devreal devreal self-assigned this Dec 3, 2021
@gkatev
Copy link
Contributor Author

gkatev commented Dec 3, 2021

I'd definitely say relaxed ordering. My specific use-case is about the closest thing I can get to ordinary C writes/reads, without being prone to errors.

A variant with other-than-relaxed ordering could also be created, as is also bound to be useful somewhere.

Edit: Another point where simple writes/reads may be problematic are 64-bit writes on 32-bit x86 (?). To be honest, for my case I only need atomic loads/stores for size_t, so it's possible (as it turns out!) that for the major architectures I'm absolutely fine without explicit atomics. In any way, it would be nice for future insterested parties if these atomics where implemented, even if just via a simple C assignment (or __atomic_store in the gcc version), so that looking up information for each architecture will not be necesarry.

@bwbarrett
Copy link
Member

I think we'd be open to a pull request, but I'm dubious they're actually useful. We still support compilers with limited atomic support and there's not a great way to express what you're asking for without compiler support. Yes, you could provide wrappers which are essentially just load and store, and rely on the behaviors we rely on today, but I'd almost rather we make the developer think about it, rather than letting the developer assume the compiler will do things that it won't do (like enforce all the atomicity rules).

@gkatev
Copy link
Contributor Author

gkatev commented Dec 7, 2021

Given the new (to me!) revelations that the compiler automatically naturally aligns types (?) (since it would be UB if it didn't), which did not occur to me at the time, and since more operations than I thought are "tear-free" (eg. 64-bit ops on arm) -- though I don't have a complete image for archs beyond x86 and arm --, I wouldn't be as hard-pressed to add these.

However, even if just one case exists (64-bit op on 32-bit x86?) where simple assignments do not cut it, the ideally portable code would optimally use explicit atomic operations. I agree that the developer should be aware of what happens behind the scenes, but from an atomics library viewpoint, the role of which is (?) to take some of this burden of the developer, these basic ops seem like a basic component. Of course the additions would also serve as documentation for future eyes.

In any case, it looks to me like they are a good fit and are non-intrusive. For example, there exist a bunch of similar opal_atomic_fetch_op_xx methods, all of which also have relaxed ordering. The implementation (compiler support version) should be simple enough:

static inline int32_t opal_atomic_load_32(opal_atomic_int32_t *addr)
{
    return __atomic_load_n(addr, __ATOMIC_RELAXED);
}
static inline void opal_atomic_store_32(opal_atomic_int32_t *addr, int32_t value)
{
    __atomic_store_n(addr, value, __ATOMIC_RELAXED);
}

Apart from _32 and _64, a _size_t variant could also be added similarly to other ops. Perhpas even an _int one (!), though that one is probably minimally necesary and would stand out like an odd duck :-).

If you also agree, I can go ahead and open a PR and take it from there.

@hjelmn
Copy link
Member

hjelmn commented Dec 7, 2021

Just to note. Open MPI no longer supports 32-bit platforms (x86, ppc32, arm32, etc). Also keep in mind an atomic load can be done as fetch add with a 0 operand and store with atomic swap.

@hjelmn
Copy link
Member

hjelmn commented Dec 7, 2021

Also keep in mind that Open MPI intends to move to using C11 atomics exclusively at some point in the future. Most of the custom atomic code will go away at that point.

@devreal
Copy link
Contributor

devreal commented Dec 7, 2021

I am not opposed to having simple atomic load/stores in opal. However, I don't think there is a use-case for it right now because inter-thread synchronization is synchronized through locks and/or fences, so it would essentially be dead code.

@devreal
Copy link
Contributor

devreal commented Dec 7, 2021

@gkatev If you need such infrastructure you can of course add it to your local branch. If that ever gets upstreamed we will have a use-case for atomic load/store and will include the them. I am also planning to revisit our current use of fences and atomics, if I come across a possible use-case I would sync up with you.

@gkatev
Copy link
Contributor Author

gkatev commented Dec 7, 2021

Yes this is all reasonable, and I've already taken care of my local code, so all is well for me. Since it looks like an overhaul will be taking place at some point I do suggest their future addition! (use case: single-writer-based synchronization, in an numerically incremental manner, without extra ordering flags)

(not sure if we/you want the issue closed -- feel free to do it)

@devreal
Copy link
Contributor

devreal commented Jun 18, 2022

Turns out, all increment/decrement/assignment accesses to an _Atomic variable are atomic with sequential consistency [1]. This nullifies all the efforts to avoid atomic operations if not needed (in the opal_thread_* functions, for example) and adds atomic operations where you wouldn't expect them (OBJ_CONSTRUCT for example, and other places where values are assigned to atomic variables). See the below screenshot of a perf profile of opal_thread_add_fetch_32 where both branches (the atomic on top and the supposedly non-atomic branch at the bottom) in fact use atomic operations on x86 (compiled with GCC 11.2.0):

image

I am working on a patch to introduce OPAL_ATOMIC_RELAXED_LOAD and OPAL_ATOMIC_RELAXED_STORE that map to atomic_load_explicit and atomic_store_explicit with relaxed ordering for C11 atomics and regular assignment for all others implementations. It's a mess though, because finding all instances where atomic variables are assigned and no atomicity is required is tedious...

@gkatev Was that motivating your initial question about loads and stores? This struck me while doing some profiling and I didn't realize that was an issue earlier.

[1] https://en.cppreference.com/w/c/language/atomic

@gkatev
Copy link
Contributor Author

gkatev commented Jun 18, 2022

So, you refer to *addr = *addr operator delta (in OPAL_THREAD_DEFINE_ATOMIC_OP), which when not using threads is meant to be non-atomic (?)

Are there any opal atomics implementations that rely on the implicit ordering of _Atomic? Would it harm not including it in the typedefs of opal_atomic_xxx_t? (or perhaps alternatively not having it when not using threads??)

My initial request would indeed be solved by functions that call atomic_load/store_explicit with relaxed ordering! You mean if my motivation was for opal_atomic_store/load with non-seqcst-implied ordering? From what I remember there where (are?) no opal atomic functions that only store or load (but e.g. there are add_fetch).

My motivation here for the existence of opal_atomic_store/load with relaxed ordering was portability. Because while in most architectures a non-atomic write/read would be equivalent, maybe on some bizarre architecture it won't be (??). Essentially, the same way that one uses __atomic_store_n(&dst, val, __ATOMIC_RELAXED) instead of dst = val, just to be safe, I also looked to use opal_atomic_store(&dst, val).

@devreal
Copy link
Contributor

devreal commented Jun 19, 2022

So, you refer to *addr = *addr operator delta (in OPAL_THREAD_DEFINE_ATOMIC_OP), which when not using threads is meant to be non-atomic (?)

That's right, it's supposed to be non-atomic addition and assignment.

Are there any opal atomics implementations that rely on the implicit ordering of _Atomic? Would it harm not including it in the typedefs of opal_atomic_xxx_t? (or perhaps alternatively not having it when not using threads??)

IIRC, the reason we included _Atomic is so that patterns like the one below are not optimized out:

while (atomic_var == 0) { ... }

Not sure where such patterns are used. Alternatively, using volatile instead (as other atomic backends do) would achieve the same effect in the example above but the canonical way in C11 is _Atomic (and it safes us from casting around).

My understanding is that throughout the code base we rely on explicit memory barriers (opal_atomic_rmb() and opal_atomic_wmb()) to ensure ordering between operations. So no code relies on any ordering guarantees just from loads and stores on _Atomic operations.

IMHO, C11 _Atomic is a mistake and the issue here is just one example. C++20 has finally introduce std::atomic_ref to allow atomic access to non-atomic variables, which really should be the norm. I will post my PR soon and leave it to the community to decide how to proceed. For the record, the difference for a local irecv - send - wait is significant: 0.22us with the superfluous atomic operations vs 0.16us with relaxed load/stores.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants