nonblocking reductions in Fortran with non-contiguous buffers of different layouts #663
Comments
@RolfRabenseifner do you have any thoughts here?
see mpi-forum/mpi-issues#663; we will need to do more work for this. Signed-off-by: Jeff Hammond <[email protected]>
Nowadays, I expect that many MPI libraries define MPI_SUBARRAYS_SUPPORTED (at least in the mpi_f08 module) as .TRUE.
Please read it again carefully. The thing we have specified is not implementable. You will not find any implementations of this, regardless of what MPI_SUBARRAYS_SUPPORTED says. If you have tests that use MPI_I(ALL)REDUCE specifically and that work, please share the code and say which implementation handles them correctly.
Section 6.9.1 of MPI 4.0 states:
So the layout of both buffers must be identical (AFAICS; are there exceptions in Fortran where the Fortran native type differs from the MPI type description?). Anyway, completion callbacks are under discussion in the hybrid working group (https://github.com/mpiwg-hybrid/mpi-standard/pull/1).
Thanks. The full text there is useful.
We made a mistake in overlooking the fact that Fortran subarray arguments can give the send and receive buffers different memory layouts even though there is only one datatype argument.

The good news is that I can solve this in VAPAA with generalized requests, although that only works with an extra thread to drive progress.

If I can do it with generalized requests, implementations should be able to do it internally without using threads, although I do not expect to see support for this any time soon.

Even though it is technically possible to implement what we have specified, it is not possible to implement it in a reasonable way, and there is no value to our user community in insisting upon something that nobody is ever going to implement.

We should therefore add a restriction to the above text requiring that the relative memory layouts of the input and output buffers in Fortran be the same, since that is consistent with the intent of the current text if reinterpreted from the meaning in C.
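For what it's worth, here is a minimal sketch (not VAPAA's actual code) of how standard generalized requests let the temporaries be freed at completion; it assumes a helper thread performs the reduction on packed copies and scatters the result back, since plain generalized requests provide no progress hook:

```c
#include <mpi.h>
#include <stdlib.h>

/* Illustrative state: contiguous copies of the noncontiguous buffers. */
typedef struct {
    void *packed_in;
    void *packed_out;
} temp_state;

static int query_fn(void *extra_state, MPI_Status *status) {
    MPI_Status_set_elements(status, MPI_BYTE, 0);
    MPI_Status_set_cancelled(status, 0);
    return MPI_SUCCESS;
}

static int free_fn(void *extra_state) {
    /* Runs once the request is complete and freed: safe to clean up. */
    temp_state *s = extra_state;
    free(s->packed_in);
    free(s->packed_out);
    free(s);
    return MPI_SUCCESS;
}

static int cancel_fn(void *extra_state, int complete) {
    (void)extra_state; (void)complete;
    return MPI_SUCCESS; /* cancellation not supported in this sketch */
}

/* Start: MPI_Grequest_start(query_fn, free_fn, cancel_fn, s, &req);
 * A helper thread then runs the blocking reduction on s->packed_in and
 * s->packed_out, scatters the result back into the Fortran subarray,
 * and calls MPI_Grequest_complete(req) so the user's MPI_Wait returns. */
```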
I don't think we made a mistake; the memory layout as defined by the standard assumes flat, contiguous memory addressing. I can imagine how some MPI datatype concepts (such as extent and bounds) are supposed to work with Fortran slices and subarrays, but getting it right seems extremely complicated and error-prone for anything but predefined types. Should we prohibit the use of slices/subarrays with anything but predefined types?

My Fortran knowledge being extremely basic, I wrote a small code to check what the Fortran compiler does when passing slices to a C function (declared as external), and it appears that it creates temporaries, copies the Fortran slice into them, and then passes these temporaries to the C function. This makes sense for blocking functions, but seems like a bad approach for non-blocking ones, for both buffers. How does the Fortran compiler know when the temporary for the input buffer should be released, or when the output buffer should be scattered back into the Fortran slice/subarray and released?
I wonder whether supporting layout information of Fortran slices in MPI communication calls is the right approach after all. It breaks the assumption that the MPI datatype is the sole description of the layout of a piece of memory passed to an MPI function. IMHO, a cleaner approach would have been some automatic inference of an MPI datatype from a provided slice, a function that would be unique to the Fortran interface and would make data movement through a C MPI wrapper easier and consistent with the C semantics. But that might just be my Fortran-agnostic ignorance and is probably off-topic :)
If you need progress in generalized requests you may want to look at their extended form [1], which comes with a progress callback. Both MPICH and Open MPI have an implementation of them, although Open MPI does not expose the public API (I can dig up how I used it). It's a shame that the extension never made it into the standard. They've been quite useful for me in the past for stitching together a sequence of dependent operations. Maybe we should revisit them at some point...

[1] https://link.springer.com/chapter/10.1007/978-3-540-75416-9_33
In principle, blocking MPI routines always worked with any actual buffer with mpif.h and with both the mpi and the mpi_f08 module, even with "old" compilers that did not support TS 29113. I hope that this background info helps a bit to sort out whether there are still some problems.
@bosilca you need to look at Fortran 2018 CFI_cdesc_t when passed to TYPE(*), DIMENSION(..), ASYNCHRONOUS. That's the only way to get the buffer directly. This was added to Fortran 2018 specifically for MPI. Even though I figured out how to implement it, I agree that subarrays plus user datatypes are horrific, especially in the nonblocking case, and we should at least strongly discourage the combination in advice to users.
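For readers unfamiliar with the mechanism: when the Fortran dummy argument is declared TYPE(*), DIMENSION(..), ASYNCHRONOUS on a BIND(C) interface, the C side receives a CFI_cdesc_t descriptor from ISO_Fortran_binding.h and can recover the exact layout. A minimal sketch (the function name is mine, for illustration):

```c
#include <ISO_Fortran_binding.h>
#include <stddef.h>
#include <stdio.h>

/* Called from Fortran as `call inspect(buf)`, where buf may be any array
 * or subarray; the descriptor carries the base address, element size,
 * rank, and per-dimension extents and byte strides. */
void inspect(const CFI_cdesc_t *desc) {
    printf("base=%p elem_len=%zu rank=%d\n",
           desc->base_addr, desc->elem_len, (int)desc->rank);
    for (int d = 0; d < desc->rank; d++) {
        printf("dim %d: extent=%td stride_bytes=%td\n", d,
               (ptrdiff_t)desc->dim[d].extent,
               (ptrdiff_t)desc->dim[d].sm);
    }
}
```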
This is terrible. Implementations have invested plenty of effort to optimize data packing into bounded transfer buffers and partial overlapping of packing and transfers. Apparently, MPI 3.0 introduced a way to force implementations (or the compiler) to allocate unbounded temporary buffers and copy all necessary elements before returning from the nonblocking call. So, can we instead have an interface like this?

```fortran
use mpi_f08
type(MPI_Request) :: R
type(MPI_Datatype) :: T
integer, dimension(200) :: A
integer, dimension(200) :: B

call MPI_Type_create_vector_from_slice(B(1:200:2), T) ! proposed routine
call MPI_Type_commit(T)
call MPI_Iallreduce(A, B, 1, T, MPI_SUM, MPI_COMM_WORLD, R)
call MPI_Type_free(T)
call MPI_Wait(R, MPI_STATUS_IGNORE)
```

Benefits: the datatype remains the sole description of the buffer layout, so the Fortran semantics stay consistent with the C semantics.
I'm not a Fortran developer, but this inconsistency in the API bothers me.
I wrote code that creates datatypes from subarrays (MPICH has it too). That's a better way to do it, but not what we specified in MPI-3.

However, buffering need not be unbounded. Implementations can copy the `CFI_cdesc_t` and unwind it internally. It's just not done today, and nobody wants to do it.

What I want instead is MPI_Type_get_typemap(_size), which would produce the flattened datatype representation in memory. This is what I need to process subarrays plus datatypes at the same time. MPI-IO implementations have this flattening code. It's possible to write it today, but it's super tedious due to the recursive calls to the type envelope/contents routines.
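To illustrate the tedium, here is a sketch of such a flattener built on the standard envelope/contents introspection; it handles only the named, contiguous, and vector combiners and ignores nonzero lower bounds, whereas a real one needs every combiner:

```c
#include <mpi.h>
#include <stdio.h>

/* Print the typemap of t as (byte offset, byte length) pairs relative
 * to base.  Fixed-size arrays suffice for the combiners handled here. */
static void flatten(MPI_Datatype t, MPI_Aint base) {
    int ni, na, nd, combiner;
    MPI_Type_get_envelope(t, &ni, &na, &nd, &combiner);

    if (combiner == MPI_COMBINER_NAMED) {
        int size;
        MPI_Type_size(t, &size);
        printf("(%ld, %d)\n", (long)base, size);
        return;
    }

    int ints[8];
    MPI_Aint addrs[8];
    MPI_Datatype types[8];
    MPI_Type_get_contents(t, ni, na, nd, ints, addrs, types);

    MPI_Aint lb, extent;
    MPI_Type_get_extent(types[0], &lb, &extent);

    if (combiner == MPI_COMBINER_CONTIGUOUS) {
        for (int i = 0; i < ints[0]; i++)
            flatten(types[0], base + i * extent);
    } else if (combiner == MPI_COMBINER_VECTOR) {
        int count = ints[0], blocklen = ints[1], stride = ints[2];
        for (int i = 0; i < count; i++)
            for (int j = 0; j < blocklen; j++)
                flatten(types[0], base + ((MPI_Aint)i * stride + j) * extent);
    } /* every other combiner needs its own case: the tedious part */

    /* types returned by get_contents must be freed iff they are derived */
    int ci, ca, cd, ccomb;
    MPI_Type_get_envelope(types[0], &ci, &ca, &cd, &ccomb);
    if (ccomb != MPI_COMBINER_NAMED)
        MPI_Type_free(&types[0]);
}
```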
If you want this datatype routine, you can just use MPI_Type_create_subarray. It's better because it can capture the entire layout. The (h)vector approach is relative to the subarray start, not the array start, which means it's not possible to create a subarray datatype that corresponds to the parent array. At the very least, I should write advice to users about this in the Fortran chapter.
It seems that MPICH already has what I want: pmodels/mpich#6139. @hzhou, is there any plan to propose standardizing this in MPI-5? It would be really useful. Reductions with different subarray layouts and user-defined ops are still impossible, though.
Yes, we added the extension believing it is useful for applications to have more direct access to datatypes. But before making a formal proposal, we'd like to invite users to try it and collect feedback. You are very welcome to try the new API, and please send us your field notes.
@hzhou as you probably know, I have found it incredibly useful in VAPAA, and it would be great to have this in MPI-Next-Next.
From 6.9:
I have convinced myself that it is valid to do the dumbest possible implementation, which loops over the noncontiguous inputs element by element. Doing it element by element is much easier with MPICH's MPIX_Iov than with the MPI datatypes engine, so I want this issue to stay open to capture the need for that capability to be standardized.
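To make "dumbest possible" concrete, here is a sketch of an element-wise MPI_SUM over two different flattened layouts; the seg_t type is mine and stands in for whatever an iov-style query (such as MPICH's MPIX_Iov extension) or a hand-rolled flattener produces:

```c
#include <stddef.h>

/* One contiguous piece of a flattened datatype; lengths are assumed to
 * be multiples of sizeof(int) for this sketch. */
typedef struct { size_t offset; size_t len; } seg_t;

/* The input and output layouts may differ arbitrarily; only the total
 * element counts must match.  Walk both segment lists in lockstep and
 * reduce one element at a time. */
static void sum_ints(const char *inbase, const seg_t *in, size_t nin,
                     char *outbase, const seg_t *out, size_t nout) {
    size_t ii = 0, oi = 0, ioff = 0, ooff = 0;
    while (ii < nin && oi < nout) {
        const int *src = (const int *)(inbase + in[ii].offset + ioff);
        int *dst = (int *)(outbase + out[oi].offset + ooff);
        *dst += *src;                 /* MPI_SUM applied to one element */
        ioff += sizeof(int);
        ooff += sizeof(int);
        if (ioff == in[ii].len) { ii++; ioff = 0; }
        if (ooff == out[oi].len) { oi++; ooff = 0; }
    }
}
```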
Problem
This is almost impossible to implement:
In MPICH and VAPAA, non-contiguous Fortran subarrays are supported by creating a datatype corresponding to the `CFI_cdesc_t` coming from Fortran (e.g., the MPICH implementation). In most MPI functions, there is one datatype for every buffer. However, for reductions, there is only one datatype, so there is no way to capture the layout information of both the input and output buffers if they are different.
Furthermore, if we are creating a custom datatype, we have to use a custom reduction operator/function. `MPI_User_function` has only one datatype argument, so again it is impossible to carry along the required layout information.

Obviously, in blocking functions we can allocate temporary buffers and make contiguous copies where necessary, but in the non-blocking case we can't free the buffer, since we don't have completion callbacks.
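For reference, the C prototype of a user-defined reduction function from the standard shows the single datatype argument:

```c
typedef void MPI_User_function(void *invec, void *inoutvec,
                               int *len, MPI_Datatype *datatype);
```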
Proposal
I prefer Option 3...
Option 1 - completion callbacks (add stuff to the standard)
I can solve the nonblocking problem with completion callbacks that allow me to cleanup temporaries. This is a very general solution that has lots of use cases, but the Forum seems to be opposed to it.
In the blocking case, we don't have to do anything.
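To make the idea concrete, one possible shape for such an API; the routine name and signature below are hypothetical, invented purely for illustration:

```c
#include <mpi.h>

/* HYPOTHETICAL: no MPI implementation provides this routine today.
 * The callback would be invoked exactly once when req completes, with
 * state pointing at the temporaries that need to be released. */
typedef void (MPIX_Request_complete_cb)(MPI_Request req, void *state);

int MPIX_Request_set_complete_cb(MPI_Request req,
                                 MPIX_Request_complete_cb *cb,
                                 void *state);
```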
Option 2 - implementations are very complicated (no changes to the standard)
Implementations that do something far more complicated than what VAPAA and MPICH do right now can solve this, but it is not pretty. They have to pass the CFI information down into the implementation of reductions and handle different layouts, or they have to allocate temporaries and clean them up using an internal mechanism. I suspect implementations already have the capability to do the latter and would go that route, if only because most MPI implementations do not want to deal with `CFI_cdesc_t` any more than absolutely necessary.

Option 3 - prohibit this usage (backwards-incompatible changes to the standard)
The easy solution is for us to add a backwards-incompatible restriction that reductions require Fortran buffers to have equivalent layouts. This is only technically backwards-incompatible, because nobody supports this today (at least in the nonblocking case; the blocking case might work due to the implicit contiguous copy-in and copy-out that Fortran compilers perform when they see the `CONTIGUOUS` attribute).

I will argue that we implicitly require this anyway by virtue of having only one datatype argument, which means that users cannot pass buffers with different layouts from C. It is only because of the invisible layout differences associated with Fortran 2018 that users can do this.
Changes to the Text
Option 3 would add text to state that users are required to pass Fortran buffers of equivalent shape.
We need to be careful about how we say "equivalent shape" because one can have identical memory layouts corresponding to different Fortran shapes, and we only need to constrain the former.
Impact on Implementations
Option 3 requires no implementation changes.
Impact on Users
Users are no longer allowed to do crazy things that are at best unreliable today.
References and Pull Requests