nonblocking reductions in Fortran with non-contiguous buffers of different layouts #663

Open · jeffhammond opened this issue Jan 4, 2023 · 16 comments
Labels: mpi-6 (For inclusion in the MPI 5.1 or 6.0 standard), wg-fortran (Fortran Working Group)

Comments

@jeffhammond (Member) commented Jan 4, 2023

Problem

This is almost impossible to implement:

  use mpi_f08
  type(MPI_Request) :: R
  integer, dimension(300) :: A
  integer, dimension(200) :: B
  call MPI_Iallreduce(A(1:300:3), B(1:200:2), 100, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD, R)
  call MPI_Wait(R, MPI_STATUS_IGNORE)

In MPICH and VAPAA, non-contiguous Fortran subarrays are supported by creating a datatype corresponding to the CFI_cdesc_t coming from Fortran (e.g. MPICH implementation).
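
For illustration, the layouts that an implementation would have to synthesize from the two descriptors in the example above are two different vector types. A minimal mpi_f08 sketch of the equivalent user-level construction (not the MPICH code):

  use mpi_f08
  type(MPI_Datatype) :: send_layout, recv_layout

  ! Layout of A(1:300:3): 100 blocks of 1 integer, start-to-start distance of 3.
  call MPI_Type_vector(100, 1, 3, MPI_INTEGER, send_layout)
  ! Layout of B(1:200:2): 100 blocks of 1 integer, start-to-start distance of 2.
  call MPI_Type_vector(100, 1, 2, MPI_INTEGER, recv_layout)
  ! Each vector type fully captures one buffer's relative layout.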

In most MPI functions, there is one datatype argument for every buffer. Reductions, however, take a single datatype argument for both buffers, so there is no way to capture the layout information of the input and output buffers separately if they differ.

Furthermore, if we create a custom datatype, we also have to use a custom reduction operation. MPI_User_function has only one datatype argument, so again it is impossible to carry along the required layout information for both buffers.

Obviously, in blocking functions, we can allocate temporary buffers and make contiguous copies where necessary, but in the nonblocking case, we cannot free those temporaries at the right time because we do not have completion callbacks.
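
For the blocking case, a sketch of that workaround (assuming mpi_f08; the helper name and fixed sizes are only illustrative):

  subroutine blocking_workaround(a, b, comm)
    use mpi_f08
    implicit none
    integer, intent(in)        :: a(300)
    integer, intent(inout)     :: b(200)
    type(MPI_Comm), intent(in) :: comm
    integer, allocatable       :: tmp_in(:), tmp_out(:)

    tmp_in = a(1:300:3)         ! contiguous copy of the strided input
    allocate(tmp_out(100))
    call MPI_Allreduce(tmp_in, tmp_out, 100, MPI_INTEGER, MPI_SUM, comm)
    b(1:200:2) = tmp_out        ! scatter the result back into the strided output
    ! The temporaries can be freed here only because the call has completed;
    ! with MPI_Iallreduce there is no hook at completion time to do this.
    deallocate(tmp_in, tmp_out)
  end subroutine blocking_workaround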

Proposal

I prefer Option 3...

Option 1 - completion callbacks (add stuff to the standard)

I can solve the nonblocking problem with completion callbacks that allow me to clean up temporaries. This is a very general solution that has lots of use cases, but the Forum seems to be opposed to it.

In the blocking case, we don't have to do anything.

Option 2 - implementations are very complicated (no changes to the standard)

Implementations that do something far more complicated than what VAPAA and MPICH do right now can solve this, but it is not pretty. They have to pass the CFI information down into the implementation of reductions and handle different layouts, or they have to allocate temporaries and clean them up using an internal mechanism. I suspect implementations have the capability to do the latter already and would go that route, if only because most MPI implementations do not want to deal with CFI_cdesc_t any more than absolutely necessary.

Option 3 - prohibit this usage (backwards-incompatible changes to the standard)

The easy solution is for us to add a backwards-incompatible restriction that reductions require Fortran buffers to have equivalent layouts. This is only technically backwards-incompatible, because nobody supports this today (at least in the nonblocking case - the blocking case might work due to implicit contiguous copy-in and copy-out, which Fortran compilers do when they see the CONTIGUOUS attribute).

I will argue that we implicitly require this anyway by virtue of having only one datatype argument, which means that users cannot pass buffers with different layouts from C. It is only because of the invisible layout differences associated with Fortran 2018 that users can do this.
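
In concrete terms, the restriction would distinguish the following two calls (a sketch, not proposed normative text; assuming mpi_f08):

  use mpi_f08
  type(MPI_Request) :: R
  integer, dimension(300) :: A, B
  integer, dimension(200) :: C

  ! Allowed: both buffers have the same relative memory layout (stride 3).
  call MPI_Iallreduce(A(1:300:3), B(1:300:3), 100, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD, R)
  call MPI_Wait(R, MPI_STATUS_IGNORE)

  ! Disallowed under Option 3: same (count, datatype), different relative
  ! layouts, i.e. the case from the problem statement above.
  ! call MPI_Iallreduce(A(1:300:3), C(1:200:2), 100, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD, R)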

Changes to the Text

Option 3 would add text to state that users are required to pass Fortran buffers of equivalent shape.

We need to be careful about how we say "equivalent shape" because one can have identical memory layouts corresponding to different Fortran shapes, and we only need to constrain the former.
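
For example (assuming mpi_f08), the two buffers below have different Fortran shapes but identical, contiguous memory layouts; a call like this should remain legal under the proposed wording:

  use mpi_f08
  type(MPI_Request) :: R
  integer, dimension(100)    :: X   ! shape (100)
  integer, dimension(10, 10) :: Y   ! shape (10, 10): the same 100 contiguous integers

  call MPI_Iallreduce(X, Y, 100, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD, R)
  call MPI_Wait(R, MPI_STATUS_IGNORE)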

Impact on Implementations

Option 3 requires no implementation changes.

Impact on Users

Users are no longer allowed to do crazy things that are at best unreliable today.

References and Pull Requests

jeffhammond added the wg-fortran (Fortran Working Group) and mpi-6 (For inclusion in the MPI 5.1 or 6.0 standard) labels on Jan 4, 2023
@jeffhammond (Member Author)

@RolfRabenseifner do you have any thoughts here?

jeffhammond added a commit to jeffhammond/vapaa that referenced this issue Jan 4, 2023
see mpi-forum/mpi-issues#663

we will need to do more work for this

Signed-off-by: Jeff Hammond <[email protected]>
@RolfRabenseifner

Nowadays, I expect that many MPI libraries define MPI_SUBARRAYS_SUPPORTED (at least in the mpi_f08 module) as .TRUE.
With this, your problem is solved.
MPI-3.0 already specified that, when compiled with a Fortran 2018 compiler, MPI_SUBARRAYS_SUPPORTED must be set to .TRUE. in the mpi_f08 module.
Therefore, I recommend closing this issue.

@jeffhammond (Member Author) commented Jan 4, 2023

Please read it again carefully. The thing we have specified is not implementable. You will not find any implementations of this, regardless of what MPI_SUBARRAYS_SUPPORTED says. If you have tests using MPI_I(ALL)REDUCE specifically that work, please share the code and which implementation does them correctly.

@devreal commented Jan 4, 2023

Section 6.9.1 of MPI 4.0 states:

The input buffer is defined by the arguments sendbuf, count and datatype; the output buffer is defined by the arguments recvbuf, count and datatype

So the layout of both buffers must be identical (afaics; are there exceptions with Fortran where the Fortran native type differs from the MPI type description?).

Anyway, completion callbacks are under discussion in the hybrid working group (https://github.com/mpiwg-hybrid/mpi-standard/pull/1).

@jeffhammond (Member Author)

Thanks. The full text there is useful.

The input buffer is defined by the arguments sendbuf, count and datatype; the output buffer is defined by the arguments recvbuf, count and datatype; both have the same number of elements, with the same type. The routine is called by all group members using the same arguments for count, datatype, op, root and comm. Thus, all processes provide input buffers of the same length, with elements of the same type as the output buffer at the root. Each process can provide one element, or a sequence of elements, in which case the combine operation is executed element-wise on each entry of the sequence.

We made a mistake in overlooking the fact that Fortran (count,datatype) do not fully specify the relative memory layout of a buffer the way they do in C. One can have an arbitrarily large number of relative memory layouts associated with a single (count,datatype) in the context of Fortran subarrays. I think it's on the order of SIZE_MAX to the 15th power, in theory, although the subset of those that fit into the available memory of an MPI process is far smaller.

The good news is that I can solve this in VAPAA with generalized requests, which only works with MPI_THREAD_MULTIPLE, but I am willing to accept such limitations in VAPAA because I don't plan to support 100% of the standard.

If I can do it with generalized requests, implementations should be able to do it internally without using threads, although I do not expect to see support for this any time soon.

Even though it is technically possible to implement what we have specified, it's not possible to implement it in a reasonable way, and there is no value to our user community in insisting upon something that nobody is ever going to implement. We should therefore add a restriction to the above text requiring that the relative memory layouts of the input and output buffers in Fortran be the same, since that is consistent with the intent of the current text when reinterpreted from its meaning in C.

@bosilca (Member) commented Jan 5, 2023

I don't think we made a mistake; the memory layout as defined by the standard assumes a flat and contiguous memory addressing. I can imagine how some MPI datatype concepts (such as extent and bounds) are supposed to work with Fortran slices and subarrays, but getting it right seems extremely complicated and error-prone for anything but predefined types. Should we prohibit the use of slices/subarrays with anything but predefined types?

My Fortran knowledge being extremely basic, I wrote a small code to check what the Fortran compiler does when passing slices to a C function (declared as external), and it appears that it creates temporaries, copies the Fortran slice into them, and then passes these temporaries to the C function. This makes sense for blocking functions, but seems like a bad approach for nonblocking ones, for both buffers. How does the Fortran compiler know when the temporary for the input buffer should be released, or when the output buffer should be scattered back into the Fortran slice/subarray and released?
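
A minimal reconstruction of that kind of test (this is not the actual code; c_consumer is a hypothetical external routine with an implicit interface, e.g. implemented in C):

  program slice_copy_test
    implicit none
    external :: c_consumer    ! hypothetical routine, e.g. a C function taking int*
    integer :: i
    integer :: a(12)

    a = [(i, i = 1, 12)]
    ! With an implicit interface the compiler cannot pass a descriptor, so it
    ! typically allocates a contiguous temporary, copies a(1:12:3) into it,
    ! passes the temporary's address, and copies back after the call returns.
    ! For a nonblocking MPI call, that temporary is gone before completion.
    call c_consumer(a(1:12:3))
  end program slice_copy_test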

@devreal commented Jan 5, 2023

We made a mistake in overlooking the fact that Fortran (count,datatype) do not fully specify the relative memory layout of a buffer the way they do in C.

I wonder whether supporting layout information of Fortran slices in MPI communication calls is the right approach after all. It breaks the assumption that the MPI datatype is the sole description of the layout of a piece of memory passed to an MPI function. IMHO, a cleaner approach would have been some automatic inference of an MPI datatype from a provided slice, a function that would be unique to the Fortran interface and would make data movement through a C MPI wrapper easier and consistent with the C semantics. But that might just be my Fortran-agnostic ignorance and is probably off-topic :)

The good news is that I can solve this in VAPAA with generalized requests, which only works with MPI_THREAD_MULTIPLE, but I am willing to accept such limitations in VAPAA because I don't plan to support 100% of the standard.

If you need progress in generalized requests you may want to look at their extended form [1], which comes with a progress callback. Both MPICH and Open MPI have an implementation of them, although Open MPI does not expose the public API (I can dig up how I used it). It's a shame that the extension never made it into the standard. They've been quite useful for me in the past to stitch together a sequence of dependent operations. Maybe we should revisit them at some point...

[1] https://link.springer.com/chapter/10.1007/978-3-540-75416-9_33

@RolfRabenseifner

In principle, blocking MPI routines have always worked with any actual buffer argument, with mpif.h and with both the mpi and the mpi_f08 modules, even with "old" compilers that did not support TS 29113.
This means that if the user handed over a strided buffer (like a(1:100:3), i.e., every 3rd element of a(1:100)), then the compiler copied the strided data into a contiguous scratch array, called the MPI routine, and copied the result data from that scratch array back into the original strided array after the MPI routine returned.
Of course the compiler does not care whether it is a blocking or nonblocking MPI routine.
And with nonblocking routines, this handling of strided buffers was broken, as already reported in MPI-2.0, because the scratch array is removed when the nonblocking routine returns instead of when the nonblocking request completes.
Therefore, with MPI-3.0, we introduced the new way of declaring Fortran buffer arguments, which allows the MPI library itself to do this copying into an internal scratch array and, of course, to remove this internal scratch array as part of completion (and not as part of the return from the nonblocking call).
Of course, this copying may be optimized (with on-the-fly methods), but it need not be, because it may be seen as a special Fortran feature. If the mpi_f08 module (and the mpi module) are implemented in this naive way, then MPI_Iallreduce has no problem.

I hope that this background info helps a bit to sort out whether there are still some problems.

@jeffhammond (Member Author)

@bosilca you need to look at Fortran 2018 CFI_cdesc_t when passed to TYPE(*), DIMENSION(..), ASYNCHRONOUS. That's the only way to get the buffer directly. This was added to Fortran 2018 specifically for MPI.

Even though I figured out how to implement it, I agree that subarrays plus user datatypes is horrific, especially in the nonblocking case, and we should at least strongly discourage it as advice to users.

@devreal commented Jan 6, 2023

Therefore, with MPI-3.0, we introduced the new way of declaring Fortran buffer arguments, which allows the MPI library itself to do this copying into an internal scratch array and, of course, to remove this internal scratch array as part of completion (and not as part of the return from the nonblocking call).

This is terrible. Implementations have invested plenty of effort to optimize data packing into bounded transfer buffers and partial overlapping of packing and transfers. Apparently, MPI 3.0 introduced a way to force implementations (or the compiler) to allocate unbounded temporary buffers and copy all necessary elements before returning from MPI_Isend. And the user won't know...

So, can we instead have an interface like this?

  use mpi_f08
  type(MPI_Request) :: R
  type(MPI_Datatype) :: T
  integer, dimension(200) :: A
  integer, dimension(200) :: B
  ! MPI_Type_create_vector_from_slice is the proposed (not yet existing) routine.
  call MPI_Type_create_vector_from_slice(B(1:200:2), T)
  call MPI_Type_commit(T)
  call MPI_Iallreduce(A, B, 1, T, MPI_SUM, MPI_COMM_WORLD, R)
  call MPI_Type_free(T)
  call MPI_Wait(R, MPI_STATUS_IGNORE)

Benefits:

  1. Simplified MPI datatype creation based on a language's native description (could be useful in other languages too).
  2. Only the MPI datatype represents the memory layout and can be used for optimizations by the implementation.
  3. No unbounded temporary buffers.
  4. The MPI datatype can be reused (instead of being recreated on every operation as in the MPICH implementation linked above).
  5. Avoid the confusion about memory layouts in the original example (by making it clear that there can be only one memory layout description). If we want multiple layouts then we should introduce a new set of functions that have different MPI datatypes for input and output buffers (like we have in MPI_Sendrecv).

I'm not a Fortran developer but this inconsistency in the API bothers me.

@jeffhammond (Member Author)

I wrote code that creates datatypes from subarrays (MPICH has it too). That's a better way to do it, but not what we specified in MPI-3.

However, buffering need not be unbounded. Implementations can copy the CFI_cdesc_t and unwind it internally. It's just not done today and nobody wants to do it.

What I want instead is MPI_Type_get_typemap(_size), which would produce the flattened datatype representation in memory. This is what I need in order to process subarrays plus datatypes at the same time.

MPI I/O implementations have this flattening code. It's possible to write it today, but it's super tedious due to the recursive calls to MPI_Type_get_envelope and MPI_Type_get_contents that it requires.
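
For reference, a sketch of the recursion that flattening requires today (assuming mpi_f08; a real flattener would also have to interpret the integer/address arrays for each combiner and free non-predefined inner handles, which is where the tedium lies):

  recursive subroutine walk_type(dt, depth)
    use mpi_f08
    implicit none
    type(MPI_Datatype), intent(in) :: dt
    integer, intent(in) :: depth
    integer :: ni, na, nd, combiner, i
    integer, allocatable :: ints(:)
    integer(kind=MPI_ADDRESS_KIND), allocatable :: addrs(:)
    type(MPI_Datatype), allocatable :: dts(:)

    call MPI_Type_get_envelope(dt, ni, na, nd, combiner)
    if (combiner == MPI_COMBINER_NAMED) return   ! predefined type: recursion stops

    allocate(ints(ni), addrs(na), dts(nd))
    call MPI_Type_get_contents(dt, ni, na, nd, ints, addrs, dts)
    ! A real implementation would switch on the combiner (vector, hvector,
    ! subarray, struct, ...) here and fold ints/addrs into an explicit typemap.
    do i = 1, nd
      call walk_type(dts(i), depth + 1)
    end do
    deallocate(ints, addrs, dts)
  end subroutine walk_type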

@jeffhammond (Member Author) commented Jan 7, 2023

If you want this datatype routine, you can just use MPI_Type_create_subarray. It's better because it can capture the entire layout. The (h)vector approach is relative to the subarray start, not the array start, which means it's not possible to create a subarray datatype that corresponds to the parent array.
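
A sketch of the difference for a 2-D block slice (assuming mpi_f08; the shapes are arbitrary): the subarray type records where the slice sits inside the parent array, so the parent itself is the buffer argument, whereas the vector equivalent only describes the layout relative to the slice's first element.

  use mpi_f08
  type(MPI_Datatype) :: sub_t, vec_t
  integer :: A(10, 10)

  ! Describes A(3:6, 2:5) in terms of the parent array: sizes, subsizes, and
  ! zero-based starts. The whole parent array A is then passed as the buffer.
  call MPI_Type_create_subarray(2, [10, 10], [4, 4], [2, 1], &
                                MPI_ORDER_FORTRAN, MPI_INTEGER, sub_t)
  call MPI_Type_commit(sub_t)
  call MPI_Send(A, 1, sub_t, 0, 0, MPI_COMM_WORLD)

  ! The vector type describes the same element pattern, but anchored at the
  ! slice start: the buffer argument would have to be A(3, 2), and the slice's
  ! offset within the parent array is no longer part of the datatype.
  call MPI_Type_vector(4, 4, 10, MPI_INTEGER, vec_t)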

At the very least, I should write advice to users about this in the Fortran chapter.

@jeffhammond (Member Author) commented Jan 8, 2023

It seems that MPICH already has what I want: pmodels/mpich#6139

@hzhou is there any plan to propose to standardize this in MPI-5? It would be really useful.

Reductions with different subarray layouts and user-defined ops are still impossible though.

@hzhou commented Jan 9, 2023

It seems that MPICH already has what I want: pmodels/mpich#6139

@hzhou is there any plan to propose to standardize this in MPI-5? It would be really useful.

Yes, we added the extension believing it is useful for applications to have more direct access to datatypes. But before making a formal proposal, we'd like to invite users to try it and collect feedback. You are very welcome to try the new API, and please send us your field notes.

@jeffhammond (Member Author)

@hzhou as you probably know, I have found it incredibly useful in Vapaa, and it would be great to have this in MPI-Next-Next.

@jeffhammond (Member Author)

From 6.9:

General datatypes may be passed to the user function. However, use of datatypes that are not contiguous is likely to lead to inefficiencies.

I have convinced myself that it is valid to do the dumbest possible implementation that loops over the noncontiguous inputs element-by-element.
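
A sketch of that local combine step (assuming integer data and MPI_SUM, with the base indices and element strides of the two buffers already extracted from their descriptors; all names are illustrative):

  subroutine elementwise_sum(send, send_stride, recv, recv_stride, count)
    implicit none
    integer, intent(in)    :: send(*), send_stride, recv_stride, count
    integer, intent(inout) :: recv(*)
    integer :: i

    do i = 0, count - 1
      ! Apply the reduction operation one element at a time, following each
      ! buffer's own stride. Inefficient, but valid per the text quoted above.
      recv(1 + i*recv_stride) = recv(1 + i*recv_stride) + send(1 + i*send_stride)
    end do
  end subroutine elementwise_sum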

Doing this element by element is much easier with MPICH's MPIX_Iov than with the MPI datatypes engine, so I want this issue to stay open to capture the need for that capability to be standardized.
