misc fixes for heterogeneous cluster support #2940
Force-pushed from 3bfdb11 to 0dbd4cf.
opal/datatype/opal_convertor.c (outdated)
@@ -470,7 +470,8 @@ int32_t opal_convertor_set_position_nocheck( opal_convertor_t* convertor,
} \
} \
convertor->remote_size *= convertor->count; \
convertor->use_desc = &(datatype->desc); \
At this point, the convertor's use_desc points to the optimized description, which carries no datatype information (i.e., it is not suitable for heterogeneous operations). I do think that forcing the switch to the default description (which contains the datatype information) is the right thing here. However, I would remove this line from the OPAL_CONVERTOR_COMPUTE_REMOTE_SIZE macro and instead put it right in the if( ((convertor->flags &... in the OPAL_CONVERTOR_PREPARE macro (line 529).
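A minimal sketch of the placement suggested above, using heavily simplified stand-ins for the OPAL structures (`desc_t`, `datatype_t`, `convertor_t`, `CONV_HOMOGENEOUS`, `compute_remote_size` and `convertor_prepare` are hypothetical names, not OPAL's API). The choice between the optimized and the full description is made once, in the prepare step, and the remote-size computation no longer touches `use_desc`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-ins for the OPAL structures. */
typedef struct { const char *name; } desc_t;

typedef struct {
    desc_t desc;        /* full description: keeps per-element type info */
    desc_t opt_desc;    /* optimized description: type info merged away */
    uint32_t bdt_used;  /* bitmask of the basic datatypes used */
} datatype_t;

#define CONV_HOMOGENEOUS 0x1u   /* hypothetical flag value */

typedef struct {
    uint32_t flags;
    uint32_t hetero_mask;       /* basic types whose representation differs remotely */
    const datatype_t *pDesc;
    const desc_t *use_desc;
} convertor_t;

/* Only computes the remote size; it no longer switches use_desc. */
static void compute_remote_size(convertor_t *c)
{
    (void)c;  /* size computation elided in this sketch */
}

/* Prepare step: the description is selected here, in one place. */
static void convertor_prepare(convertor_t *c, const datatype_t *dt)
{
    c->pDesc = dt;
    c->use_desc = &dt->opt_desc;   /* default: fast, optimized description */
    c->flags |= CONV_HOMOGENEOUS;
    if (dt->bdt_used & c->hetero_mask) {
        /* Heterogeneous peer: the conversion needs the type information
         * that only the full description carries. */
        c->flags &= ~CONV_HOMOGENEOUS;
        c->use_desc = &dt->desc;
    }
    compute_remote_size(c);
}
```

Keeping the switch out of the size macro means the macro has a single responsibility, and the homogeneous fast path never pays for the heterogeneous decision more than once.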
SS Botany Bay OS-X system doesn't like this PR:
Force-pushed from 27c3f87 to c05a307.
Force-pushed from c05a307 to 9854c55.
:bot:mellanox:retest
The IBM CI (PGI Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/8281105948edfccf8ea59af74273bbb1
Force-pushed from 9854c55 to 056067b.
The IBM CI (GNU Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/28eea78fe6bd6968a2b5c9bc31d2b848
Force-pushed from 056067b to 1d94e52.
The IBM CI (GNU Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/35006c2fefb19f75a7f9b01101d61267
The IBM CI (XL Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/0cabf9fbf21623f5d78e0576a6a69de7
Force-pushed from 1d94e52 to 7ab13a5.
Force-pushed from 7ab13a5 to 3279f10.
@bosilca I updated the PR (there were quite a few changes in datatype handling). Could you please give it a final review before I merge?
This information should already be in the convertor flags (CONVERTOR_SEND).
long double*to = (long double *) to_p;

for (i=0; i<count; i++, to++) {
if ((opal_local_arch&OPAL_ARCH_LDISINTEL) && !(remoteArch&OPAL_ARCH_LDISINTEL)) {
This condition is loop-invariant; you might get better results by hoisting it out of the loop.
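The reviewer's point can be illustrated with a small sketch. `ARCH_LDISINTEL` is a hypothetical stand-in for `OPAL_ARCH_LDISINTEL`, and `convert_ld` is a placeholder that merely negates the value so its effect is observable; it is not the real long double format conversion. The architecture test does not depend on `i`, so it can be evaluated once before the loop.

```c
#include <stddef.h>
#include <stdint.h>

#define ARCH_LDISINTEL 0x1u  /* hypothetical stand-in for OPAL_ARCH_LDISINTEL */

/* Placeholder for the real long double format conversion: it only negates
 * the value so that the effect is observable. */
static long double convert_ld(long double v) { return -v; }

/* Before: the architecture test is re-evaluated on every iteration. */
static void copy_ld_naive(long double *to, const long double *from, size_t count,
                          uint32_t local_arch, uint32_t remote_arch)
{
    for (size_t i = 0; i < count; i++) {
        if ((local_arch & ARCH_LDISINTEL) && !(remote_arch & ARCH_LDISINTEL))
            to[i] = convert_ld(from[i]);
        else
            to[i] = from[i];
    }
}

/* After: the test does not depend on i, so it is evaluated once. */
static void copy_ld_hoisted(long double *to, const long double *from, size_t count,
                            uint32_t local_arch, uint32_t remote_arch)
{
    const int need_conversion = (local_arch & ARCH_LDISINTEL) &&
                                !(remote_arch & ARCH_LDISINTEL);
    if (need_conversion) {
        for (size_t i = 0; i < count; i++)
            to[i] = convert_ld(from[i]);
    } else {
        for (size_t i = 0; i < count; i++)
            to[i] = from[i];
    }
}
```

Optimizing compilers may unswitch such a loop on their own, but hoisting by hand makes the intent explicit and does not depend on the optimization level.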
@@ -85,9 +143,15 @@ copy_##TYPENAME##_heterogeneous(opal_convertor_t *pConvertor, uint32_t count,
(opal_local_arch & OPAL_ARCH_ISBIGENDIAN)) { \
if( (to_extent == from_extent) && (to_extent == sizeof(TYPE)) ) { \
opal_dt_swap_bytes(to, from, sizeof(TYPE), count); \
if (LONG_DOUBLE) { \
What is LONG_DOUBLE? Why do you execute the two swaps?
I missed the LONG_DOUBLE argument of the macro.
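One plausible reading of the exchange above, as a sketch: assuming the `LONG_DOUBLE` macro argument is a per-instantiation constant (0 or 1) that enables an extra representation fixup on top of the byte swap, the branch is resolved at compile time and costs nothing for the other types. `DEFINE_COPY_HETERO`, `swap_bytes`, `ld_format_fixup` and `ld_fixups` are hypothetical; the fixup here only counts calls so the effect of the flag can be observed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reverse the bytes of each `size`-byte element in place. */
static void swap_bytes(void *buf, size_t size, size_t count)
{
    unsigned char *p = buf;
    for (size_t n = 0; n < count; n++, p += size)
        for (size_t i = 0; i < size / 2; i++) {
            unsigned char t = p[i];
            p[i] = p[size - 1 - i];
            p[size - 1 - i] = t;
        }
}

/* Hypothetical hook for the extra long double format fixup; it only counts
 * invocations so the effect of the flag is observable. */
static int ld_fixups = 0;
static void ld_format_fixup(void *buf, size_t count) { (void)buf; ld_fixups += (int)count; }

/* LONG_DOUBLE is a compile-time constant per instantiation, so the compiler
 * can drop the dead branch for every non-long-double type. */
#define DEFINE_COPY_HETERO(TYPENAME, TYPE, LONG_DOUBLE)                 \
    static void copy_##TYPENAME##_hetero(TYPE *to, const TYPE *from,    \
                                         size_t count)                  \
    {                                                                   \
        memcpy(to, from, count * sizeof(TYPE));                         \
        swap_bytes(to, sizeof(TYPE), count);  /* byte order */          \
        if (LONG_DOUBLE) {                                              \
            ld_format_fixup(to, count);       /* representation */      \
        }                                                               \
    }

DEFINE_COPY_HETERO(uint32, uint32_t, 0)
DEFINE_COPY_HETERO(long_double, long double, 1)
```

Under this reading, the two operations address different problems: the swap fixes byte order, while the second step adapts the long double representation itself.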
@@ -122,11 +189,17 @@ copy_##TYPENAME##_heterogeneous(opal_convertor_t *pConvertor, uint32_t count,
\
if ((pConvertor->remoteArch & OPAL_ARCH_ISBIGENDIAN) != \
(opal_local_arch & OPAL_ARCH_ISBIGENDIAN)) { \
if( (to_extent == from_extent) && (to_extent == sizeof(TYPE)) ) { \
if( (to_extent == from_extent) && (to_extent == (2 * sizeof(TYPE))) ) { \
Nice catch!
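Why the new extent check matters, as a sketch: assuming `TYPE` names one component of a complex type (e.g. `float` for a single-precision complex), a densely packed element spans `2 * sizeof(TYPE)` bytes, so comparing the extent against `sizeof(TYPE)` never selects the contiguous path. Each component must also be byte-swapped on its own, since swapping the whole element as one unit would exchange the real and imaginary parts. `copy_complex_hetero` and `swap_bytes` are hypothetical helpers, and `uint32_t` stands in for the component type so results can be checked bit for bit.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reverse the bytes of each `size`-byte item in place. */
static void swap_bytes(void *buf, size_t size, size_t count)
{
    unsigned char *p = buf;
    for (size_t n = 0; n < count; n++, p += size)
        for (size_t i = 0; i < size / 2; i++) {
            unsigned char t = p[i];
            p[i] = p[size - 1 - i];
            p[size - 1 - i] = t;
        }
}

/* Copy `count` complex elements whose components are `comp_size` bytes each,
 * converting the byte order of every component. */
static void copy_complex_hetero(void *to, const void *from, size_t count,
                                size_t comp_size,
                                size_t to_extent, size_t from_extent)
{
    if (to_extent == from_extent && to_extent == 2 * comp_size) {
        /* Densely packed: the buffer is just 2*count components in a row. */
        memcpy(to, from, count * 2 * comp_size);
        swap_bytes(to, comp_size, 2 * count);
    } else {
        /* Padded or strided: copy one element (two components) at a time. */
        unsigned char *t = to;
        const unsigned char *f = from;
        for (size_t i = 0; i < count; i++, t += to_extent, f += from_extent) {
            memcpy(t, f, 2 * comp_size);
            swap_bytes(t, comp_size, 2);
        }
    }
}
```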
test/datatype/Makefile.am (outdated)
@@ -46,6 +46,10 @@ ddt_pack_LDADD = \
$(top_builddir)/ompi/lib@[email protected] \
$(top_builddir)/opal/lib@[email protected]

ddt_pack_hetero_SOURCES = ddt_pack_hetero.c
These two tests should only be compiled if heterogeneous support is enabled for the build.
opal/datatype/opal_convertor.c (outdated)
{
opal_datatype_t* datatype = (opal_datatype_t*)pConvertor->pDesc;

pConvertor->remote_size = pConvertor->local_size;
if( OPAL_UNLIKELY(datatype->bdt_used & pConvertor->master->hetero_mask) ) {
pConvertor->flags &= (~CONVERTOR_HOMOGENEOUS);
pConvertor->use_desc = &(datatype->desc);
if (!(send && pConvertor->flags & OPAL_DATATYPE_FLAG_CONTIGUOUS)) {
This information should already be in the convertor flags (CONVERTOR_SEND).
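A sketch of the suggestion, with hypothetical flag values and function names (`CONV_SEND` and `CONV_CONTIGUOUS` stand in for the real `CONVERTOR_SEND` and `OPAL_DATATYPE_FLAG_CONTIGUOUS` constants): the `send` argument in the condition above duplicates state the convertor already records, so the same test can be derived from its flags alone. The explicit parentheses around the bitwise test are only for readability; `&` already binds tighter than `&&`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values; the real CONVERTOR_SEND and
 * OPAL_DATATYPE_FLAG_CONTIGUOUS constants are defined by OPAL. */
#define CONV_SEND       0x1u
#define CONV_CONTIGUOUS 0x2u

/* Before: the direction is passed separately, duplicating state. */
static bool guard_param(uint32_t flags, bool send)
{
    return !(send && (flags & CONV_CONTIGUOUS));
}

/* After: the direction is read from the convertor's own flags. */
static bool guard_from_flags(uint32_t flags)
{
    const bool send = (flags & CONV_SEND) != 0;
    return !(send && (flags & CONV_CONTIGUOUS));
}
```

Reading the direction from the flags removes one parameter and the risk of a caller passing a value that disagrees with how the convertor was prepared.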
Force-pushed from 3279f10 to 1d4342e.
@bosilca I made the requested changes.
One last question. I noticed you changed the prototypes of the PMPI functions in ompi/mpi/fortran/mpif-h/prototypes_mpi.h. Does this change break our ABI?
test/datatype/unpack_hetero.c (outdated)
@@ -1,6 +1,6 @@
/* -*- Mode: C; c-basic-offset:4 ; -*- */
/*
* Copyright (c) 2014-2016 Research Organization for Information Science
* Copyright (c) 2014-2017 Research Organization for Information Science
Did you change anything in this file?
I will double-check that; maybe there used to be a fix that has since been merged or made obsolete by a revamp.
@bosilca, the changes involve the prototype of a function pointer.
I'd rather go with the latter option.
We've been extremely careful not to break the ABI in the middle of a series. I would also tend to go with your latter option, but the RM should be aware of the possible ABI divergence.
Got it, I will fix the unnecessary copyright change pointed out by @jsquyres and merge this tomorrow into
Force-pushed from 1d4342e to fd413e6.
This reverts commit open-mpi/ompi@8e25733. Signed-off-by: Gilles Gouaillardet <[email protected]>
so no conversion is required when heterogeneous mode is enabled Signed-off-by: Gilles Gouaillardet <[email protected]>
Signed-off-by: Gilles Gouaillardet <[email protected]>
Signed-off-by: Gilles Gouaillardet <[email protected]>
we now have 12 cases to deal with (4 writers and 3 readers):
1. C `void*` is written into the attribute value, and the value is read into a C `void*` (unity)
2. C `void*` is written, Fortran `INTEGER` is read
3. C `void*` is written, Fortran `INTEGER(KIND=MPI_ADDRESS_KIND)` is read
4. Fortran `INTEGER` is written, C `void*` is read
5. Fortran `INTEGER` is written, Fortran `INTEGER` is read (unity)
6. Fortran `INTEGER` is written, Fortran `INTEGER(KIND=MPI_ADDRESS_KIND)` is read
7. Fortran `INTEGER(KIND=MPI_ADDRESS_KIND)` is written, C `void*` is read
8. Fortran `INTEGER(KIND=MPI_ADDRESS_KIND)` is written, Fortran `INTEGER` is read
9. Fortran `INTEGER(KIND=MPI_ADDRESS_KIND)` is written, Fortran `INTEGER(KIND=MPI_ADDRESS_KIND)` is read (unity)
10. Intrinsic is written, C `void*` is read
11. Intrinsic is written, Fortran `INTEGER` is read
12. Intrinsic is written, Fortran `INTEGER(KIND=MPI_ADDRESS_KIND)` is read

MPI-2 Fortran "integer representation" has type `INTEGER(KIND=MPI_ADDRESS_KIND)`, as clarified at mpiwg-rma/rma-issues#1.
Signed-off-by: Gilles Gouaillardet <[email protected]>
between ieee 754 quadruple precision and extended precision formats. Signed-off-by: Gilles Gouaillardet <[email protected]>
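The commit above converts between the IEEE 754 quadruple-precision (binary128) and x87 extended-precision formats. A minimal sketch of such a conversion, assuming both values have already been unpacked into host-order integer fields (`ext80_t`, `bin128_t`, `ext80_to_bin128` and `bin128_to_ext80` are hypothetical, not Open MPI's implementation). Both formats use a 15-bit exponent with bias 16383, so sign and exponent carry over unchanged; only the significand layout differs: the extended format stores the integer bit explicitly next to a 63-bit fraction, while binary128 keeps it implicit and has 112 fraction bits.

```c
#include <stdint.h>

/* x87 80-bit extended precision, unpacked into its two natural fields. */
typedef struct {
    uint16_t se;    /* sign (bit 15) and 15-bit biased exponent */
    uint64_t mant;  /* explicit integer bit (bit 63) + 63-bit fraction */
} ext80_t;

/* IEEE 754 binary128, split into two 64-bit halves. */
typedef struct {
    uint64_t hi;    /* sign (bit 63), 15-bit exponent, top 48 fraction bits */
    uint64_t lo;    /* low 64 fraction bits */
} bin128_t;

/* Widening: the 63 fraction bits are left-aligned in the 112-bit field. */
static bin128_t ext80_to_bin128(ext80_t x)
{
    uint64_t frac63 = x.mant & UINT64_C(0x7FFFFFFFFFFFFFFF); /* drop integer bit */
    bin128_t q;
    q.hi = ((uint64_t)x.se << 48) | (frac63 >> 15);   /* top 48 fraction bits */
    q.lo = (frac63 & UINT64_C(0x7FFF)) << 49;          /* remaining 15 bits */
    return q;
}

/* Narrowing: the low 49 fraction bits are truncated (no rounding). */
static ext80_t bin128_to_ext80(bin128_t q)
{
    ext80_t x;
    x.se = (uint16_t)(q.hi >> 48);
    uint64_t frac63 = ((q.hi & UINT64_C(0xFFFFFFFFFFFF)) << 15) | (q.lo >> 49);
    /* The integer bit is implicit in binary128: 1 for normals, infinities
     * and NaNs, 0 for zero and subnormals (exponent field 0). */
    uint64_t int_bit = (x.se & 0x7FFF) != 0 ? UINT64_C(1) << 63 : 0;
    x.mant = int_bit | frac63;
    return x;
}
```

Widening is exact for canonical encodings; narrowing truncates instead of rounding, and x87 pseudo-denormals or unnormals are not treated specially. A production version would have to decide on both points and also handle byte order when unpacking the raw buffers.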
Signed-off-by: Gilles Gouaillardet <[email protected]>
Signed-off-by: Gilles Gouaillardet <[email protected]>
Force-pushed from fd413e6 to 7a866f7.
Refs. #2838