Skip to content

misc fixes for heterogeneous cluster support #2940

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Jul 12, 2017

Conversation

ggouaillardet
Copy link
Contributor

Refs. #2838

@@ -470,7 +470,8 @@ int32_t opal_convertor_set_position_nocheck( opal_convertor_t* convertor,
} \
} \
convertor->remote_size *= convertor->count; \
convertor->use_desc = &(datatype->desc); \
Copy link
Member

@bosilca bosilca Feb 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point the convertor use_desc points to the optimized description, which is a description without datatype information (aka. not suitable for heterogeneous operations). I do think that forcing the switch to the default description (which contains the datatype information) is the right thing here. However, I would remove this line from the OPAL_CONVERTOR_COMPUTE_REMOTE_SIZE macro and instead put it right in the if( ((convertor->flags &... in the OPAL_CONVERTOR_PREPARE macro (line 529).

@hppritcha
Copy link
Member

SS Botany Bay OS-X system doesn't like this PR:

08:36:37   CC       opal_copy_functions_heterogeneous.lo
08:36:37 opal_copy_functions_heterogeneous.c:18:10: fatal error: 'ieee754.h' file not found
08:36:37 #include <ieee754.h>
08:36:37          ^
08:36:37 1 error generated.
08:36:37 make[2]: *** [opal_copy_functions_heterogeneous.lo] Error 1
08:36:37 make[1]: *** [install-recursive] Error 1

@ggouaillardet ggouaillardet force-pushed the topic/hetero_fixes branch 3 times, most recently from 27c3f87 to c05a307 Compare February 10, 2017 01:33
@ggouaillardet
Copy link
Contributor Author

:bot:mellanox:retest

@ibm-ompi
Copy link

The IBM CI (PGI Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/8281105948edfccf8ea59af74273bbb1

@ibm-ompi
Copy link

The IBM CI (GNU Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/28eea78fe6bd6968a2b5c9bc31d2b848

@ibm-ompi
Copy link

The IBM CI (GNU Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/35006c2fefb19f75a7f9b01101d61267

@ibm-ompi
Copy link

The IBM CI (XL Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/0cabf9fbf21623f5d78e0576a6a69de7

@ggouaillardet
Copy link
Contributor Author

@bosilca i updated the PR (there were quite some changes in datatype handling)

could you please give it a final review before i merge ?

Copy link
Member

@bosilca bosilca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This information should already be in the convertor flags (CONVERTOR_SEND).

long double*to = (long double *) to_p;

for (i=0; i<count; i++, to++) {
if ((opal_local_arch&OPAL_ARCH_LDISINTEL) && !(remoteArch&OPAL_ARCH_LDISINTEL)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an invariant of the loop, you might get better results if you move it out of the loop.

@@ -85,9 +143,15 @@ copy_##TYPENAME##_heterogeneous(opal_convertor_t *pConvertor, uint32_t count,
(opal_local_arch & OPAL_ARCH_ISBIGENDIAN)) { \
if( (to_extent == from_extent) && (to_extent == sizeof(TYPE)) ) { \
opal_dt_swap_bytes(to, from, sizeof(TYPE), count); \
if (LONG_DOUBLE) { \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is LONG_DOUBLE ? Why do you execute the 2 swaps ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed the LONG_DOUBLE argument of the macro.

@@ -122,11 +189,17 @@ copy_##TYPENAME##_heterogeneous(opal_convertor_t *pConvertor, uint32_t count,
\
if ((pConvertor->remoteArch & OPAL_ARCH_ISBIGENDIAN) != \
(opal_local_arch & OPAL_ARCH_ISBIGENDIAN)) { \
if( (to_extent == from_extent) && (to_extent == sizeof(TYPE)) ) { \
if( (to_extent == from_extent) && (to_extent == (2 * sizeof(TYPE))) ) { \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch !

@@ -46,6 +46,10 @@ ddt_pack_LDADD = \
$(top_builddir)/ompi/lib@[email protected] \
$(top_builddir)/opal/lib@[email protected]

ddt_pack_hetero_SOURCES = ddt_pack_hetero.c
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These 2 tests should only be compiled in if the heterogeneous supports is enabled for the build.

{
opal_datatype_t* datatype = (opal_datatype_t*)pConvertor->pDesc;

pConvertor->remote_size = pConvertor->local_size;
if( OPAL_UNLIKELY(datatype->bdt_used & pConvertor->master->hetero_mask) ) {
pConvertor->flags &= (~CONVERTOR_HOMOGENEOUS);
pConvertor->use_desc = &(datatype->desc);
if (!(send && pConvertor->flags & OPAL_DATATYPE_FLAG_CONTIGUOUS)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This information should already be in the convertor flags (CONVERTOR_SEND).

@ggouaillardet
Copy link
Contributor Author

@bosilca i made the requested changes.
the ddt_pack_hetero test has been removed since this PR was created, and unpack_hetero works regardless --enable-heterogeneous is set or not

@bosilca
Copy link
Member

bosilca commented Jul 10, 2017

One last question. I noticed you changed the prototypes of the PMPI functions in ompi/mpi/fortran/mpif-h/prototypes_mpi.h. Does this change breaks our ABI ?

@@ -1,6 +1,6 @@
/* -*- Mode: C; c-basic-offset:4 ; -*- */
/*
* Copyright (c) 2014-2016 Research Organization for Information Science
* Copyright (c) 2014-2017 Research Organization for Information Science
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you change anything in this file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i will double check that, maybe there used to be a fix that has already been merged or made obsolete by a revamp.

@ggouaillardet
Copy link
Contributor Author

@bosilca, the changes involves the prototype of a function pointer.
so there are several ways to see this.

  1. this breaks ABI
  2. this does not break ABI (the pointer size did not change)
  3. current implementation is broken anyway, so this really a bug fix, and since the pointer size did not change, this is an acceptable commit

i'd rather go with the latter option.

@bosilca
Copy link
Member

bosilca commented Jul 11, 2017

We've been extremely careful not to break the ABI in the middle of a series. I would also tend to go with your latter option, but the RM should be aware of the possible ABI divergence.

@ggouaillardet
Copy link
Contributor Author

got it, i will fix the unnecessary copyright change pointed by @jsquyres and merge this tomorrow into master. then i will PR to v3.0.x so we will hopefully have this ready for v3.0.0
then i will think if these changes are needed for the v2 branches

so no conversion is required when heterogeneous mode is enabled

Signed-off-by: Gilles Gouaillardet <[email protected]>
we now have 12 cases to deal (4 writers and 3 readers) :

1. C `void*` is written into the attribute value, and the value is read into a C `void*` (unity)
2. C `void*` is written, Fortran `INTEGER` is read
3. C `void*` is written, Fortran `INTEGER(KIND=MPI_ADDRESS_KIND)` is read
4. Fortran `INTEGER` is written, C `void*` is read
5. Fortran `INTEGER` is written, Fortran `INTEGER` is read (unity)
6. Fortran `INTEGER` is written, Fortran `INTEGER(KIND=MPI_ADDRESS_KIND)` is read
7. Fortran `INTEGER(KIND=MPI_ADDRESS_KIND)` is written, C `void*` is read
8. Fortran `INTEGER(KIND=MPI_ADDRESS_KIND)` is written, Fortran `INTEGER` is read
9. Fortran `INTEGER(KIND=MPI_ADDRESS_KIND)` is written, Fortran `INTEGER(KIND=MPI_ADDRESS_KIND)` is read (unity)
10. Intrinsic is written, C `void*` is read
11. Intrinsic is written, Fortran `INTEGER` is read
12. Intrinsic is written, Fortran `INTEGER(KIND=MPI_ADDRESS_KIND)` is read

MPI-2 Fortran "integer representation" has type `INTEGER(KIND=MPI_ADDRESS_KIND)` as clarified
at mpiwg-rma/rma-issues#1

Signed-off-by: Gilles Gouaillardet <[email protected]>
between ieee 754 quadruple precision and extended precision formats.

Signed-off-by: Gilles Gouaillardet <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants