misc fixes for heterogeneous cluster support #2940

ggouaillardet · 2017-02-08T08:25:38Z

bosilca · 2017-02-09T05:47:23Z

opal/datatype/opal_convertor.c

@@ -470,7 +470,8 @@ int32_t opal_convertor_set_position_nocheck( opal_convertor_t* convertor,
            }                                                             \
        }                                                                 \
        convertor->remote_size *= convertor->count;                       \
-        convertor->use_desc = &(datatype->desc);                          \


At this point the convertor's use_desc points to the optimized description, which carries no datatype information (i.e., it is not suitable for heterogeneous operations). I do think that forcing the switch to the default description (which contains the datatype information) is the right thing here. However, I would remove this line from the OPAL_CONVERTOR_COMPUTE_REMOTE_SIZE macro and instead put it right in the if( ((convertor->flags &... in the OPAL_CONVERTOR_PREPARE macro (line 529).

hppritcha · 2017-02-09T16:19:10Z

SS Botany Bay OS-X system doesn't like this PR:

08:36:37   CC       opal_copy_functions_heterogeneous.lo
08:36:37 opal_copy_functions_heterogeneous.c:18:10: fatal error: 'ieee754.h' file not found
08:36:37 #include <ieee754.h>
08:36:37          ^
08:36:37 1 error generated.
08:36:37 make[2]: *** [opal_copy_functions_heterogeneous.lo] Error 1
08:36:37 make[1]: *** [install-recursive] Error 1
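
The failure above comes from an unconditional include of a header that OS X does not ship. A minimal sketch of one way to guard it follows. It assumes that configure checks for the header and defines `HAVE_IEEE754_H` (the name `AC_CHECK_HEADERS([ieee754.h])` would produce); the `HETERO_HAVE_IEEE754` macro and the helper function are hypothetical names used only for illustration, not what the PR actually does.

```c
#include <assert.h>

/* HAVE_IEEE754_H is assumed to be provided by configure; it is not defined here. */
#if defined(HAVE_IEEE754_H)
#include <ieee754.h>
#define HETERO_HAVE_IEEE754 1
#else
#define HETERO_HAVE_IEEE754 0
#endif

/* Hypothetical helper: tells callers whether the code path that relies on
 * <ieee754.h> was compiled in, so they can fall back otherwise. */
static int hetero_long_double_conversion_available(void)
{
    return HETERO_HAVE_IEEE754;
}
```

With a guard like this the file still compiles where the header is missing, and the long double conversion can be skipped or replaced by a fallback at run time.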

ggouaillardet · 2017-02-20T15:08:13Z

:bot:mellanox:retest

ibm-ompi · 2017-04-11T07:16:41Z

The IBM CI (PGI Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/8281105948edfccf8ea59af74273bbb1

ibm-ompi · 2017-04-11T08:17:53Z

The IBM CI (GNU Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/28eea78fe6bd6968a2b5c9bc31d2b848

ibm-ompi · 2017-04-11T08:33:21Z

The IBM CI (GNU Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/35006c2fefb19f75a7f9b01101d61267

ibm-ompi · 2017-04-11T09:34:17Z

The IBM CI (XL Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/0cabf9fbf21623f5d78e0576a6a69de7

ggouaillardet · 2017-07-07T07:20:17Z

@bosilca i updated the PR (there were quite a few changes in datatype handling)

could you please give it a final review before i merge ?

bosilca · 2017-02-09T05:50:06Z

opal/datatype/opal_copy_functions_heterogeneous.c

+    long double*to = (long double *) to_p;
+
+    for (i=0; i<count; i++, to++) {
+        if ((opal_local_arch&OPAL_ARCH_LDISINTEL) && !(remoteArch&OPAL_ARCH_LDISINTEL)) {


This is a loop invariant; you might get better results if you move it out of the loop.
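
As an illustration of the suggestion, here is a minimal sketch of hoisting the architecture test out of the loop. `ARCH_LDISINTEL` stands in for `OPAL_ARCH_LDISINTEL`, and `to_intel` / `from_intel` are hypothetical placeholder conversions; the reverse branch is an assumption, since only the first condition appears in the diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag; stands in for OPAL_ARCH_LDISINTEL. */
#define ARCH_LDISINTEL 0x1u

/* Placeholder per-element conversions, not the real long double logic. */
static void to_intel(long double *v)   { *v = *v + 1.0L; }
static void from_intel(long double *v) { *v = *v - 1.0L; }

static void convert_all(long double *to, size_t count,
                        uint32_t local_arch, uint32_t remote_arch)
{
    /* Both tests depend only on the two arch words, so evaluate them once. */
    const int local_intel  = (local_arch  & ARCH_LDISINTEL) != 0;
    const int remote_intel = (remote_arch & ARCH_LDISINTEL) != 0;
    size_t i;

    if (local_intel && !remote_intel) {
        for (i = 0; i < count; i++) {
            to_intel(&to[i]);
        }
    } else if (!local_intel && remote_intel) {
        /* Assumed reverse case; not visible in the quoted hunk. */
        for (i = 0; i < count; i++) {
            from_intel(&to[i]);
        }
    }
    /* Same layout on both sides: nothing to do. */
}
```

A compiler may perform this transformation on its own, but writing it explicitly keeps the per-element loop free of branches regardless of optimization level.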

bosilca · 2017-02-09T05:53:15Z

opal/datatype/opal_copy_functions_heterogeneous.c

@@ -85,9 +143,15 @@ copy_##TYPENAME##_heterogeneous(opal_convertor_t *pConvertor, uint32_t count,
        (opal_local_arch & OPAL_ARCH_ISBIGENDIAN)) {                    \
        if( (to_extent == from_extent) && (to_extent == sizeof(TYPE)) ) { \
            opal_dt_swap_bytes(to, from, sizeof(TYPE), count);          \
+            if (LONG_DOUBLE) {                                          \


what is LONG_DOUBLE ? Why do you execute the 2 swaps ?

I missed the LONG_DOUBLE argument of the macro.
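
For readers unfamiliar with the first of the two swaps, here is a minimal sketch of an element-wise byte reversal in the spirit of `opal_dt_swap_bytes(to, from, size, count)`. The function name `swap_bytes_sketch` is hypothetical and this is not the OPAL implementation; it assumes the source and destination buffers do not overlap.

```c
#include <assert.h>
#include <stddef.h>

/* Reverse the byte order of each of `count` elements of `size` bytes.
 * `to_p` and `from_p` must not overlap. */
static void swap_bytes_sketch(void *to_p, const void *from_p,
                              size_t size, size_t count)
{
    unsigned char *to = (unsigned char *)to_p;
    const unsigned char *from = (const unsigned char *)from_p;
    size_t i, j;

    for (i = 0; i < count; i++) {
        for (j = 0; j < size; j++) {
            to[size - 1 - j] = from[j];
        }
        to += size;
        from += size;
    }
}
```

A plain reversal like this handles endianness only; a type whose bit layout differs between the two architectures, as long double can, needs an additional conversion step, which is what the `LONG_DOUBLE` macro argument selects.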

bosilca · 2017-02-09T05:56:11Z

opal/datatype/opal_copy_functions_heterogeneous.c

@@ -122,11 +189,17 @@ copy_##TYPENAME##_heterogeneous(opal_convertor_t *pConvertor, uint32_t count,
                                                                        \
    if ((pConvertor->remoteArch & OPAL_ARCH_ISBIGENDIAN) !=             \
        (opal_local_arch & OPAL_ARCH_ISBIGENDIAN)) {                    \
-        if( (to_extent == from_extent) && (to_extent == sizeof(TYPE)) ) { \
+        if( (to_extent == from_extent) && (to_extent == (2 * sizeof(TYPE))) ) { \


nice catch !
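
The fix rests on the fact that a C complex value is laid out as two consecutive values of the corresponding real type, so its extent is twice `sizeof(TYPE)`. A minimal sketch of the corrected check, with a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper mirroring the corrected condition: the fast path applies
 * when both extents are equal and match a pair of real values. */
static int complex_extent_matches(size_t to_extent, size_t from_extent,
                                  size_t real_size)
{
    return (to_extent == from_extent) && (to_extent == 2 * real_size);
}
```

With the old comparison against `sizeof(TYPE)`, a packed complex element could never satisfy the test, so the fast path was never taken for complex types.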

bosilca · 2017-02-09T05:57:35Z

test/datatype/Makefile.am

@@ -46,6 +46,10 @@ ddt_pack_LDADD = \
        $(top_builddir)/ompi/lib@[email protected] \
        $(top_builddir)/opal/lib@[email protected]

+ddt_pack_hetero_SOURCES = ddt_pack_hetero.c


These 2 tests should only be compiled in if heterogeneous support is enabled for the build.

bosilca · 2017-07-07T15:49:57Z

opal/datatype/opal_convertor.c

 {
    opal_datatype_t* datatype = (opal_datatype_t*)pConvertor->pDesc;

    pConvertor->remote_size = pConvertor->local_size;
    if( OPAL_UNLIKELY(datatype->bdt_used & pConvertor->master->hetero_mask) ) {
        pConvertor->flags &= (~CONVERTOR_HOMOGENEOUS);
-        pConvertor->use_desc = &(datatype->desc);
+        if (!(send && pConvertor->flags & OPAL_DATATYPE_FLAG_CONTIGUOUS)) {


This information should already be in the convertor flags (CONVERTOR_SEND).
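
A minimal sketch of the idea: rather than passing a separate `send` argument, test the direction bit that the convertor already carries. The names `CONVERTOR_SEND` and `OPAL_DATATYPE_FLAG_CONTIGUOUS` come from the discussion, but the `SKETCH_*` bit values and the struct below are made up for illustration and are not the OPAL definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit values; the real ones live in the OPAL headers. */
#define SKETCH_CONVERTOR_RECV  0x00020000u
#define SKETCH_CONVERTOR_SEND  0x00040000u
#define SKETCH_FLAG_CONTIGUOUS 0x00000010u

/* Hypothetical stand-in for opal_convertor_t. */
typedef struct {
    uint32_t flags;
} sketch_convertor_t;

/* True only when the convertor is both a sender and contiguous. */
static int is_contiguous_send(const sketch_convertor_t *c)
{
    const uint32_t want = SKETCH_CONVERTOR_SEND | SKETCH_FLAG_CONTIGUOUS;
    return (c->flags & want) == want;
}
```

Comparing against the combined mask also avoids relying on the relative precedence of `&&` and `&` in a mixed expression.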

ggouaillardet · 2017-07-10T08:26:49Z

@bosilca i made the requested changes.
the ddt_pack_hetero test has been removed since this PR was created, and unpack_hetero works regardless of whether --enable-heterogeneous is set

bosilca · 2017-07-10T17:14:35Z

One last question. I noticed you changed the prototypes of the PMPI functions in ompi/mpi/fortran/mpif-h/prototypes_mpi.h. Does this change break our ABI?

jsquyres · 2017-07-11T10:45:31Z

test/datatype/unpack_hetero.c

@@ -1,6 +1,6 @@
 /* -*- Mode: C; c-basic-offset:4 ; -*- */
 /*
- * Copyright (c) 2014-2016 Research Organization for Information Science
+ * Copyright (c) 2014-2017 Research Organization for Information Science


Did you change anything in this file?

i will double check that, maybe there used to be a fix that has already been merged or made obsolete by a revamp.

ggouaillardet · 2017-07-11T11:10:53Z

@bosilca, the changes involve the prototype of a function pointer.
so there are several ways to see this:

this breaks ABI
this does not break ABI (the pointer size did not change)
the current implementation is broken anyway, so this is really a bug fix, and since the pointer size did not change, this is an acceptable commit

i'd rather go with the latter option.
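
The "pointer size did not change" argument can be illustrated with a small sketch. The two prototypes below are hypothetical and are not the ones from prototypes_mpi.h; the check relies on the fact that mainstream ABIs give all function pointer types the same size, which the C standard itself does not strictly promise.

```c
#include <assert.h>

/* Hypothetical "before" and "after" callback prototypes. */
typedef void (*old_style_fn)(int *comm, int *keyval, int *extra, int *ierr);
typedef void (*new_style_fn)(int *comm, int *keyval, long *extra, int *ierr);

/* Returns 1 when both function pointer types occupy the same storage. */
static int same_pointer_size(void)
{
    return sizeof(old_style_fn) == sizeof(new_style_fn);
}
```

Equal size means that structures and argument lists holding such a pointer keep their layout. It does not make a call through the wrong prototype valid, which is why the change can still be seen as an ABI divergence.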

bosilca · 2017-07-11T11:14:22Z

We've been extremely careful not to break the ABI in the middle of a series. I would also tend to go with your latter option, but the RM should be aware of the possible ABI divergence.

ggouaillardet · 2017-07-11T12:03:51Z

got it, i will fix the unnecessary copyright change pointed out by @jsquyres and merge this tomorrow into master. then i will PR to v3.0.x so we will hopefully have this ready for v3.0.0
then i will think about whether these changes are needed for the v2 branches

we now have 12 cases to deal with (4 writers and 3 readers):

1. C `void*` is written into the attribute value, and the value is read into a C `void*` (unity)
2. C `void*` is written, Fortran `INTEGER` is read
3. C `void*` is written, Fortran `INTEGER(KIND=MPI_ADDRESS_KIND)` is read
4. Fortran `INTEGER` is written, C `void*` is read
5. Fortran `INTEGER` is written, Fortran `INTEGER` is read (unity)
6. Fortran `INTEGER` is written, Fortran `INTEGER(KIND=MPI_ADDRESS_KIND)` is read
7. Fortran `INTEGER(KIND=MPI_ADDRESS_KIND)` is written, C `void*` is read
8. Fortran `INTEGER(KIND=MPI_ADDRESS_KIND)` is written, Fortran `INTEGER` is read
9. Fortran `INTEGER(KIND=MPI_ADDRESS_KIND)` is written, Fortran `INTEGER(KIND=MPI_ADDRESS_KIND)` is read (unity)
10. Intrinsic is written, C `void*` is read
11. Intrinsic is written, Fortran `INTEGER` is read
12. Intrinsic is written, Fortran `INTEGER(KIND=MPI_ADDRESS_KIND)` is read

MPI-2 Fortran "integer representation" has type `INTEGER(KIND=MPI_ADDRESS_KIND)` as clarified at mpiwg-rma/rma-issues#1

Signed-off-by: Gilles Gouaillardet <[email protected]>
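
The case count in this commit message can be sketched as a writer-by-reader table. The enum and function names below are hypothetical and only model the enumeration; they do not reflect how ompi/attribute stores or translates values.

```c
#include <assert.h>

/* Four ways an attribute value can be written. */
typedef enum {
    W_C_POINTER, W_FORTRAN_INTEGER, W_FORTRAN_ADDRESS, W_INTRINSIC, W_COUNT
} writer_t;

/* Three ways it can be read back. */
typedef enum {
    R_C_POINTER, R_FORTRAN_INTEGER, R_FORTRAN_ADDRESS, R_COUNT
} reader_t;

/* Unity: the value is read in the same representation it was written in.
 * An intrinsic writer has no matching reader, so it is never unity. */
static int is_unity(writer_t w, reader_t r)
{
    return w != W_INTRINSIC && (int)w == (int)r;
}

/* Walks every writer/reader pair; returns the total and reports unity pairs. */
static int count_cases(int *unity_cases)
{
    int total = 0, unity = 0, w, r;

    for (w = 0; w < W_COUNT; w++) {
        for (r = 0; r < R_COUNT; r++) {
            total++;
            unity += is_unity((writer_t)w, (reader_t)r);
        }
    }
    *unity_cases = unity;
    return total;
}
```

The nine non-unity pairs are the ones that need an explicit translation between representations.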

ggouaillardet force-pushed the topic/hetero_fixes branch from 3bfdb11 to 0dbd4cf Compare February 8, 2017 08:27

ggouaillardet mentioned this pull request Feb 8, 2017

Remove the enable-heterogeneous config option #2838

Closed

jsquyres mentioned this pull request Feb 8, 2017

remove --enable-heterogenous configury option #2802

Closed

bosilca reviewed Feb 9, 2017


ggouaillardet force-pushed the topic/hetero_fixes branch 3 times, most recently from 27c3f87 to c05a307 Compare February 10, 2017 01:33

ggouaillardet force-pushed the topic/hetero_fixes branch from c05a307 to 9854c55 Compare February 20, 2017 08:03

ggouaillardet force-pushed the topic/hetero_fixes branch from 9854c55 to 056067b Compare April 11, 2017 08:06

ggouaillardet force-pushed the topic/hetero_fixes branch from 056067b to 1d94e52 Compare April 11, 2017 08:22

ggouaillardet force-pushed the topic/hetero_fixes branch from 1d94e52 to 7ab13a5 Compare April 11, 2017 14:18

ggouaillardet force-pushed the topic/hetero_fixes branch from 7ab13a5 to 3279f10 Compare July 7, 2017 05:08

ggouaillardet mentioned this pull request Jul 7, 2017

Detect that we have a mix of BE/LE in the system, provide a warning that OMPI doesn't currently support this environment, and error out #3828

Merged

bosilca reviewed Jul 7, 2017


rhc54 mentioned this pull request Jul 7, 2017

Remove --enable-heterogeneous until fix is ready #3835

Merged

ggouaillardet force-pushed the topic/hetero_fixes branch from 3279f10 to 1d4342e Compare July 10, 2017 06:54

jsquyres reviewed Jul 11, 2017


ggouaillardet force-pushed the topic/hetero_fixes branch from 1d4342e to fd413e6 Compare July 12, 2017 01:26

ggouaillardet added 8 commits July 12, 2017 10:27

Revert "Remove --enable-heterogeneous until fix is ready"

c36b9e8

This reverts commit open-mpi/ompi@8e25733. Signed-off-by: Gilles Gouaillardet <[email protected]>

oob/tcp: make mca_oob_tcp_msg_type_t an uint8_t

626e94b

so no conversion is required when heterogeneous mode is enabled Signed-off-by: Gilles Gouaillardet <[email protected]>
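
The reasoning behind this commit is that a single byte has no byte order, so a one-byte message type can be copied to and from the wire unchanged, while a wider field has to be put into an agreed order. A minimal sketch with hypothetical names (`sketch_msg_type_t` and the pack/unpack helpers are illustrative, not the oob/tcp code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical one-byte message type, in the spirit of mca_oob_tcp_msg_type_t. */
typedef uint8_t sketch_msg_type_t;

/* One byte: identical on big- and little-endian peers, no conversion needed. */
static size_t pack_type(unsigned char *wire, sketch_msg_type_t t)
{
    wire[0] = t;
    return 1;
}

static sketch_msg_type_t unpack_type(const unsigned char *wire)
{
    return wire[0];
}

/* Contrast: a 32-bit field must be written in a fixed (here big-endian) order. */
static size_t pack_u32_be(unsigned char *wire, uint32_t v)
{
    wire[0] = (unsigned char)(v >> 24);
    wire[1] = (unsigned char)(v >> 16);
    wire[2] = (unsigned char)(v >> 8);
    wire[3] = (unsigned char)v;
    return 4;
}

static uint32_t unpack_u32_be(const unsigned char *wire)
{
    return ((uint32_t)wire[0] << 24) | ((uint32_t)wire[1] << 16) |
           ((uint32_t)wire[2] << 8)  |  (uint32_t)wire[3];
}
```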

btl/tcp: fix heterogeneous support for put / large messages

32606ad

Signed-off-by: Gilles Gouaillardet <[email protected]>

opal/ddt: use optimized description when packing contiguous datatypes

9118777

Signed-off-by: Gilles Gouaillardet <[email protected]>

opal/datatype: add minimal support to convert long double

8fd08b9

between ieee 754 quadruple precision and extended precision formats. Signed-off-by: Gilles Gouaillardet <[email protected]>

opal/datatype: fix opal_dt_swap_long_double if no IEEE754_H

a111fc8

Signed-off-by: Gilles Gouaillardet <[email protected]>

topo/treematch: fix topo_treematch_distgraph_create

7a866f7

Signed-off-by: Gilles Gouaillardet <[email protected]>

ggouaillardet force-pushed the topic/hetero_fixes branch from fd413e6 to 7a866f7 Compare July 12, 2017 01:27

ggouaillardet merged commit a71d5c9 into open-mpi:master Jul 12, 2017

This was referenced Jul 13, 2017

v3.0.x: misc fixes for heterogeneous clusters #3871

Merged

ompi/attributes: revamp attribute handling. #3891

Merged
