-
Notifications
You must be signed in to change notification settings - Fork 900
bug in datatype handling #7019
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@tjahns Thank you for reporting, and root causing. Our active release branches are currently v3.0.x, v3.1.x, and v4.0.x, and so we are not currently planning any more releases on earlier branches. There is some precedent to put important fixes on earlier branches in case users wish to download and build themselves. @bosilca do you have time to look at this? |
The problem is that for a yet unknown reason the datatype engine thinks both data generated by your tests are empty, containing no predefined types.
|
I take that back, the test is using MPI_INTEGER (a fortran datatype) that translate to an empty datatype when Fortran support is turned off. |
Here is a trimmed version of the second program (the issue is really in the send type) #include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>
int main(int argc, char **argv)
{
MPI_Init(&argc, &argv);
MPI_Datatype types[6], ddt;
MPI_Type_vector(4, 1, 0, MPI_INT, types+0);
MPI_Type_dup(MPI_INT, types+1);
MPI_Type_vector(4, 2, 1, MPI_INT, types+2);
MPI_Type_vector(3, 1, 0, MPI_INT, types+3);
MPI_Type_dup(MPI_INT, types+4);
int blkls[5] = { 6, 1, 1, 6, 1 };
int displs[5] = { 0, 5, 8, 8, 13 };
MPI_Type_indexed(5, blkls, displs, MPI_INT, types+5);
int blkls2[6] = { 1, 1, 1, 1, 1, 1 };
MPI_Aint disps2[6] = { 36, 40, 40, 56, 68, 68};
MPI_Datatype ddt_send_from_0_to_0_0_0;
MPI_Type_create_struct(6, blkls2, disps2, types, &ddt);
for (int i=0; i<6; i++) MPI_Type_free(types+i);
MPI_Type_commit(&ddt);
int A[32], B[32];
for (int i=0; i<32; i++) A[i] = i;
for (int i=0; i<32; i++) B[i] = -1;
int ref[32] = {
9, 9, 9, 9,
10,
10,11,11,12,12,13,13,14,
14,14,14,
17,
17,18,19,20,21,22,
22,
25,
25,26,27,28,29,30,
30
};
MPI_Sendrecv(A, 1, ddt, 0, 1,
B, 32, MPI_INT, 0, 1,
MPI_COMM_SELF, MPI_STATUS_IGNORE);
if (memcmp(B, ref, 32*sizeof(int))) {
printf ("KO\n");
printf ("EXPECTED: ");
for (int i=0; i<32; i++) printf ("%2d ", ref[i]);
printf ("\n");
printf ("GOT : ");
for (int i=0; i<32; i++) printf ("%2d ", B[i]);
printf ("\n");
printf (" ");
for (int i=0; i<32; i++) printf ((ref[i] == B[i])?" ":"^^ ");
printf ("\n");
MPI_Abort(MPI_COMM_WORLD, 1);
} else {
printf ("OK\n");
}
MPI_Finalize();
return 0;
} |
just checked and with 4.0.2rc2 Gilles version passes. |
I understand how this issue is triggered. Basically, the optimization of the datatype in the commit is overzealous and merges 2 descriptions that should not be merged. It basically transform
into
The problem here was that the first of the two predefined types in the original description has an extent of 0 (it is a legal way to send multiple times the same data), but this 0 extent is lost during the optimization as it is transformed into an extent of 4 bytes (with an updated count). This is exactly the problem that was fixed in #6952 but as I mentionned in #6978 it is too time consuming to backport the patch. That being said you are not completely out of luck. I came up with a quick fix, dirty as it might cut others optimizations, but at least it seems to solve the issue highlighted here. diff --git a/opal/datatype/opal_datatype_optimize.c b/opal/datatype/opal_datatype_optimize.c
index 882e3a8..2ab02db 100644
--- a/opal/datatype/opal_datatype_optimize.c
+++ b/opal/datatype/opal_datatype_optimize.c
@@ -210,7 +210,8 @@ opal_datatype_optimize_short( opal_datatype_t* pData,
continuity = ((last_disp + (ptrdiff_t)last_length * (ptrdiff_t)opal_datatype_basicDatatypes[last_type]->size)
== (total_disp + pData->desc.desc[pos_desc].elem.disp));
- if( (pData->desc.desc[pos_desc].elem.common.flags & OPAL_DATATYPE_FLAG_CONTIGUOUS) && continuity &&
+ if( ((pData->desc.desc[pos_desc].elem.common.flags & OPAL_DATATYPE_FLAG_CONTIGUOUS) && continuity) &&
+ ((0 == last_length) || (last_extent == (int32_t)opal_datatype_basicDatatypes[last_type]->size)) &&
(pData->desc.desc[pos_desc].elem.extent == (int32_t)opal_datatype_basicDatatypes[type]->size) ) {
if( type == last_type ) {
last_length += pData->desc.desc[pos_desc].elem.count; |
@bosilca Thanks for investigating, and the clear write up as to how it might affect the release branches. We're tagging v4.0.2rc3 today, and this write-up helped us understand that we don't need to hold it for this issue. |
Thanks for investigating, I should probably check master too next time. I'll see if I can get packagers like spack to adopt the patch for as many versions as feasible. |
Hey @bosilca -- can that 2-line patch be applied to v3.0.x / v3.1.x? |
It was designed for the 3.x series, it should apply smoothly to both branches. A quick check seems to indicate it could also be applied (shifted by few lines) to the 2.x if someone cares. |
Do not remove 0-extent instances while optimizing the datatype. This is a distillation of a larger set of DDT bug fixes and optimizations from master. George tried to cherry-pick these fixes+optimizations back to the v3.0.x/v3.1.x branches and it resulted in a giant list of heavily-modified cherry-picked commits (see open-mpi#6978). This did not seem like a good idea. George therefore distilled the critical bug fix down to this small commit, suitable for both the v3.0.x/v3.1.x branches. See open-mpi#7019 (comment) for details. Signed-off-by: Jeff Squyres <[email protected]>
Do not remove 0-extent instances while optimizing the datatype. This is a distillation of a larger set of DDT bug fixes and optimizations from master. George tried to cherry-pick these fixes+optimizations back to the v3.0.x/v3.1.x branches and it resulted in a giant list of heavily-modified cherry-picked commits (see open-mpi#6978). This did not seem like a good idea. George therefore distilled the critical bug fix down to this small commit, suitable for both the v3.0.x/v3.1.x branches. See open-mpi#7019 (comment) for details. Signed-off-by: Jeff Squyres <[email protected]> (cherry picked from commit b77ee56)
I've found and now finally generated a small reproducer for a bug in handling of mildly complex datatypes introduced in commit 639f4b1 by @bosilca
In 2.0.3 the problem can simply be fixed by copying over the prior version of opal/datatype/opal_datatype_add.c from 2.0.2 but for more recent versions the fix is not so straight-forward obviously
The issue affects all versions of OpenMPI from 2.0.3 onwards including 4.0.1.
I've tested 2.0.2, 2.0.3, 2.0.4 and 4.0.1 and a colleague tested 2.1. In all cases except 2.0.2 for which I had an operating system package (but also verified from source tarball) we built from released source tarballs.
I've tested this on multiple installations of OpenMPI on Linux (Debian 9 and RedHat 6)
x86_64
Can be reproduced with IB, tcp, MPI_COMM_SELF and shared memory.
Details of the problem
The two attached example programs both build with a simple mpicc invocation. In case the C compiler is a little older, -std=gnu99 might be needed for the second example. Both programs are expected to silently report success but will fail with diagnostic output on affected OpenMPI versions.
openmpi_datatype.zip
The text was updated successfully, but these errors were encountered: