Fujitsu: MPI_GATHER (linear_sync) can be truncated with derived datatypes #134


Open

ompiteam opened this issue Oct 1, 2014 · 10 comments

@ompiteam
Contributor

ompiteam commented Oct 1, 2014

Per http://www.open-mpi.org/community/lists/devel/2012/01/10215.php, MPI_GATHER using coll:tuned's linear_sync algorithm can be improperly truncated when derived datatypes are used.

I slightly modified the program that was originally sent and attached it here. It shows the problem for me on trunk and v1.5 (I assume it's also a problem on v1.4).

Many thanks for the bug report from Fujitsu.
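For reference, here is a minimal sketch of the kind of reproducer described above (the attached gather.c is not reproduced here; the contiguous datatype and the COUNT value are assumptions, chosen only so that the message is large enough for tuned to pick a segmented gather). Each rank describes its send buffer with a derived datatype while the root receives plain MPI_INTs, so the type signatures match but the datatype/count pairs differ on the two sides:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define COUNT (64 * 1024)   /* ints per rank; assumed large enough to trigger segmentation */

int main(int argc, char **argv)
{
    int rank, size, i, errs = 0;
    MPI_Datatype contig;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *sendbuf = malloc(COUNT * sizeof(int));
    int *recvbuf = NULL;
    for (i = 0; i < COUNT; i++) sendbuf[i] = rank;

    /* One "contig" covers COUNT ints: same type signature as COUNT x MPI_INT,
     * but a different datatype/count pair than the receive side uses. */
    MPI_Type_contiguous(COUNT, MPI_INT, &contig);
    MPI_Type_commit(&contig);

    if (rank == 0) recvbuf = malloc((size_t)size * COUNT * sizeof(int));

    MPI_Gather(sendbuf, 1, contig,        /* each rank sends 1 derived datatype     */
               recvbuf, COUNT, MPI_INT,   /* root receives COUNT basic ints per rank */
               0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (i = 0; i < size * COUNT; i++)
            if (recvbuf[i] != i / COUNT) errs++;
        printf("%d mismatched elements\n", errs);
        free(recvbuf);
    }

    free(sendbuf);
    MPI_Type_free(&contig);
    MPI_Finalize();
    return 0;
}
```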

@ompiteam
Contributor Author

ompiteam commented Oct 1, 2014

Imported from trac issue 2981. Created by jsquyres on 2012-01-26T17:42:14, last modified: 2014-05-20T17:59:11

  • jsquyres attached gather.c on 2012-01-26 17:42:35

@ompiteam
Contributor Author

ompiteam commented Oct 1, 2014

Trac comment by jsquyres on 2012-01-26 17:43:12:

Oops -- this is a DDT issue, and I meant to assign it to George. :-)

@ompiteam
Contributor Author

ompiteam commented Oct 1, 2014

Trac comment by jsquyres on 2012-04-17 11:18:51:

George -- can you have a look?

@ompiteam
Contributor Author

ompiteam commented Oct 1, 2014

Trac comment by jsquyres on 2012-04-24 14:05:39:

No fix provided yet -- pushing to 1.6.1.

@ompiteam
Contributor Author

ompiteam commented Oct 1, 2014

Trac comment by bosilca on 2014-05-20 17:59:11:

This is a more general issue we have in Open MPI with the tuned collectives. If the send and the receive datatypes and counts are not identical, the message splitting decision is wrong (each side splits the message into repetitions of its entire datatype), leading to truncation in the best case and to wrong messages in the worst one. Without going through a packed version, there is no easy fix.
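To make the splitting mismatch concrete, here is a back-of-the-envelope sketch with assumed numbers (the 1024-byte segment size and the 3-int derived type are illustrative, not Open MPI's actual defaults). The sender counts off whole MPI_INTs per segment while the receiver counts off whole derived datatypes, so the two sides disagree about where the segment boundaries fall:

```c
#include <stdio.h>

int main(void)
{
    const int segment_bytes = 1024;           /* assumed pipelining segment size            */
    const int int_bytes     = 4;              /* size of an MPI_INT on common platforms     */
    const int dtype_bytes   = 3 * int_bytes;  /* assumed derived type: contiguous of 3 ints */

    /* Sender side: count expressed in MPI_INT, so a segment holds 256 ints = 1024 bytes. */
    int send_ints_per_seg  = segment_bytes / int_bytes;       /* 256 */

    /* Receiver side: count expressed in whole derived datatypes, so a segment holds only
     * 85 datatypes = 255 ints = 1020 bytes. The boundaries drift apart by 4 bytes per
     * segment, so one side eventually posts a receive smaller than the incoming piece:
     * truncation at best, silently misplaced data at worst. */
    int recv_types_per_seg = segment_bytes / dtype_bytes;     /* 85 */

    printf("sender segment:   %d bytes\n", send_ints_per_seg * int_bytes);
    printf("receiver segment: %d bytes\n", recv_types_per_seg * dtype_bytes);
    return 0;
}
```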

yosefe pushed a commit to yosefe/ompi that referenced this issue Mar 5, 2015
Resolve thread safety in TCP BTL jenkins: threads, known_issues
lrrajesh added a commit to lrrajesh/ompi that referenced this issue Mar 19, 2015
@hppritcha
Member

@bosilca can this be closed?

@bosilca
Member

bosilca commented Feb 3, 2020

This isn't fixed and is not going to be. The simplest solution for applications calling collectives with different type signatures (but the same typemap) on the two sides is to disable all pipelining for MPI collectives.

@gpaulsen
Member

gpaulsen commented Feb 3, 2020

@bosilca Is there a way to disable just the pipelining for MPI collectives? I think the big hammer is disabling the entire tuned collective component, but perhaps there's a better approach?
I see you can force a non-pipelined algorithm for bcast and reduce individually, but is there a more general way?

@bosilca
Member

bosilca commented Feb 3, 2020

First, all pipelined algorithms suffer from this issue, not only those in the tuned collectives. Second, disabling tuned, or more generally disabling pipelining, will have a drastic performance impact on most applications (and not only for DL). Last, tuned is the only collective component that supports MPI_T as a means to configure the collective decision per communicator (and there are several examples on our mailing lists of how to achieve this for the tuned module).
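As a sketch of the MPI_T route mentioned above: the cvar names coll_tuned_use_dynamic_rules and coll_tuned_gather_algorithm, and the assumption that algorithm value 1 selects the non-segmented basic linear gather, should be checked against `ompi_info --all` on your installation; the code itself only uses the standard MPI_T control-variable calls.

```c
#include <mpi.h>
#include <stdio.h>
#include <string.h>

/* Look up a control variable by name and write an integer value to it.
 * Returns MPI_SUCCESS, or MPI_ERR_UNKNOWN if no cvar with that name exists. */
static int set_int_cvar(const char *name, int value)
{
    int i, num;
    MPI_T_cvar_get_num(&num);
    for (i = 0; i < num; i++) {
        char cname[256], desc[1024];
        int nlen = sizeof(cname), dlen = sizeof(desc);
        int verbosity, bind, scope, count;
        MPI_Datatype dtype;
        MPI_T_enum enumtype;
        MPI_T_cvar_handle handle;

        MPI_T_cvar_get_info(i, cname, &nlen, &verbosity, &dtype,
                            &enumtype, desc, &dlen, &bind, &scope);
        if (strcmp(cname, name) != 0) continue;

        MPI_T_cvar_handle_alloc(i, NULL, &handle, &count);
        MPI_T_cvar_write(handle, &value);
        MPI_T_cvar_handle_free(&handle);
        return MPI_SUCCESS;
    }
    return MPI_ERR_UNKNOWN;
}

int main(int argc, char **argv)
{
    int provided;

    /* MPI_T can be initialized before MPI_Init, so the values are in place
     * when the tuned component sets up its decision rules. */
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    set_int_cvar("coll_tuned_use_dynamic_rules", 1);  /* assumed cvar name */
    set_int_cvar("coll_tuned_gather_algorithm", 1);   /* assumed: 1 = basic linear (no segmentation) */

    MPI_Init(&argc, &argv);
    /* ... MPI_Gather calls here should now avoid the segmented linear_sync path ... */
    MPI_Finalize();
    MPI_T_finalize();
    return 0;
}
```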

@kawashima-fj
Member

Related (but not the same): #199, #1763
