coll/basic allgatherv wrong answer on different datatypes/counts #1907

Closed
jjhursey opened this issue Jul 27, 2016 · 3 comments
@jjhursey (Member)

Test Case: https://gist.github.com/jjhursey/508037aa535c7dd1fe2e64610675c280

The linked test case performs collectives using the same type signature at every rank, but with different datatypes and counts at each rank to achieve that signature. Its three datatypes are: a plain MPI_LONG_LONG, a non-contiguous type consisting of one MPI_LONG_LONG followed by a gap, and a contiguous type of 2x MPI_LONG_LONG.

When run with -np 4 (either on a single node or across two nodes), MPI_Allgatherv fails with a wrong answer. The test case also exercises MPI_Bcast and MPI_Allgather, both of which pass.

shell$ mpirun -np 4 -mca coll ^hcoll,tuned ./coll_non_uniform_types
- testbcast 16
- testbcast 112
- testbcast 1008
- testbcast 10000
- testbcast 100000
- testbcast 1000000
- testallgather 16
- testallgather 112
- testallgather 1008
- testallgather 10000
- testallgather 100000
- testallgather 1000000
- testallgatherv 16
R0 buf[1] is 8897841259083430779, want 1 (from 0)
abort: wrong data(6)
R2 buf[1] is 8897841259083430779, want 1 (from 0)
abort: wrong data(6)
R1 buf[1] is 8897841259083430779, want 1 (from 0)
abort: wrong data(6)
R3 buf[1] is 8897841259083430779, want 1 (from 0)
abort: wrong data(6)
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 16.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

Note that this test passes with the tuned component. Below is the expected output:

shell$ mpirun -np 4 -mca coll ^hcoll ./coll_non_uniform_types
- testbcast 16
- testbcast 112
- testbcast 1008
- testbcast 10000
- testbcast 100000
- testbcast 1000000
- testallgather 16
- testallgather 112
- testallgather 1008
- testallgather 10000
- testallgather 100000
- testallgather 1000000
- testallgatherv 16
- testallgatherv 112
- testallgatherv 1008
- testallgatherv 10000
- testallgatherv 100000
- testallgatherv 1000000
jjhursey added this to the v2.0.1 milestone Jul 27, 2016
@jjhursey (Member, Author)

This may be related to issue #1763 (though that one is focused on a defect in coll/tuned).

ggouaillardet self-assigned this Jul 28, 2016
@ggouaillardet (Contributor)

@jjhursey this is a different issue; I am testing a fix right now.

ggouaillardet added a commit to ggouaillardet/ompi-release that referenced this issue Jul 28, 2016
@jjhursey (Member, Author) commented Aug 1, 2016

Thanks @ggouaillardet!

bosilca pushed a commit to bosilca/ompi that referenced this issue Oct 3, 2016