Skip to content

coll/tuned: fix allgather bruck algorithm #262

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

ggouaillardet
Copy link
Contributor

No description provided.

@ggouaillardet
Copy link
Contributor Author

@bosilca could you please review this ?

i wrote a test program available at https://gist.github.com/ggouaillardet/d0cf43616bd99ffb7821

in order to avoid a crash and/or incorrect output, i had to replace the lower bound (rlb) with the true lower bound (true_lb)
shall the discussion is coll_basic_reduce.c updated to clarify that LB is the true lower bound ?

@mellanox-github
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/3/

Build Log
last 50 lines


Test FAILed.

@bosilca
Copy link
Member

bosilca commented Nov 7, 2014

I thought about using true_lb instead of lb in the shift of the base pointer. And the more I think about this the more I have the impression that something is fishy altogether. Imagine a datatype representing a column in a 2D matrix stored by row. Such a datatype will have an extent equal to 1 element, while it will have a size of N elements and a span of N*(N-1) elements. The current way we handle this, by using the extent to compute how much memory we need, cannot handle this case. More thinking needed I guess...

@ggouaillardet
Copy link
Contributor Author

At first glance, the ddt has an extent of 1 and a true extent of n_(n-1)+1
So if we allocate the full matrix e.g. n columns, we allocate
true_extent+(n-1)_extent = n^2 elements, which is correct.
Did i miss something ?

@mike-dubman
Copy link
Member

testing jenkins PR commander:
retest this please

@ggouaillardet ggouaillardet force-pushed the poc/coll_tuned_allgather_bruck branch from 3809067 to 63e8522 Compare November 11, 2014 03:47
@mellanox-github
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/26/

Build Log
last 50 lines

[...truncated 23083 lines...]
WARNING: The mechanism by which environment variables are explicitly
passed to Open MPI has changed!

Specifically, beginning in the 1.9.x/2.0.x series, using "-x" to set
environment variables is deprecated.  Please use the
"mca_base_env_list" MCA parameter, instead.

For example, this invocation using the old "-x" mechanism:

    mpirun -x env_foo1=bar1 -x env_foo2=bar2 -x env_foo3 ...

is equivalent to this invocation using the new "mca_base_env_list"
mechanism:

    mpirun -mca mca_base_env_list 'env_foo1=bar1;env_foo2=bar2;env_foo3' ...
5/8 dst = 7 8 9
3/8 dst = 7 8 9
4/8 dst = 7 8 9
2/8 dst = 7 8 9
6/8 dst = 7 8 9
1/8 dst = 7 8 9
7/8 dst = 7 8 9
0/8 dst = 7 8 9
+ '[' yes == yes ']'
+ timeout -s SIGKILL 3m oshrun -np 8 --bind-to core -x SHMEM_SYMMETRIC_HEAP_SIZE=1024M --mca spml yoda -mca pml ob1 -mca btl self,vader /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace-3/ompi_install1/examples/oshmem_max_reduction
WARNING: The mechanism by which environment variables are explicitly
passed to Open MPI has changed!

Specifically, beginning in the 1.9.x/2.0.x series, using "-x" to set
environment variables is deprecated.  Please use the
"mca_base_env_list" MCA parameter, instead.

For example, this invocation using the old "-x" mechanism:

    mpirun -x env_foo1=bar1 -x env_foo2=bar2 -x env_foo3 ...

is equivalent to this invocation using the new "mca_base_env_list"
mechanism:

    mpirun -mca mca_base_env_list 'env_foo1=bar1;env_foo2=bar2;env_foo3' ...
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 3 (pid 16071, host=jenkins01) with errorcode -1.
--------------------------------------------------------------------------
[jenkins01:16065] 7 more processes have sent help message help-shmem-api.txt / shmem-abort
[jenkins01:16065] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Build step 'Execute shell' marked build as failure
[BFA] Scanning build for known causes...

[BFA] Done. 0s

Test FAILed.

@mike-dubman
Copy link
Member

retest this please

@mellanox-github
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/34/
Test PASSed.

@ggouaillardet
Copy link
Contributor Author

On 2014/11/14 0:29, bosilca wrote:

When the count is larger than 1 things seems to nicely fit together. But what if the count is 1?

in this case we use only true_extent, which is correct ... unless i
missed something (again)

bosilca added a commit to ggouaillardet/ompi that referenced this pull request Nov 14, 2014
true_lb while computing the lower bound.
@mellanox-github
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/42/
Test PASSed.

@ggouaillardet
Copy link
Contributor Author

@bosilca is it legal to MPI_Type_create_resized with a negative extent ?
if yes, then there is some work ...

the test program at https://gist.github.com/ggouaillardet/9b4deebb3f508be07912 emits warnings and crashes on 2 tasks or more with Open MPI
(fwiw, mpich is ok with 2 tasks, crashes with 3 tasks and produce incorrect results with 4 tasks ...)

@bosilca
Copy link
Member

bosilca commented Nov 14, 2014

Well ... based on MPI 3.1 page 107 Section 4.1 it is legal to have a
negative extent. This sounds like a lot of fun in perspective (!)

George.

On Fri, Nov 14, 2014 at 3:27 AM, Gilles Gouaillardet <
[email protected]> wrote:

@bosilca https://github.com/bosilca is it legal to
MPI_Type_create_resized with a negative extent ?
if yes, then there is some work ...

the test program at
https://gist.github.com/ggouaillardet/9b4deebb3f508be07912 emits warnings
and crashes on 2 tasks or more with Open MPI
(fwiw, mpich is ok with 2 tasks, crashes with 3 tasks and produce
incorrect results with 4 tasks ...)


Reply to this email directly or view it on GitHub
#262 (comment).

@ggouaillardet ggouaillardet force-pushed the poc/coll_tuned_allgather_bruck branch from 3196d44 to d622db7 Compare November 21, 2014 10:30
@mellanox-github
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/59/
Test PASSed.

@ggouaillardet ggouaillardet merged commit d622db7 into open-mpi:master Nov 21, 2014
jsquyres pushed a commit to jsquyres/ompi that referenced this pull request Sep 21, 2016
…s4.description-v1.8

README: update the Portals4 description
dong0321 pushed a commit to dong0321/ompi that referenced this pull request Feb 17, 2020
Add schizo/singularity component
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants