error handling large datatypes #6016

Closed
ggouaillardet opened this issue Nov 2, 2018 · 2 comments

@ggouaillardet
Contributor

The issue was initially reported at https://www.mail-archive.com/[email protected]/msg20812.html

The inline program below can be used to reproduce the issue.
It has to be run with a single MPI task and requires ~64 GB of memory (!).

#include <mpi.h>
#include <stdio.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char * argv[]) {

  MPI_Init(&argc, &argv);

  int const per_process = 192;
  int const per_type = 20000000;
  size_t const bufsize = (size_t)per_type * (size_t)per_process * 4 * (size_t)sizeof(float);

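  float * const buffer = malloc(bufsize);
  if (NULL == buffer) {
    /* the reproducer is meaningless without the buffer, so give up early */
    fprintf(stderr, "failed to allocate %zu bytes\n", bufsize);
    MPI_Abort(MPI_COMM_WORLD, 1);
  }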

  int scounts[2] = {per_process, per_process};
  int sdispls[2] = {3*per_process, 0*per_process};
  int rcounts[2] = {per_process, per_process};
  int rdispls[2] = {1*per_process, 2*per_process};

  /* %p expects void *, so cast each computed address explicitly */
  printf ("buffer %p-%p : %p-%p\n",
          (void *)(buffer+(size_t)per_type*(size_t)sdispls[0]), (void *)(buffer+(size_t)per_type*(size_t)(sdispls[0]+scounts[0])),
          (void *)(buffer+(size_t)per_type*(size_t)sdispls[1]), (void *)(buffer+(size_t)per_type*(size_t)(sdispls[1]+scounts[1])));

  MPI_Datatype ddt, stype, rtype;
  MPI_Type_contiguous(per_type, MPI_FLOAT, &ddt);
  MPI_Type_indexed(2, scounts, sdispls, ddt, &stype);
  MPI_Type_commit(&stype);
  MPI_Type_indexed(2, rcounts, rdispls, ddt, &rtype);
  MPI_Type_commit(&rtype);

  MPI_Sendrecv(buffer, 1, stype, 0, 0,
               buffer, 1, rtype, 0, 0,
               MPI_COMM_SELF, MPI_STATUS_IGNORE);

  MPI_Type_free(&stype);
  MPI_Type_free(&rtype);
  free(buffer);

  MPI_Finalize();

}

I ran this under a debugger and found that Open MPI tries to pack more data than necessary.
At this stage, I could not find out why, nor could I spot any obvious integer overflow.

ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Nov 4, 2018
Always use size_t (instead of converting to an uint32_t) in order to
correctly support large datatypes.

Refs open-mpi#6016

Signed-off-by: Gilles Gouaillardet <[email protected]>
ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Nov 4, 2018
Always use size_t (instead of converting to an uint32_t) in order to
correctly support large datatypes.

Thanks Ben Menadue for the initial bug report

Refs open-mpi#6016

Signed-off-by: Gilles Gouaillardet <[email protected]>
ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Nov 5, 2018
Always use size_t (instead of converting to an uint32_t) in order to
correctly support large datatypes.

Thanks Ben Menadue for the initial bug report

Refs open-mpi#6016

Signed-off-by: Gilles Gouaillardet <[email protected]>
ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Nov 19, 2018
Always use size_t (instead of converting to an uint32_t) in order to
correctly support large datatypes.

Thanks Ben Menadue for the initial bug report

Refs open-mpi#6016

Signed-off-by: Gilles Gouaillardet <[email protected]>
@jsquyres
Member

jsquyres commented Dec 1, 2018

@ggouaillardet Can you add a derivative of this test program to the ibm test suite? Perhaps modify it to do the malloc only on local rank 0 (so that it can still be run in MTT with more than ppn=1), add a check for malloc failure, etc.

@bosilca
Member

bosilca commented Dec 5, 2018

I added a test in #6029 based on the example presented here, but without any memory allocation.

ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Dec 6, 2018
Always use size_t (instead of converting to an uint32_t) in order to
correctly support large datatypes.

Thanks Ben Menadue for the initial bug report

Refs open-mpi#6016

Signed-off-by: Gilles Gouaillardet <[email protected]>
bosilca pushed a commit to bosilca/ompi that referenced this issue Mar 6, 2019
Always use size_t (instead of converting to an uint32_t) in order to
correctly support large datatypes.

Thanks Ben Menadue for the initial bug report

Refs open-mpi#6016

Signed-off-by: Gilles Gouaillardet <[email protected]>
hoopoepg pushed a commit to hoopoepg/ompi that referenced this issue Mar 6, 2019
Always use size_t (instead of converting to an uint32_t) in order to
correctly support large datatypes.

Thanks Ben Menadue for the initial bug report

Refs open-mpi#6016

Signed-off-by: Gilles Gouaillardet <[email protected]>
(cherry picked from commit fbb5bb8)

Conflicts:
	opal/datatype/opal_convertor_raw.c
bosilca closed this as completed May 9, 2019
bosilca pushed a commit to bosilca/ompi that referenced this issue Sep 13, 2019
Always use size_t (instead of converting to an uint32_t) in order to
correctly support large datatypes.

Thanks Ben Menadue for the initial bug report

Refs open-mpi#6016

Signed-off-by: Gilles Gouaillardet <[email protected]>
markalle pushed a commit to markalle/ompi that referenced this issue Sep 12, 2020
Always use size_t (instead of converting to an uint32_t) in order to
correctly support large datatypes.

Thanks Ben Menadue for the initial bug report

Refs open-mpi#6016

Signed-off-by: Gilles Gouaillardet <[email protected]>