MPI_Info_create and sessions #12854

Closed
mpokorny opened this issue Oct 11, 2024 · 3 comments

@mpokorny

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

OpenMPI v5.0.5

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Installed using Spack

Please describe the system on which you are running

  • Operating system/version: Pop!_OS 22.04 LTS
  • Computer hardware: 12th Gen Intel(R) Core(TM) i7-1255U
  • Network type: none

Details of the problem

The following program:

#include <mpi.h>
int
main(int argc, char** argv) {

  MPI_Info info;
  MPI_Info_create(&info);
  MPI_Session s1, s2;
  MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_RETURN, &s1);
  MPI_Session_finalize(&s1);
  MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_RETURN, &s2);
  MPI_Session_finalize(&s2);
}

fails at runtime, as follows:

$ mpirun -np 1 src/Test 
free(): double free detected in tcache 2
[aeolus:740193] *** Process received signal ***
[aeolus:740193] Signal: Aborted (6)
[aeolus:740193] Signal code:  (-6)
[aeolus:740193] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x74c71b842520]
[aeolus:740193] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x74c71b8969fc]
[aeolus:740193] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x74c71b842476]
[aeolus:740193] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x74c71b8287f3]
[aeolus:740193] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x89676)[0x74c71b889676]
[aeolus:740193] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0xa0cfc)[0x74c71b8a0cfc]
[aeolus:740193] [ 6] /lib/x86_64-linux-gnu/libc.so.6(+0xa30ab)[0x74c71b8a30ab]
[aeolus:740193] [ 7] /lib/x86_64-linux-gnu/libc.so.6(free+0x73)[0x74c71b8a5453]
[aeolus:740193] [ 8] /home/martin/spack/opt/spack/linux-pop22-skylake/gcc-14.1.0/openmpi-5.0.5-dkgact6ph5rgu6fnp5tcfeejp754i7pv/lib/libopen-pal.so.80(+0xe8c49)[0x74c71bba7c49]
[aeolus:740193] [ 9] /home/martin/spack/opt/spack/linux-pop22-skylake/gcc-14.1.0/openmpi-5.0.5-dkgact6ph5rgu6fnp5tcfeejp754i7pv/lib/libmpi.so.40(+0x8dab9)[0x74c71c08dab9]
[aeolus:740193] [10] /home/martin/spack/opt/spack/linux-pop22-skylake/gcc-14.1.0/openmpi-5.0.5-dkgact6ph5rgu6fnp5tcfeejp754i7pv/lib/libmpi.so.40(ompi_mpi_instance_finalize+0xad)[0x74c71c08eecd]
[aeolus:740193] [11] /home/martin/spack/opt/spack/linux-pop22-skylake/gcc-14.1.0/openmpi-5.0.5-dkgact6ph5rgu6fnp5tcfeejp754i7pv/lib/libmpi.so.40(MPI_Session_finalize+0x4c)[0x74c71c0c7b0c]
[aeolus:740193] [12] src/Test[0x4011a5]
[aeolus:740193] [13] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x74c71b829d90]
[aeolus:740193] [14] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x74c71b829e40]
[aeolus:740193] [15] src/ray/Test[0x401085]
[aeolus:740193] *** End of error message ***
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 740193 on node aeolus exited on
signal 6 (Aborted).
--------------------------------------------------------------------------

Removing either the call to MPI_Info_create() or the second call to MPI_Session_finalize() allows the program to complete. Another workaround I've found is to add a call to MPI_Init() prior to the call to MPI_Info_create().

@hppritcha
Member

This behavior appears to be connected with the changes associated with commit c2ddb1e.

hppritcha added a commit to hppritcha/ompi that referenced this issue Oct 17, 2024
Fix a couple of problems uncovered in issue open-mpi#12854.

Turns out the MCA param management system was "remembering" things
even if a variable was deregistered when a framework was closed.

Also the test case showed that destructing ompi_mpi_session_null
needs to be moved to ompi_mpi_instance_release.

Related to open-mpi#12854

Signed-off-by: Howard Pritchard <[email protected]>
@hppritcha
Member

@mpokorny thanks for the test case. We'll add it to one of our regression test suites.

hppritcha added a commit to hppritcha/ompi that referenced this issue Oct 25, 2024
Fix a couple of problems uncovered in issue open-mpi#12854.

Turns out the MCA param management system was "remembering" things
even if a variable was deregistered when a framework was closed.

Also the test case showed that destructing ompi_mpi_session_null
needs to be moved to ompi_mpi_instance_release.

Related to open-mpi#12854

Signed-off-by: Howard Pritchard <[email protected]>
(cherry picked from commit 155ee56)
@hppritcha
Member

Closed via #12883 and #12868.
