Skip to content

List of issues with OMPI master+ULT+PartCom #10459

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
4 of 6 tasks
janciesko opened this issue Jun 8, 2022 · 21 comments
Open
4 of 6 tasks

List of issues with OMPI master+ULT+PartCom #10459

janciesko opened this issue Jun 8, 2022 · 21 comments

Comments

@janciesko
Copy link
Contributor

janciesko commented Jun 8, 2022

This tracks current issues for ULT- and PartCom support in OMPI.

ULT Support:

  • Compilation fails on current master @ HEAD when configured to --with-threads={qthreads,argobots} with
    ../../opal/mca/threads/qthreads/threads_qthreads.h:29:10: fatal error: qthread.h: No such file or directory 29 | #include "qthread.h" Configure seems to be not setting -I/PATH_TO_ULT_LIB
  • OMPI hangs when configured to --with-threads=argobots and argobots uses multiple streams and --map-by ppr:1:node thus using ucx and libevent. Requires https://github.com/shintaro-iwasaki/libevent/commits/2.0.22-abt in libevent. We need to merge that branch into libevent soon. Check why Qthreads works.
  • MPI_Init_thread hangs on current master @ HEAD when configured to --with-threads={qthreads,argobots}. Maybe this is an issue in instance/instance.h after merging sessions-related code changes.
  • UCX fails asserts on POW9 and ARM: Discussed this with Howard. An assert is failing in src/ucp/core/ucp_worker.c. I'll add more info once the latest OMPI version compiles.

Partitioned Communication:

  • Non equal numbers of send- and receive partitions leads to deadlock. Looking at a backtrace, it seems that the receiving rank waits for the last receive partition. MPI Partix is a reproducer when setting DEFAULT_RECV_SEND_PARTITION_RATIO=1 to another value such as 2.

General improvements:

  • Add ULT coverage to CI testing

Reproducers of all above:

git clone https://github.com/sandialabs/MPI-Partix.git
export PATH=$OMPI_INSTALL_PATH/bin
cmake ..  -DCMAKE_CXX_COMPILER=mpicxx -DQthreads_ROOT=$QTHREADS_INSTALL_PATH -DPartix_ENABLE_QTHREADS=ON
mpirun -np 2 --map-by ppr:1:node ./bench1
@jsquyres
Copy link
Member

@shintaro-iwasaki Can you have a look at this?

@shintaro-iwasaki
Copy link
Contributor

As I left Argonne a year ago, I might not be the right person for this, but please let me know anything I can help. Jan should have known it.

(Currently, Yanfei Guo is the POC on the Argobots side.)

@jsquyres
Copy link
Member

@shintaro-iwasaki Gotcha. What's Yanfei Guo's github ID?

@shintaro-iwasaki
Copy link
Contributor

@yfguo He is leading the MPICH project.

@jsquyres
Copy link
Member

@yfguo It looks like there are some errors with the qthreads implementation in Open MPI. @shintaro-iwasaki used to be the maintainer of this, but now -- perhaps by default? -- the ownership has apparently transferred to you...? We're having some compile issues as listed in this issue. Can you investigate?

@shintaro-iwasaki
Copy link
Contributor

shintaro-iwasaki commented Jun 16, 2022

Sorry for interruption. I believe @janciesko owns the Qthreads/Open MPI part. For example, #8479 and #8500

I think Jan posted this issue to fix it later.

@jsquyres
Copy link
Member

Ah, ok. @janciesko Can you also update the owner.txt files so that we don't ping the wrong people in the future? 😄

@awlauria
Copy link
Contributor

@janciesko is there any update on this? We may look to disable these for the v5.0.0 release if not completed in time, since they are new features it would not be a regression. That said, we would love to have them. We're targeting an end of summer release, so end of August/September timeframe. Thanks!

@janciesko
Copy link
Contributor Author

janciesko commented Jul 26, 2022

@awlauria, thanks for clarifying the time scope for this. I'll look into it and report back.

@janciesko
Copy link
Contributor Author

@bwbarrett, can you please take a look why ompi fails to compile when enabling the use of Qtheads or Argobots?
The following commit introduced the error: 6767309.
Configure fails to set OPAL_QTHREADS_INCLUDE_PATH in ./opal/mca/threads/qthreads/threads_qthreads.h.in
Same behavior applies to Argobots.

@hppritcha
Copy link
Member

I am working on a fix for the first item.

hppritcha added a commit to hppritcha/ompi that referenced this issue Aug 9, 2022
The OAC_CHECK_PACKAGE macro doesn't have quite the same
functionality as the origina ?OPAL_CHECK_PACKAGE? macro,
so add some sed magic to get the path we need for
argo/qthread related header files.

related to open-mpi#10459

Signed-off-by: Howard Pritchard <[email protected]>
hppritcha added a commit to hppritcha/ompi that referenced this issue Aug 9, 2022
The OAC_CHECK_PACKAGE macro doesn't have quite the same
functionality as the origina ?OPAL_CHECK_PACKAGE? macro,
so add some sed magic to get the path we need for
argo/qthread related header files.

related to open-mpi#10459

Signed-off-by: Howard Pritchard <[email protected]>
hppritcha added a commit to hppritcha/ompi that referenced this issue Aug 12, 2022
The OAC_CHECK_PACKAGE macro doesn't have quite the same
functionality as the origina ?OPAL_CHECK_PACKAGE? macro,
so add some sed magic to get the path we need for
argo/qthread related header files.

related to open-mpi#10459

Signed-off-by: Howard Pritchard <[email protected]>
@janciesko
Copy link
Contributor Author

janciesko commented Aug 25, 2022

This

opal_mutex_unlock (&instance_lock);
causes a deadlock as that lock is already taken by the executing thread. The lock is previously set here:
opal_mutex_lock (&instance_lock);

#10720 fixes item three.

@janciesko
Copy link
Contributor Author

Item three needs support for recursive locks in Qthreads. This is in progress.

hppritcha added a commit to hppritcha/ompi that referenced this issue Sep 1, 2022
Accomodate changes from going from OPAL_CHECK_PACKAGET to  OAC_CHECK_PACKAGE.

related to open-mpi#10459

Signed-off-by: Howard Pritchard <[email protected]>
janciesko pushed a commit to janciesko/ompi that referenced this issue Sep 1, 2022
The OAC_CHECK_PACKAGE macro doesn't have quite the same
functionality as the origina ?OPAL_CHECK_PACKAGE? macro,
so add some sed magic to get the path we need for
argo/qthread related header files.

related to open-mpi#10459

Signed-off-by: Howard Pritchard <[email protected]>
hppritcha added a commit to hppritcha/ompi that referenced this issue Sep 6, 2022
Accomodate changes from going from OPAL_CHECK_PACKAGET to  OAC_CHECK_PACKAGE.

related to open-mpi#10459

Signed-off-by: Howard Pritchard <[email protected]>
(cherry picked from commit ef38d92)
@janciesko
Copy link
Contributor Author

janciesko commented Sep 8, 2022

We have added a draft PR for supporting recursive locks in Qthreads. sandialabs/qthreads#111
With this feature, we resolve the deadlock reported in item three.

@janciesko
Copy link
Contributor Author

janciesko commented Sep 16, 2022

Update: We have added a PR for supporting recursive locks in Qthreads. sandialabs/qthreads#112

@jsquyres
Copy link
Member

Does Open MPI need to use some minimum version of Qthreads in order to ensure correctness? If so, it might be worthwhile to enforce that in the qthreads component's configure.m4.

@janciesko
Copy link
Contributor Author

@jsquyres, yes and yes, we'd need a version check for qthreads to be >=1.18.

@janciesko
Copy link
Contributor Author

#108032 fixes item three.

@janciesko
Copy link
Contributor Author

I am marking issue 4 as complete as I cannot reproduce that issue anymore.

@janciesko
Copy link
Contributor Author

janciesko commented Oct 10, 2022

I am marking issue 2 as complete as I cannot reproduce that issue anymore. In this case I am using the default downstream dep on libevent 2.1.12.
The issue persists and is specific to libevent. In lieu of the check box item I have created a new issue for libevent. I am marking checkbox 2 as complete.

@janciesko
Copy link
Contributor Author

@jsquyres, addressing the last checkbox, what methodology would you propose to test OMPI and the threading MCA when configuring OMPI with a supported ULT lib? ASFAIK we do not test this. We could add a test similar to this one and test with a release version of both ULT libs?

@jsquyres jsquyres modified the milestones: v5.0.0, v5.0.1 Oct 30, 2023
@janjust janjust modified the milestones: v5.0.1, v5.0.2 Jan 8, 2024
yli137 pushed a commit to yli137/ompi that referenced this issue Jan 10, 2024
Accomodate changes from going from OPAL_CHECK_PACKAGET to  OAC_CHECK_PACKAGE.

related to open-mpi#10459

Signed-off-by: Howard Pritchard <[email protected]>
@jsquyres jsquyres modified the milestones: v5.0.2, v5.0.3 Feb 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants