mpi4py: Regressions in main: segv in MPI_Init_thread #11433

Closed
dalcinl opened this issue Feb 22, 2023 · 8 comments

@dalcinl
Contributor

dalcinl commented Feb 22, 2023

https://github.com/mpi4py/mpi4py-testing/actions/runs/4238187139/jobs/7368190440
https://github.com/mpi4py/mpi4py-testing/actions/runs/4238187139/jobs/7368941976

Please note that the failure happens in a (heavily?) oversubscribed scenario. GitHub Actions runners have two virtual cores, and I'm running with 5 MPI processes there, plus a few more spawned on top. Not sure if this is relevant, but this observation may help the experts figure out what could be going wrong.
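For readers unfamiliar with the CI setup, the scenario amounts to something like the sketch below; the file name and the exact mpiexec options are illustrative, not the actual workflow commands.

```python
# oversubscribe_sketch.py -- hypothetical file name, not the actual CI driver.
# Launched on a 2-core runner with more ranks than cores, roughly:
#   mpiexec --oversubscribe -n 5 python oversubscribe_sketch.py
from mpi4py import MPI  # importing mpi4py.MPI calls MPI_Init_thread

comm = MPI.COMM_WORLD
print(f"rank {comm.Get_rank()} of {comm.Get_size()}")
# The test suite then spawns a few more processes on top of these 5 ranks,
# pushing the runner well past its 2 virtual cores.
```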

PS: I'm now running these tests daily, so the regression should come from very recent changes pushed to main.

@dalcinl
Contributor Author

dalcinl commented Feb 22, 2023

@rhc54 Maybe you can shed some light on this one?

@rhc54
Contributor

rhc54 commented Feb 22, 2023

I'm afraid I cannot extract the error from those outputs. Can you provide the specific error that is motivating this issue? I saw something about a segfault in the sm BTL and then a "server cannot be found" message, but that has nothing to do with me.

@dalcinl
Contributor Author

dalcinl commented Feb 22, 2023

@rhc54 The issue is happening in MPI_Init_thread of the child processes. Given that spawn is involved, I assumed this could be related to PRRTE, but I was probably wrong. Sorry for the noise.

@hppritcha Maybe your changes from #11305?

This is the error from the logs:

testErrcodes (test_spawn.TestSpawnMultipleWorldMany) ... testErrcodes (test_spawn.TestSpawnMultipleWorldMany) ... [fv-az988-774:131667] *** Process received signal ***
[fv-az988-774:131667] Signal: Segmentation fault (11)
[fv-az988-774:131667] Signal code: Address not mapped (1)
[fv-az988-774:131667] Failing at address: 0x180
[fv-az988-774:131667] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fc4b29e4520]
[fv-az988-774:131667] [ 1] /usr/local/lib/libmpi.so.0(+0x2bc553)[0x7fc4b1f89553]
[fv-az988-774:131667] [ 2] /usr/local/lib/libmpi.so.0(mca_pml_ob1_recv_frag_callback_match+0x169)[0x7fc4b1f8b8c2]
[fv-az988-774:131667] [ 3] /usr/local/lib/libopen-pal.so.0(mca_btl_sm_poll_handle_frag+0x1bb)[0x7fc4b1c5b1ee]
[fv-az988-774:131667] [ 4] /usr/local/lib/libopen-pal.so.0(+0xd62f9)[0x7fc4b1c5b2f9]
[fv-az988-774:131667] [ 5] /usr/local/lib/libopen-pal.so.0(+0xd65d8)[0x7fc4b1c5b5d8]
[fv-az988-774:131667] [ 6] /usr/local/lib/libopen-pal.so.0(opal_progress+0x34)[0x7fc4b1bab822]
[fv-az988-774:131667] [ 7] /usr/local/lib/libmpi.so.0(+0xb058c)[0x7fc4b1d7d58c]
[fv-az988-774:131667] [ 8] /usr/local/lib/libmpi.so.0(ompi_mpi_instance_init+0x79)[0x7fc4b1d7d933]
[fv-az988-774:131667] [ 9] /usr/local/lib/libmpi.so.0(ompi_mpi_init+0x1a6)[0x7fc4b1d6b55b]
[fv-az988-774:131667] [10] /usr/local/lib/libmpi.so.0(PMPI_Init_thread+0xe0)[0x7fc4b1dd6cf8]
[fv-az988-774:131667] [11] /home/runner/work/mpi4py-testing/mpi4py-testing/mpi4py/build/lib.linux-x86_64-cpython-310/mpi4py/MPI.cpython-310-x86_64-linux-gnu.so(+0x40390)[0x7fc4b2181390]
[fv-az988-774:131667] [12] /home/runner/work/mpi4py-testing/mpi4py-testing/mpi4py/build/lib.linux-x86_64-cpython-310/mpi4py/MPI.cpython-310-x86_64-linux-gnu.so(+0x1831fc)[0x7fc4b22c41fc]
[fv-az988-774:131667] [13] /opt/hostedtoolcache/Python/3.10.10/x64/lib/libpython3.10.so.1.0(PyModule_ExecDef+0x73)[0x7fc4b2de39c3]
[fv-az988-774:131667] [14] /opt/hostedtoolcache/Python/3.10.10/x64/lib/libpython3.10.so.1.0(+0x23acc0)[0x7fc4b2e10cc0]
[fv-az988-774:131667] [15] /opt/hostedtoolcache/Python/3.10.10/x64/lib/libpython3.10.so.1.0(+0x17bfa7)[0x7fc4b2d51fa7]
[fv-az988-774:131667] [16] /opt/hostedtoolcache/Python/3.10.10/x64/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x6042)[0x7fc4b2d91f32]
[fv-az988-774:131667] [17] /opt/hostedtoolcache/Python/3.10.10/x64/lib/libpython3.10.so.1.0(+0x1b5116)[0x7fc4b2d8b116]
[fv-az988-774:131667] [18] /opt/hostedtoolcache/Python/3.10.10/x64/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x51ed)[0x7fc4b2d910dd]
[fv-az988-774:131667] [19] /opt/hostedtoolcache/Python/3.10.10/x64/lib/libpython3.10.so.1.0(+0x1b5116)[0x7fc4b2d8b116]
[fv-az988-774:131667] [20] /opt/hostedtoolcache/Python/3.10.10/x64/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x792)[0x7fc4b2d8c682]
[fv-az988-774:131667] [21] /opt/hostedtoolcache/Python/3.10.10/x64/lib/libpython3.10.so.1.0(+0x1b5116)[0x7fc4b2d8b116]
[fv-az988-774:131667] [22] /opt/hostedtoolcache/Python/3.10.10/x64/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x357)[0x7fc4b2d8c247]
[fv-az988-774:131667] [23] /opt/hostedtoolcache/Python/3.10.10/x64/lib/libpython3.10.so.1.0(+0x1b5116)[0x7fc4b2d8b116]
[fv-az988-774:131667] [24] /opt/hostedtoolcache/Python/3.10.10/x64/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x357)[0x7fc4b2d8c247]
[fv-az988-774:131667] [25] /opt/hostedtoolcache/Python/3.10.10/x64/lib/libpython3.10.so.1.0(+0x1b5116)[0x7fc4b2d8b116]
[fv-az988-774:131667] [26] /opt/hostedtoolcache/Python/3.10.10/x64/lib/libpython3.10.so.1.0(+0x1562a6)[0x7fc4b2d2c2a6]
[fv-az988-774:131667] [27] /opt/hostedtoolcache/Python/3.10.10/x64/lib/libpython3.10.so.1.0(_PyObject_CallMethodIdObjArgs+0x13a)[0x7fc4b2d2d7da]
[fv-az988-774:131667] [28] /opt/hostedtoolcache/Python/3.10.10/x64/lib/libpython3.10.so.1.0(PyImport_ImportModuleLevelObject+0x3a2)[0x7fc4b2da2112]
[fv-az988-774:131667] [29] /opt/hostedtoolcache/Python/3.10.10/x64/lib/libpython3.10.so.1.0(+0x1b36bc)[0x7fc4b2d896bc]
[fv-az988-774:131667] *** End of error message ***
[fv-az988-774:00000] *** An error occurred in MPI_Init_thread
[fv-az988-774:00000] *** reported by process [520945736,0]
[fv-az988-774:00000] *** on a NULL communicator
[fv-az988-774:00000] *** Unknown error
[fv-az988-774:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[fv-az988-774:00000] ***    and MPI will try to terminate your MPI job as well)
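For reference, the failing test (testErrcodes in test_spawn.TestSpawnMultipleWorldMany) exercises Spawn_multiple; a minimal, hypothetical analogue of that pattern is sketched below (the real test differs in detail). The crash is in the spawned children, whose `import mpi4py.MPI` triggers the MPI_Init_thread call seen near the top of the backtrace above.

```python
# spawn_multiple_sketch.py -- a hypothetical, minimal analogue of the failing
# test pattern; the real test_spawn.TestSpawnMultipleWorldMany differs in detail.
import sys
from mpi4py import MPI

comm = MPI.COMM_WORLD
# Spawn two commands, one process each; every child re-enters MPI_Init_thread
# while importing mpi4py.MPI, which is where the backtrace above segfaults.
commands = [sys.executable] * 2
args = [["-c", "from mpi4py import MPI; MPI.Comm.Get_parent().Disconnect()"]] * 2
maxprocs = [1, 1]
errcodes = []
child = comm.Spawn_multiple(commands, args, maxprocs,
                            root=0, errcodes=errcodes)
# mpi4py fills `errcodes` in place; the real test then checks that every
# returned code equals MPI.SUCCESS.
child.Disconnect()
```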

@hppritcha
Member

Hmm, I thought the last commit I pushed to #11305 addressed this behavior.

@dalcinl
Contributor Author

dalcinl commented Feb 23, 2023

@hppritcha I can confirm the regression is related to your changes.

@jsquyres jsquyres changed the title from "mpi4py: Regressions in main" to "mpi4py: Regressions in main: segv in MPI_Init_thread" on Feb 23, 2023
@hppritcha
Member

@dalcinl do you see the behavior with any of the non-spawn tests?

@dalcinl
Contributor Author

dalcinl commented Feb 23, 2023

@hppritcha Well, I'm not sure what to say... It depends on the day 😞

The first failure I got was from yesterday's scheduled build (commit 478b6b2):
https://github.com/mpi4py/mpi4py-testing/actions/runs/4238187139/jobs/7368941976
Singleton mode and np=1 through np=4 were all OK; things only failed with np=5.

This other failure is from a build I triggered manually today (same commit 478b6b2):
https://github.com/mpi4py/mpi4py-testing/actions/runs/4250501476/jobs/7391684510
This time the failure happened in singleton mode, and the run was stopped as failed at that point.
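For clarity, "singleton mode" here means starting the interpreter directly, without mpiexec, so MPI_Init_thread bootstraps a one-process job on its own; the np=N runs go through mpiexec. Roughly, under the assumption of a hypothetical driver script name (the actual CI test driver differs):

```python
# launch_modes_sketch.py -- hypothetical name; illustrates the two launch
# modes mentioned above, not the actual CI test driver.
#
# Singleton mode (no mpiexec; MPI_Init_thread bootstraps a 1-process job):
#   python launch_modes_sketch.py
# Regular mode, as used for the np=1 .. np=5 entries of the CI matrix:
#   mpiexec -n 5 python launch_modes_sketch.py
from mpi4py import MPI

print("world size:", MPI.COMM_WORLD.Get_size())
```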

@hppritcha hppritcha self-assigned this Feb 24, 2023
@hppritcha
Member

Closed via #11445. There is no need to address this in v5.0.0.
