
OSC segfaults, v2.x #3267

Closed
artpol84 opened this issue Apr 1, 2017 · 9 comments

artpol84 (Contributor) commented Apr 1, 2017

Today I noticed a segfault in our MTT:

# OSU MPI_Get Bandwidth Test v5.3.2
# Window creation: MPI_Win_allocate
# Synchronization: MPI_Win_flush
# Size      Bandwidth (MB/s)
[boo13:12788] *** Process received signal ***
[boo13:12788] Signal: Segmentation fault (11)
[boo13:12788] Signal code: Address not mapped (1)
[boo13:12788] Failing at address: (nil)
[boo13:12788] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x7ffff7608100]
[boo13:12788] [ 1] <mtt-base>/install/lib/openmpi/mca_osc_rdma.so(ompi_osc_rdma_unlock_atomic+0x1f9)[0x7fffe88639c9]
[boo13:12788] [ 2] <mtt-base>/install/lib/libmpi.so.20(PMPI_Win_unlock+0x1b)[0x7ffff7875f2b]
[boo13:12788] [ 3] <mtt-base>/tests/osu_micro_benchmark/osu-micro-benchmarks-5.3.2/mpi/one-sided/osu_get_bw[0x401f25]
[boo13:12788] [ 4] <mtt-base>/tests/osu_micro_benchmark/osu-micro-benchmarks-5.3.2/mpi/one-sided/osu_get_bw[0x4017e5]
[boo13:12788] [ 5] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ffff7258b15]
[boo13:12788] [ 6] <mtt-base>/tests/osu_micro_benchmark/osu-micro-benchmarks-5.3.2/mpi/one-sided/osu_get_bw[0x401851]
[boo13:12788] *** End of error message ***
srun: error: boo13: task 0: Segmentation fault

It had been running fine for at least a couple of weeks before this. It might be related to the recently merged #3045.
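
For reference, here is a minimal sketch (not the benchmark's actual code) of the access pattern the failing OSU test exercises: a window from MPI_Win_allocate, a passive-target lock, MPI_Get plus MPI_Win_flush, and finally MPI_Win_unlock, which is where the crash shows up in ompi_osc_rdma_unlock_atomic above. The buffer size and loop structure are placeholders.

/* Minimal sketch of the failing pattern; build with mpicc, run with 2 ranks.
 * COUNT is a placeholder size, not taken from the benchmark. */
#include <mpi.h>
#include <stdlib.h>

#define COUNT 4096

int main(int argc, char **argv)
{
    int rank;
    char *base = NULL, *buf = NULL;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Window creation: MPI_Win_allocate (as in the test header above) */
    MPI_Win_allocate(COUNT, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);
    buf = malloc(COUNT);

    if (rank == 0) {
        /* Synchronization: MPI_Win_flush inside a passive-target epoch */
        MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
        MPI_Get(buf, COUNT, MPI_CHAR, 1, 0, COUNT, MPI_CHAR, win);
        MPI_Win_flush(1, win);
        MPI_Win_unlock(1, win);   /* crash reported in ompi_osc_rdma_unlock_atomic */
    }

    MPI_Barrier(MPI_COMM_WORLD);
    free(buf);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}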

artpol84 (Contributor, author) commented Apr 1, 2017

@karasevb, FYI.

hjelmn (Member) commented Apr 1, 2017

Hmm. And master is ok?

artpol84 (Contributor, author) commented Apr 1, 2017

We don't test master.

artpol84 (Contributor, author) commented Apr 1, 2017

And this problem is 100% reproducible. I see it in all runs now.

artpol84 (Contributor, author) commented Apr 1, 2017

The OMPI config is pretty basic: no Mellanox components are in this build.
We do use external pmix-v1.2, hwloc, and libevent.
We also build with --with-platform=contrib/platform/mellanox/optimized, but no Mellanox components are found.

hjelmn (Member) commented Apr 3, 2017

I think I have an idea as to where the problem is. All my testing is done on a platform that has both fetching and non-fetching atomics. libibverbs only provides fetching. Will see if I can trigger the problem by disabling non-fetching atomics in btl/ugni.
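
To make the fetching vs. non-fetching distinction concrete: the only atomic work-request opcodes libibverbs exposes are IBV_WR_ATOMIC_FETCH_AND_ADD and IBV_WR_ATOMIC_CMP_AND_SWP, and both always return the previous remote value into a local buffer, i.e. they are fetching. The sketch below is illustrative only; the qp, rkey, and buffer setup are assumed boilerplate, not code from Open MPI.

/* Illustration: posting a verbs fetch-and-add. There is no non-fetching
 * variant in libibverbs -- the old value always lands in the local SGE. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

static int post_fetch_add(struct ibv_qp *qp, struct ibv_sge *result_sge,
                          uint64_t remote_addr, uint32_t rkey, uint64_t add)
{
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_ATOMIC_FETCH_AND_ADD;  /* fetching atomic */
    wr.sg_list    = result_sge;   /* local buffer that receives the old value */
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.wr.atomic.remote_addr = remote_addr;
    wr.wr.atomic.rkey        = rkey;
    wr.wr.atomic.compare_add = add;               /* operand for the add */

    return ibv_post_send(qp, &wr, &bad_wr);
}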

hjelmn added a commit to hjelmn/ompi that referenced this issue Apr 3, 2017
hppritcha reopened this Apr 4, 2017

hppritcha (Member) commented:

This fixes a regression that went into v2.x after the v2.1.0 release.

hjelmn added a commit to hjelmn/ompi that referenced this issue Apr 10, 2017

Fixes open-mpi#3267

Signed-off-by: Nathan Hjelm <[email protected]>
(cherry picked from commit fad0803)
Signed-off-by: Nathan Hjelm <[email protected]>
jsquyres (Member) commented:

Looks like this is now fixed.

hppritcha (Member) commented:

This issue was fixed by PRs #3274, #3314, and #3315.

hppritcha reopened this Apr 11, 2017
hjelmn closed this as completed Apr 11, 2017