
Also run the CI tests with shared windows disabled #184


Closed
fuerlinger opened this issue Dec 4, 2016 · 10 comments

@fuerlinger
Contributor

Once DART successfully builds again with disabled shared windows, we should run CI tests with both configurations: enabled and disabled.

Running with shared windows disabled more closely resembles the distributed-memory setting and might uncover issues that would otherwise go unnoticed. It is also needed for NastyMPI to work (intercepting puts and gets).
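To illustrate the NastyMPI point: tools based on the PMPI profiling interface only see transfers that actually go through MPI calls. With shared windows enabled, a put to a unit on the same node can be carried out as a plain memory store, which such a tool never sees, so only the disabled configuration exercises the interception path. Below is a minimal, hedged sketch of such a wrapper (illustrative only, not NastyMPI's actual implementation):

```c
#include <mpi.h>
#include <stdio.h>

/* PMPI-style wrapper: the linker resolves MPI_Put to this function,
 * which forwards to the real implementation via PMPI_Put.  A tool
 * like NastyMPI can log, delay, or reorder the transfer here. */
int MPI_Put(const void  *origin_addr, int origin_count,
            MPI_Datatype origin_datatype, int target_rank,
            MPI_Aint     target_disp, int target_count,
            MPI_Datatype target_datatype, MPI_Win win)
{
  fprintf(stderr, "intercepted MPI_Put to rank %d\n", target_rank);
  return PMPI_Put(origin_addr, origin_count, origin_datatype,
                  target_rank, target_disp, target_count,
                  target_datatype, win);
}
```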

@fmoessbauer
Member

I agree that this should definitely be tested. However, this (again) increases the time the CI takes to finish.

Implemented in branch feat-184-ci-sw.

@fmoessbauer
Member

On CircleCI, the tests fail in the OpenMPI 2 container when running with more than one unit: see the CI output.

@fuchsto please have a look at the logs.

@fmoessbauer fmoessbauer added this to the dash-0.3.0 milestone Dec 6, 2016
@fuchsto
Member

fuchsto commented Dec 6, 2016

It's fixed in branch feat-52-nastympi and will be resolved with the merge of PR #170.

See branch feat-184-shwin

@fuchsto
Member

fuchsto commented Dec 7, 2016

@rkowalewski @fmoessbauer @fuerlinger

In PR #189, I re-enabled MPI shared windows in CI builds as there apparently is a bug in OpenMPI 2, if I'm not mistaken.

There is a deadlock where one unit is entering a barrier, and another unit calls MPI_Put to a window shared with the barrier'ed unit (but in its own segment):

[Screenshot: dash-log-openmpi2-deadlock_annotated — annotated log excerpt showing the deadlock]

This should not block, if I understand the MPI standard correctly.
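A minimal sketch of the pattern, assuming a window created with MPI_Win_allocate_shared and a passive-target epoch (this is a hypothetical reproducer to illustrate the scenario, not code taken from DART or from the failing test):

```c
#include <mpi.h>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  int world_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  /* Communicator of units that can share memory (a single node in CI). */
  MPI_Comm shm_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &shm_comm);
  int shm_rank;
  MPI_Comm_rank(shm_comm, &shm_rank);

  int     *baseptr;
  MPI_Win  win;
  MPI_Win_allocate_shared(sizeof(int), sizeof(int), MPI_INFO_NULL,
                          shm_comm, &baseptr, &win);
  MPI_Win_lock_all(0, win);

  if (world_rank == 1) {
    /* Unit 1 is already waiting in a barrier ... */
    MPI_Barrier(MPI_COMM_WORLD);
  } else {
    /* ... while the other units put into their own segment of the
     * window shared with unit 1.  Per MPI-3 RMA semantics this should
     * complete independently of unit 1's progress; in the failing run,
     * a put like this appeared to block on OpenMPI 2. */
    int value = 42;
    MPI_Put(&value, 1, MPI_INT, shm_rank, 0, 1, MPI_INT, win);
    MPI_Win_flush(shm_rank, win);
    MPI_Barrier(MPI_COMM_WORLD);
  }

  MPI_Win_unlock_all(win);
  MPI_Win_free(&win);
  MPI_Finalize();
  return 0;
}
```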

For the full log, see CircleCI build no. 646:
https://circleci.com/gh/dash-project/dash/646

Note that this failed for OpenMPI 2 only.
For MPICH and OpenMPI 1, the Minimal and NastyMPI test configurations passed.

This defect could not be provoked by NastyMPI, another hint that it's actually a bug in OpenMPI 2.

@rkowalewski

rkowalewski commented Dec 7, 2016

@fuchsto Actually, you have been a bit faster than me. You are totally right: OpenMPI 2 blocks in one-sided communication. However, this issue happens only if we use multiple MPI windows, which is actually the case in DART. The issue was discovered recently and is documented in Issue #2530 in the Open MPI repository.

Additionally, we have another bug in dart_exit: a memory leak, since we do not free the memory acquired for local memory allocations. Surprisingly, this causes a segmentation fault in Open MPI 2 when MPI_Finalize is executed; in fact, none of the test cases has ever terminated successfully. I fixed this in a recent commit and also created an issue for Open MPI, since this segmentation fault should, of course, not happen.
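For reference, a hedged sketch of the kind of teardown that was missing; the names free_local_allocations, local_win, and local_baseptr are placeholders, not the actual DART symbols. The point is that every window and every buffer obtained through MPI allocation routines has to be released before MPI_Finalize:

```c
#include <mpi.h>
#include <stddef.h>

/* Placeholder handles standing in for DART's bookkeeping of local
 * memory allocations (illustrative only). */
static MPI_Win  local_win     = MPI_WIN_NULL;
static void    *local_baseptr = NULL;

/* Called during shutdown, before MPI_Finalize. */
static void free_local_allocations(void)
{
  /* Memory attached to a window created with MPI_Win_allocate(_shared)
   * is released together with the window. */
  if (local_win != MPI_WIN_NULL) {
    MPI_Win_free(&local_win);
  }
  /* Buffers obtained separately via MPI_Alloc_mem must be freed
   * explicitly; leaking allocations like these is what triggered the
   * segmentation fault in MPI_Finalize on Open MPI 2. */
  if (local_baseptr != NULL) {
    MPI_Free_mem(local_baseptr);
    local_baseptr = NULL;
  }
}
```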

@fuchsto
Member

fuchsto commented Dec 7, 2016

Okay, so long story short:

We won't use OpenMPI 2 in CI for now.

This is just one of several arcane defects in OpenMPI 2 that I have stumbled upon, and fixing them doesn't help DASH/DART at all.

@rkowalewski

We should revert to OpenMPI 1.10.x

@fmoessbauer
Member

Yes and No - I recommend the following:

  • We keep the OpenMPI 2 container (but do not run it in CI), so from time to time we can manually test whether OpenMPI 2 works for us.
  • We update the OpenMPI container to OpenMPI 1.10.latest

This is also beneficial because, if we use only two CircleCI containers, we can run two commits at a time 👍

@fuchsto
Member

fuchsto commented Dec 7, 2016

Yes, that's my plan. Once they get a grip on RDMA again, we will use it.
Good news is that we only need two CircleCI containers now.

... emm, as you already mentioned.

@fuchsto
Member

fuchsto commented Dec 7, 2016

Fixed in #189

@fuchsto fuchsto closed this as completed Dec 7, 2016