
ompi-tests/oneside/test_start2 hang with large group size on ompi master #9447


Closed
wzamazon opened this issue Sep 30, 2021 · 10 comments

@wzamazon
Contributor

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

ompi master branch + PR #9400

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

compiled from source

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running

  • Operating system/version: Amazon Linux 2
  • Computer hardware: Intel CPU
  • Network type: TCP

Details of the problem

I encountered this issue when testing my PR #9400, which fixes a set of issues with osc/rdma + btl/tcp. I did some debugging and found that the root cause of the issue is not my PR, but something in osc/rdma.

But first, I want to provide a reproducer:

  1. Clone the Open MPI master branch
  2. Apply PR #9400 (osc/rdma, btl/tcp: fix various issues with osc/rdma)
  3. Compile Open MPI
  4. Compile the ompi-tests/onesided tests
  5. Run the onesided/test_start2 test over TCP with a large group size (I used 72 ranks on 2 nodes). The following is the command line I used:
mpirun -np 72 -N 36 --machinefile /fsx/ALinux2/job/prod/2instances \
        --output tag --mca osc rdma \
        --mca btl tcp,self \
        /fsx/ALinux2/dev/openmpi/ompi-tests/onesided/test_start2

The test will hang.

The start2 test performs the post-start-complete-wait synchronization on a group consisting of all ranks in the world communicator.

Initial investigation shows that different ranks hang at different stages:

  • Some ranks passed the post stage and hang in the start stage.
  • Other ranks hang in the post stage.

@wzamazon
Contributor Author

The root cause of this hang is a deadlock. Among the ranks that hang at the post stage, there is one special rank: it has submitted an atomic operation and is waiting for its completion, for which it calls the progress engine (opal_progress). The completion has already arrived. However, because other ranks keep sending messages to this special rank, the progress engine keeps processing these incoming messages and never finishes. Note that the very reason the other ranks keep sending messages to this special rank is that this special rank cannot move forward.

The details are as follows:

In the post stage, each rank needs to set up a channel to every other rank in the group. osc/rdma implements that with two btl atomic operations:

First, a rank uses btl's atomic fetch to get a post_index from the peer.

Second, a rank uses btl's atomic compare-and-swap to occupy a communication channel on the peer. One thing worth noting is that the number of communication channels is limited (https://github.com/open-mpi/ompi/blob/master/ompi/mca/osc/rdma/osc_rdma_types.h#L113), so channel_ID is post_index % number_of_channels. If the designated channel is already occupied, the rank has to wait.

The problem is that both atomic operations are implemented as blocking operations: osc/rdma submits an atomic operation and then calls the progress engine (opal_progress) to wait for it to complete. However, for one rank, opal_progress has an endless stream of events to process and never finishes. Other ranks are waiting for this rank to release a communication channel, so they hang too.
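
To make this concrete, the following is a minimal sketch of the blocking-wait pattern (illustrative only; every name except opal_progress is hypothetical and not the actual osc/rdma code):

#include <stdbool.h>

/* Assumed prototype for this sketch; the real declaration lives in
   opal/runtime/opal_progress.h. */
extern void opal_progress(void);

/* Hypothetical completion flag, set by the btl's completion callback once
   the atomic fetch or compare-and-swap has finished. */
static volatile bool atomic_complete = false;

static void wait_for_atomic_completion(void)
{
    /* The atomic has already been submitted to the btl.  If opal_progress()
       keeps finding new incoming events to handle (e.g. active-message
       atomics from other ranks) and never returns, this loop never gets to
       re-check the flag, and the rank appears hung here even though its own
       completion has already arrived. */
    while (!atomic_complete) {
        opal_progress();
    }
}

Both the post_index fetch and every retry of the compare-and-swap go through a wait like this, so a rank that is stuck inside opal_progress also blocks every peer waiting on one of its channels.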

The following is an example of how the deadlock can happen:

We have a group of 72 ranks.

Rank X is the special rank.

All other 71 ranks need to set up a post channel to rank X; the first 32 are the lucky ones, and each of them gets a channel.

The remaining 39 ranks have to wait. They keep calling btl_cswap on rank X, waiting for a channel to become available. Because btl/tcp uses active-message RDMA for atomics, these 39 ranks keep sending messages to rank X.

Rank X itself needs to set up post channels to the others; for that it submits btl atomic operations, then calls opal_progress to wait for the atomic operation to finish. opal_progress actually receives the response rank X is waiting for. However, because the other 39 ranks keep sending messages to rank X, opal_progress has to process these requests, so it never finishes.

@bosilca
Member

bosilca commented Oct 1, 2021

I'm not sure I understand how you can have endless events on a particular rank that would block opal_progress from returning upstream. opal_progress only pulls events from the network once, and once all are completely handled it returns to the caller, which means the upper level gets a chance to move forward if it has received the expected answer.

@wzamazon
Contributor Author

wzamazon commented Oct 1, 2021

opal_progress only pulls events from the network once, and once all are completely handled it returns to the caller, which means the upper level gets a chance to move forward if it has received the expected answer.

Thank you! This is what I observed, and I also found it puzzling. I will look into opal_progress more.

@wzamazon
Contributor Author

wzamazon commented Oct 5, 2021

@bosilca

Earlier you mentioned that

opal_progress only pulls events from the network once

I believe you meant that opal_progress calls libevent's event_base_loop with the EVLOOP_ONCE flag.

I found that it is not the case. The default opal_progress_event_flag indeed has OPAL_EVLOOP_ONCE, as seen here

However, this flag is removed during MPI_Init (here):

#if OPAL_ENABLE_PROGRESS_THREADS == 0
    /* Start setting up the event engine for MPI operations.  Don't
       block in the event library, so that communications don't take
       forever between procs in the dynamic code.  This will increase
       CPU utilization for the remainder of MPI_INIT when we are
       blocking on RTE-level events, but may greatly reduce non-TCP
       latency. */
    opal_progress_set_event_flag(OPAL_EVLOOP_NONBLOCK);
#endif

So the behavior of opal_progress is indeed to poll the network until there is no active event, which causes the deadlock in test_start2 (a similar deadlock can be seen with test_lock4).

I wonder which is the intended behavior of opal_progress: poll once or poll until there is no active event?

@bwbarrett @rhc54, I would appreciate your input on this topic.

@rhc54
Contributor

rhc54 commented Oct 6, 2021

I've always used EVLOOP_ONCE as I've encountered similar issues with the NONBLOCK option. However, that is in the RTE - I don't have much to do with opal_progress. I believe it is possible to make NONBLOCK work, but it might take a significant effort to ensure you coordinate across all parts of the code base that use the event library so you don't deadlock.

@bosilca
Member

bosilca commented Oct 6, 2021

EVLOOP_ONCE waits until something, either an event or a timeout, happens. EVLOOP_NONBLOCK does one iteration and then returns whatever happened during that iteration. Thus, in both cases, as soon as at least one event has been triggered (and apparently you have more than one in your example), the call should have returned. This is well documented in the libevent documentation and in the source code, event.c:2012.

@wzamazon
Contributor Author

wzamazon commented Oct 6, 2021

I am not sure I agree with this interpretation of EVLOOP_NONBLOCK and EVLOOP_ONCE.

The following is an excerpt from https://github.com/libevent/libevent/blob/master/event.c (between line 1981 and line 2054)

	while (!done) {
                .....

		if (N_ACTIVE_CALLBACKS(base)) {
			int n = event_process_active(base);
			if ((flags & EVLOOP_ONCE)
			    && N_ACTIVE_CALLBACKS(base) == 0
			    && n != 0)
				done = 1;
		} else if (flags & EVLOOP_NONBLOCK)
			done = 1;
	}

My reading of the code is:

By default, event_base_loop is blocking, i.e. it will wait for something to happen.

EVLOOP_NONBLOCK means non-blocking: if nothing has happened (N_ACTIVE_CALLBACKS(base) == 0), return immediately.

EVLOOP_ONCE means: when something has happened, only do one iteration.

So without the EVLOOP_ONCE flag, event_base_loop will loop until there is no active event.
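
To sanity-check this reading outside of Open MPI, the following is a small standalone sketch (hypothetical demo code, not from the test suite) that contrasts the two flags. A non-persistent timer re-arms itself with a zero timeout from its own callback, simulating a peer that always has more work ready:

/* Build with: cc demo.c -levent  (assumes libevent 2.1+ for event_self_cbarg). */
#include <stdio.h>
#include <event2/event.h>

static int invocations;

static void cb(evutil_socket_t fd, short what, void *arg)
{
    struct event *self = arg;          /* passed in via event_self_cbarg() */
    struct timeval zero = { 0, 0 };

    invocations++;
    if (invocations < 1000)
        evtimer_add(self, &zero);      /* there is always more work pending */
}

int main(void)
{
    struct timeval zero = { 0, 0 };
    struct event_base *base = event_base_new();
    struct event *ev = evtimer_new(base, cb, event_self_cbarg());

    /* With only EVLOOP_NONBLOCK, the loop keeps dispatching as long as every
       pass produces an active callback, so all 1000 re-arms are drained. */
    invocations = 0;
    evtimer_add(ev, &zero);
    event_base_loop(base, EVLOOP_NONBLOCK);
    printf("EVLOOP_NONBLOCK: %d callbacks in one call\n", invocations);

    /* With EVLOOP_ONCE, the loop stops once the active queue has been drained
       after a single dispatch, so only one callback runs per call. */
    invocations = 0;
    evtimer_add(ev, &zero);
    event_base_loop(base, EVLOOP_ONCE);
    printf("EVLOOP_ONCE:     %d callbacks in one call\n", invocations);

    event_free(ev);
    event_base_free(base);
    return 0;
}

If my reading of event.c is right, the EVLOOP_NONBLOCK call should report 1000 callbacks while the EVLOOP_ONCE call reports only 1, which mirrors what opal_progress does once OPAL_EVLOOP_ONCE is dropped.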

@bosilca
Member

bosilca commented Oct 6, 2021

You're right: NONBLOCK is never tested as long as there are active events in the system, a behavior that contradicts what the documentation states. In fact, EVLOOP_ONCE also behaves differently than documented, as it only returns if, at the end of one iteration, all triggered events were completed and at least one of them was non-internal.

@wzamazon
Contributor Author

wzamazon commented Oct 6, 2021

@bosilca Thank you!

Based on the comment in the code and the commit message, I believe the current code intended to make sure OPAL_EVLOOP_NONBLOCK is set in opal_progress_event_flags, but removed OPAL_EVLOOP_ONCE by accident.

The comment for the function opal_progress_set_event_flag is also outdated, which might have caused the problem.

So I opened #9480 to fix the issues.
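
For reference, a minimal sketch of the direction of the fix (illustrative only, assuming opal_progress_set_event_flag replaces the whole flag set; see #9480 for the actual change):

#if OPAL_ENABLE_PROGRESS_THREADS == 0
    /* Illustrative sketch only; the actual change is in #9480.  Keep
       OPAL_EVLOOP_ONCE alongside OPAL_EVLOOP_NONBLOCK so that a single
       opal_progress() call does one bounded pass instead of looping for as
       long as new events keep arriving. */
    opal_progress_set_event_flag(OPAL_EVLOOP_ONCE | OPAL_EVLOOP_NONBLOCK);
#endif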

@wzamazon
Contributor Author

PR merged and backported (#9573).
