
Deadlock with UCX when performing MPI_Fetch_and_op #6546

Closed
devreal opened this issue Mar 29, 2019 · 13 comments

@devreal
Contributor

devreal commented Mar 29, 2019

I have configured Open MPI 4.0.1 with Open UCX 1.5 and with IB verbs disabled (both Open MPI and Open UCX were compiled with --with-debug). I'm running a benchmark that performs a series of MPI_Fetch_and_op operations, with the target rank waiting for all operations from the other ranks to finish (by waiting on an ibarrier) while randomly performing local updates using MPI_Fetch_and_op. I'm attaching the code; it's the same as that used in #6536.
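
To make the access pattern clearer, here is a rough sketch of what the benchmark does (simplified: the names, loop bounds and datatype here are made up, and the real code performs the local updates only randomly; the actual reproducer is in the attached tarball below):

#include <mpi.h>
#include <stdint.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    uint64_t *baseptr;
    MPI_Win win;
    MPI_Win_allocate(sizeof(uint64_t), sizeof(uint64_t), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &baseptr, &win);
    *baseptr = 0;
    MPI_Barrier(MPI_COMM_WORLD);   /* make sure the target value is initialized */
    MPI_Win_lock_all(0, win);

    uint64_t one = 1, result;
    if (rank != 0) {
        /* origin ranks: a stream of fetch-and-ops targeting rank 0 */
        for (int i = 0; i < 1000; ++i) {
            MPI_Fetch_and_op(&one, &result, MPI_UINT64_T, 0, 0, MPI_SUM, win);
            MPI_Win_flush(0, win);
        }
    }

    /* all ranks enter a non-blocking barrier; the target keeps performing
       local fetch-and-ops on its own window while waiting for it to complete */
    MPI_Request req;
    int done = 0;
    MPI_Ibarrier(MPI_COMM_WORLD, &req);
    while (!done) {
        if (rank == 0) {
            MPI_Fetch_and_op(&one, &result, MPI_UINT64_T, 0, 0, MPI_SUM, win);
            MPI_Win_flush(0, win);
        }
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);
    }

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}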

Running with 2 ranks on 2 nodes of our IB cluster, the application gets stuck in the first MPI_Fetch_and_op. DDT reports:

Processes,Threads,Function
1,1,main (mpi_fetch_op_local_remote.c:55)
1,1,  PMPI_Fetch_and_op
1,1,    ompi_osc_ucx_fetch_and_op
1,1,      ucp_worker_progress (ucp_worker.c:1426)
1,1,        uct_worker_progress (uct.h:1677)
1,1,          ucs_callbackq_dispatch (callbackq.h:209)
1,1,            uct_rc_verbs_iface_progress (rc_verbs_iface.c:116)
1,1,              uct_rc_verbs_iface_poll_tx (rc_verbs_iface.c:83)
1,1,                uct_ib_poll_cq (ib_device.h:289)
1,1,                  ibv_poll_cq (verbs.h:2056)
1,1,                    ??
1,1,main (mpi_fetch_op_local_remote.c:74)
1,1,  PMPI_Fetch_and_op
1,1,    ompi_osc_ucx_fetch_and_op
1,1,      ucp_atomic_fetch_nb (amo_send.c:141)
1,1,        ucp_rma_send_request_cb (rma.inl:20)
1,1,          ucp_request_send (ucp_request.inl:201)
1,1,            ucp_request_try_send (ucp_request.inl:168)
1,1,              ucp_amo_sw_progress_fetch (amo_sw.c:84)
1,1,                ucp_amo_sw_progress (amo_sw.c:59)
1,1,                  uct_ep_am_bcopy (uct.h:1892)
1,1,                    uct_self_ep_am_bcopy (self.c:280)
1,1,                      uct_self_iface_sendrecv_am (self.c:130)
1,1,                        uct_iface_invoke_am (uct_iface.h:535)
1,1,                          ucp_atomic_req_handler (amo_sw.c:235)
1,1,                            ucp_request_send (ucp_request.inl:201)
1,1,                              ucp_request_try_send (ucp_request.inl:168)
1,1,                                ucp_progress_atomic_reply (amo_sw.c:121)
1,1,                                  uct_ep_am_bcopy (uct.h:1892)
1,1,                                    uct_self_ep_am_bcopy (self.c:280)
1,1,                                      uct_self_iface_sendrecv_am (self.c:133)
1,1,                                        ucs_mpool_put_inline (mpool.inl:77)
1,1,                                          ucs_mpool_obj_to_elem (mpool.inl:64)
2,2,ucs_async_thread_func (thread.c:93)
2,2,  epoll_wait
2,4,progress_engine
2,4,  opal_libevent2022_event_base_loop (event.c:1630)
2,2,    epoll_dispatch (epoll.c:407)
2,2,      epoll_wait
2,2,    poll_dispatch (poll.c:165)
2,2,      poll

The first backtrace is the process writing to the target; the second one (mpi_fetch_op_local_remote.c:74) is the target performing local updates.

The example code:
mpi_fetch_op_local_remote.tar.gz

Build with:

$ mpicc mpi_fetch_op_local_remote.c -o mpi_fetch_op_local_remote -g

Run with:

$ mpirun -n 2 -N 1 ./mpi_fetch_op_local_remote

Things work without problems using the OpenIB adapter.

Please let me know if I can provide more information. I hope the reproducer is helpful for someone.

@jladd-mlnx
Member

@vspetrov can you please take a look?

@hoopoepg
Contributor

hoopoepg commented Apr 16, 2019

@devreal @vspetrov could you try the UCX v1.6 branch?
We fixed a few issues there that could affect this hang.

thank you

@vspetrov

@devreal I was not able to reproduce the hang with OMPI v4.0.x (with the latest UCX v1.5.x or v1.6.x), nor with OMPI master (again with both UCX versions, 1.5.x and 1.6.x). However, the test revealed a bug in osc_ucx_fetch_and_op (both in ompi-master and v4.0.x): there was a crash on a NULL origin_addr with MPI_NO_OP (the last MPI_Fetch_and_op in the test case, right after the main loop). The fixes for that issue for master and v4.0.x are ready: #6599 and #6600.
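
For reference, the crashing call is of this form (just a sketch; target, win and the datatype stand in for whatever the attached reproducer actually uses):

uint64_t result;
/* MPI_NO_OP only fetches the target value; the origin buffer is not read,
   which is why the test passes NULL as origin_addr */
MPI_Fetch_and_op(NULL, &result, MPI_UINT64_T, target, 0, MPI_NO_OP, win);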

To clarify the hang:
@devreal, what do you mean by "Things work without problems using the OpenIB adapter"? What is the command line that passes?

@devreal
Contributor Author

devreal commented Apr 17, 2019

@vspetrov Thanks for looking into this issue. I tried again with #6600, but with no luck: both processes are still stuck. The fetch-and-op after the loop is never actually reached.

If I run with a version of Open MPI built without UCX support, the test runs fine:

mpirun -mca btl_openib_allow_ib 1 -n 2 -N 1 ./mpi_fetch_op_local_remote

I will try again with master and UCX 1.6 and report back.

@devreal
Contributor Author

devreal commented Apr 17, 2019

I see the same behavior for master and UCX 1.6. Any chance I can get more information out of Open MPI and/or UCX? I tried configuring Open MPI with --with-debug and increasing the verbosity with -mca osc_base_verbose 100 -mca osc_ucx_verbose 100, but nothing interesting comes up...
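
For completeness, the verbose run was invoked along these lines (same ranks and binary as in the earlier commands):

$ mpirun -n 2 -N 1 -mca osc_base_verbose 100 -mca osc_ucx_verbose 100 ./mpi_fetch_op_local_remote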

@vspetrov

@devreal could you please run with a "debug" UCX 1.6 build and -x UCX_LOG_LEVEL=debug? How many HCAs does each node have? Could you also try setting UCX_NET_DEVICES explicitly (e.g. UCX_NET_DEVICES=mlx5_0:1)?

@devreal
Contributor Author

devreal commented Apr 18, 2019

Here is the UCX configure summary (not sure whether the IB modules: < > line is something to worry about):

$ ../configure CC=icc CXX=icpc --enable-debug
[...]
configure: =========================================================
configure: UCX build configuration:
configure:       Multi-thread:   Disabled
configure:          MPI tests:   Enabled
configure:        UCT modules:   < ib rdmacm cma >
configure:       CUDA modules:   < >
configure:       ROCM modules:   < >
configure:         IB modules:   < >
configure:        UCM modules:   < >
configure:       Perf modules:   < >
configure: =========================================================

There is only one HCA per node:

$ mpirun -n 1 ibstat
CA 'mlx4_0'
	CA type: MT4099
	Number of ports: 1
	Firmware version: 2.42.5000
	Hardware version: 1
	Node GUID: 0xf4521403005a60a0
	System image GUID: 0xf4521403005a60a3
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 40
		Base lid: 291
		LMC: 0
		SM lid: 507
		Capability mask: 0x02594868
		Port GUID: 0xf4521403005a60a1
		Link layer: InfiniBand

I'm attaching the output of a run with the command line mpirun -n 2 -N 1 -x UCX_LOG_LEVEL=debug -x UCX_NET_DEVICES="mlx4_0:1" ./mpi_fetch_op_local_remote:
ompi_ucx_fetch_op.log.tar.gz

Here is another backtrace taken while the processes are locked up:

Processes,Threads,Function
1,1,main (mpi_fetch_op_local_remote.c:55)
1,1,  PMPI_Fetch_and_op (pfetch_and_op.c:73)
1,1,    ompi_osc_ucx_fetch_and_op (osc_ucx_comm.c:639)
1,1,      end_atomicity (osc_ucx_comm.c:262)
1,1,        opal_common_ucx_wpmem_fetch (common_ucx_wpool.h:477)
1,1,          opal_common_ucx_atomic_fetch (common_ucx.h:186)
1,1,            opal_common_ucx_wait_request (common_ucx.h:141)
1,1,              ucp_worker_progress (ucp_worker.c:1632)
1,1,                uct_worker_progress (uct.h:1806)
1,1,                  ucs_callbackq_dispatch (callbackq.h:211)
1,1,                    uct_mm_iface_progress (mm_iface.c:263)
1,1,                      uct_mm_iface_poll_fifo (mm_iface.c:230)
1,1,main (mpi_fetch_op_local_remote.c:74)
1,1,  PMPI_Fetch_and_op (pfetch_and_op.c:73)
1,1,    ompi_osc_ucx_fetch_and_op (osc_ucx_comm.c:610)
1,1,      start_atomicity (osc_ucx_comm.c:244)
1,1,        opal_common_ucx_wpmem_cmpswp (common_ucx_wpool.h:401)
1,1,          opal_common_ucx_atomic_cswap (common_ucx.h:209)
1,1,            opal_common_ucx_atomic_fetch (common_ucx.h:184)
1,1,              ucp_atomic_fetch_nb (amo_send.c:145)
1,1,                ucp_rma_send_request_cb (rma.inl:20)
1,1,                  ucp_request_send (ucp_request.inl:204)
1,1,                    ucp_request_try_send (ucp_request.inl:169)
1,1,                      ucp_amo_sw_progress_fetch (amo_sw.c:84)
1,1,                        ucp_amo_sw_progress (amo_sw.c:59)
1,1,                          uct_ep_am_bcopy (uct.h:2021)
1,1,                            uct_self_ep_am_bcopy (self.c:283)
1,1,                              uct_self_iface_sendrecv_am (self.c:130)
1,1,                                uct_iface_invoke_am (uct_iface.h:596)
1,1,                                  ucp_atomic_req_handler (amo_sw.c:225)
1,1,                                    ucp_amo_sw_do_fop64 (amo_sw.c:191)
2,2,progress_engine (opal_progress_threads.c:105)
2,2,  opal_libevent2022_event_base_loop (event.c:1630)
2,2,    poll_dispatch (poll.c:165)
2,2,      poll
2,2,progress_engine (pmix_progress_threads.c:230)
2,2,  opal_libevent2022_event_base_loop (event.c:1630)
2,2,    epoll_dispatch (epoll.c:407)
2,2,      epoll_wait
2,2,ucs_async_thread_func (thread.c:93)
2,2,  epoll_wait

Interestingly, if I resume the execution and take another backtrace, the first process seems stuck in ucs_callbackq_dispatch (although that's just two samples), while the second process's stack has moved elsewhere inside UCX:

Processes,Threads,Function
1,1,main (mpi_fetch_op_local_remote.c:74)
1,1,  PMPI_Fetch_and_op (pfetch_and_op.c:73)
1,1,    ompi_osc_ucx_fetch_and_op (osc_ucx_comm.c:610)
1,1,      start_atomicity (osc_ucx_comm.c:244)
1,1,        opal_common_ucx_wpmem_cmpswp (common_ucx_wpool.h:401)
1,1,          opal_common_ucx_atomic_cswap (common_ucx.h:209)
1,1,            opal_common_ucx_atomic_fetch (common_ucx.h:184)
1,1,              ucp_atomic_fetch_nb (amo_send.c:145)
1,1,                ucp_rma_send_request_cb (rma.inl:20)
1,1,                  ucp_request_send (ucp_request.inl:204)
1,1,                    ucp_request_try_send (ucp_request.inl:169)
1,1,                      ucp_amo_sw_progress_fetch (amo_sw.c:84)
1,1,                        ucp_amo_sw_progress (amo_sw.c:59)
1,1,                          uct_ep_am_bcopy (uct.h:2021)
1,1,                            uct_self_ep_am_bcopy (self.c:283)
1,1,                              uct_self_iface_sendrecv_am (self.c:130)
1,1,                                uct_iface_invoke_am (uct_iface.h:596)
1,1,                                  ucp_atomic_req_handler (amo_sw.c:235)
1,1,                                    ucp_request_send (ucp_request.inl:204)
1,1,                                      ucp_request_try_send (ucp_request.inl:169)
1,1,                                        ucp_progress_atomic_reply (amo_sw.c:121)
1,1,                                          uct_ep_am_bcopy (uct.h:2021)
1,1,                                            uct_self_ep_am_bcopy (self.c:283)
1,1,                                              uct_self_iface_sendrecv_am (self.c:133)
1,1,                                                ucs_mpool_put_inline (mpool.inl:79)
1,1,                                                  ucs_mpool_add_to_freelist (mpool.inl:56)

HTH. Please let me know if I can provide anything else :)

@vspetrov

Couldn't find anything suspicious in the log at this log level. @devreal, could we try again with UCX_LOG_LEVEL=data (the output will be much larger)? By the way, I see that you are using mlx4_0, which I assume is a ConnectX-3. Do you have another HCA (mlx5) on which to try your build? I'm asking because I was not able to reproduce the hang on our setups, but those are all mlx5 now, so I'm wondering whether this is related to CX3.

@devreal
Contributor Author

devreal commented Apr 24, 2019

I have tried it on a ConnectX-4 node in the same system with the same build (mlx5):

$ mpirun -n 1 ibstat
CA 'mlx5_0'
	CA type: MT4119
	Number of ports: 1
	Firmware version: 16.24.1000
	Hardware version: 0
	Node GUID: 0x506b4b0300ddf066
	System image GUID: 0x506b4b0300ddf066
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 100
		Base lid: 372
		LMC: 0
		SM lid: 507
		Capability mask: 0x2659e848
		Port GUID: 0x506b4b0300ddf066
		Link layer: InfiniBand

I'm attaching the output gathered with UCX_LOG_LEVEL=data. Interestingly, the application makes some progress if the data verbosity is enabled: I observed up to 100 iterations, though most of the time fewer than that. This run made 10 iterations before it got stuck, at which point I aborted it because UCX keeps writing output and filling up the log. There should still be output from the locked-up state in it, though.
output.tar.gz

@vspetrov

vspetrov commented May 6, 2019

Possibly related to the software atomics implementation in UCX. The hang is reproduced with UCX_TLS=self,ud. It is not reproduced with self,rc. Strangely, it is also not reproduced with UCX_TLS=ud (no "self").
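
For anyone trying to reproduce this, the transport selection can be forced from the command line, e.g. (same reproducer binary as above):

$ mpirun -n 2 -N 1 -x UCX_TLS=self,ud ./mpi_fetch_op_local_remote   # hangs
$ mpirun -n 2 -N 1 -x UCX_TLS=self,rc ./mpi_fetch_op_local_remote   # runs to completion
$ mpirun -n 2 -N 1 -x UCX_TLS=ud ./mpi_fetch_op_local_remote        # runs to completion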

@devreal
Contributor Author

devreal commented May 30, 2019

@yosefe Thanks for the fix (strangely, I didn't get an email notification). I will give it a try as soon as possible. Since this problem occurs on master too, there should probably be a PR against master as well?

@yosefe
Contributor

yosefe commented May 31, 2019

@devreal yes - #6687

@gpaulsen
Member

This has been cherry-picked into both master and v4.0.x.
Closing.
