Deadlock with UCX when performing MPI_Fetch_and_op #6546
Comments
@vspetrov, can you please take a look?
@devreal I was not able to reproduce the hang with OMPI v4.0.x (latest UCX v1.5.x or UCX v1.6.x), nor with OMPI master (again with both versions of UCX, 1.5.x and 1.6.x). However, the test revealed a bug in osc_ucx_fetch_and_op (both in ompi-master and v4.0.x): there was a crash on a NULL origin_addr with MPI_NO_OP (the last MPI_Fetch_and_op in the test case, right after the main loop; see the sketch below). The fixes for that issue for master and v4.0.x are ready: #6599 and #6600. In order to clarify the hang:
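For context, a minimal sketch of an MPI_Fetch_and_op call of the shape described above (this is not the actual test code; the function, variable names, and datatype are assumptions), where the origin buffer is passed as NULL because MPI_NO_OP does not read it:

```c
#include <mpi.h>
#include <stdint.h>

/* Sketch: atomically fetch the int64 value at displacement 0 on the target
 * rank without modifying it. With MPI_NO_OP the origin buffer is never
 * accumulated, so NULL is commonly passed for origin_addr. Assumes the
 * window is in a passive-target epoch (e.g. opened with MPI_Win_lock_all). */
static int64_t fetch_current_value(MPI_Win win, int target_rank)
{
    int64_t result;
    MPI_Fetch_and_op(NULL, &result, MPI_INT64_T,
                     target_rank, 0, MPI_NO_OP, win);
    MPI_Win_flush(target_rank, win);
    return result;
}
```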
@vspetrov Thanks for looking into this issue. I tried again with #6600 but had no luck; both processes are still stuck. The fetch-and-op after the loop is never actually reached. If I run with a version of Open MPI built without UCX support, the test runs fine:
I will try again with
I see the same behavior for
@devreal could you please run with a "debug" build of UCX 1.6 and "-x UCX_LOG_LEVEL=debug"? How many HCAs does each node have? Could you also try setting UCX_NET_DEVICES explicitly (e.g. UCX_NET_DEVICES=mlx5_0:1)?
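A command line along the lines being requested might look like the following (a sketch only; the binary name, process layout, and device are placeholders, not taken from this report):

```sh
# Export UCX debug logging and pin the HCA/port explicitly
# (mlx5_0:1 is just an example; check what `ucx_info -d` lists).
mpirun -np 2 -N 1 \
       -x UCX_LOG_LEVEL=debug \
       -x UCX_NET_DEVICES=mlx5_0:1 \
       ./mpi_fetch_op_local_remote
```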
Here is the config summary (not sure the line
There is only one HCA per node:
I'm attaching the output of a run with that command line. Here is another backtrace taken while the processes are locked up:
Interestingly, if I resume the execution and take another backtrace, the first process seems locked in
HTH. Please let me know if I can provide anything else :)
Couldn't find anything suspicious in the log at this log level. @devreal, could we try again with UCX_LOG_LEVEL=data (the output will be much larger)? Btw, I see that you are using mlx4_0, so I assume it is ConnectX-3. Do you have another HCA (mlx5) to try your build on? I'm asking since I was not able to reproduce the hang on our setups, but those are all mlx5 now. So I'm just wondering if this is related to CX3.
I have tried it on a ConnectX-4 node in the same system with the same build (mlx5):
I'm attaching the output gathered with
Possibly related to the software atomic implementation in UCX. The hang is reproduced with UCX_TLS=self,ud. It is not reproduced with self,rc. Strangely, it is also not reproduced with UCX_TLS=ud (no "self").
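For reference, a sketch of how those transport selections can be passed on the command line (the binary name is a placeholder):

```sh
# Reported to hang: UD transport plus self
mpirun -np 2 -x UCX_TLS=self,ud ./mpi_fetch_op_local_remote

# Reported not to hang: RC transport plus self
mpirun -np 2 -x UCX_TLS=self,rc ./mpi_fetch_op_local_remote
```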
@yosefe Thanks for the fix (strangely, I didn't get an email notification). I will give it a try as soon as possible. Since this problem occurs on master too, there should probably be a PR against master as well?
This has been cherry-picked into both master and v4.0.x.
I have configured Open MPI 4.0.1 using Open UCX 1.5 and with IB verbs disabled (both Open MPI and Open UCX were compiled with `--with-debug`). I'm running a benchmark that performs a bunch of `MPI_Fetch_and_op`, with the target rank waiting for all operations by other ranks to finish (by waiting on an ibarrier) and randomly performing local updates using `MPI_Fetch_and_op`. I'm attaching the code; it's the same that is used in #6536.

Running with 2 ranks on 2 nodes of our IB cluster, the application gets stuck in the first `MPI_Fetch_and_op`. DDT reports:

The upper process is the process writing to the target; the second process (`mpi_fetch_op_local_remote.c:74`) is the target performing local updates.

The example code: mpi_fetch_op_local_remote.tar.gz
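A rough sketch of the access pattern described above (this is not the attached reproducer; names, iteration counts, and the random-update logic are assumptions): non-target ranks issue `MPI_Fetch_and_op` at the target, all ranks enter an `MPI_Ibarrier`, and the target intermixes local `MPI_Fetch_and_op` updates while polling the barrier.

```c
#include <mpi.h>
#include <stdint.h>
#include <stdlib.h>

/* Sketch only: mimics the described pattern, not the attached code. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int target = 0;
    int64_t *baseptr;
    MPI_Win win;
    MPI_Win_allocate(sizeof(int64_t), sizeof(int64_t),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &baseptr, &win);
    *baseptr = 0;
    MPI_Win_lock_all(0, win);
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank != target) {
        /* Non-target ranks: a bunch of fetch-and-add operations at the target. */
        for (int i = 0; i < 1000; ++i) {
            int64_t one = 1, prev;
            MPI_Fetch_and_op(&one, &prev, MPI_INT64_T, target, 0, MPI_SUM, win);
            MPI_Win_flush(target, win);
        }
    }

    /* All ranks enter a non-blocking barrier; the target keeps performing
     * random local updates through the window while waiting for it. */
    MPI_Request req;
    MPI_Ibarrier(MPI_COMM_WORLD, &req);
    int done = 0;
    while (!done) {
        if (rank == target && (rand() % 2)) {
            int64_t one = 1, prev;
            MPI_Fetch_and_op(&one, &prev, MPI_INT64_T, target, 0, MPI_SUM, win);
            MPI_Win_flush(target, win);
        }
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);
    }

    /* Final fetch with MPI_NO_OP and a NULL origin buffer, as discussed
     * in the comments above. */
    int64_t value;
    MPI_Fetch_and_op(NULL, &value, MPI_INT64_T, target, 0, MPI_NO_OP, win);
    MPI_Win_flush(target, win);

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```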
Build with:
Run with:
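A typical build and run for the attached reproducer might look like the following (a sketch; the source and binary names are taken from the attachment name, the flags and layout are assumptions):

```sh
# Build (assumed flags)
mpicc -g -O2 -o mpi_fetch_op_local_remote mpi_fetch_op_local_remote.c

# Run 2 ranks on 2 nodes, selecting the UCX pml and one-sided components
mpirun -np 2 -N 1 --mca pml ucx --mca osc ucx ./mpi_fetch_op_local_remote
```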
Things work without problems using the OpenIB adapter.
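For comparison runs, a hedged sketch of steering around UCX at runtime so the openib/ob1 path is used instead (the comparison described above used a build without UCX support; the binary name is assumed):

```sh
# Exclude the UCX components via MCA component selection
mpirun -np 2 -N 1 --mca pml ^ucx --mca osc ^ucx ./mpi_fetch_op_local_remote
```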
Please let me know if I can provide more information. I hope the reproducer is helpful for someone.