UCX Fail in osu_put_bibw #9339

Closed
lappazos opened this issue Aug 31, 2021 · 7 comments

Comments

@lappazos

openucx/ucc#284

@gpaulsen
Member

Is this a dup of #8086? Or more specific to UCX?

@jsquyres
Member

Can you please report the full details of the bug here instead of just a link to a closed UCX bug?

@awlauria
Contributor

From the linked defect, this looks like a possibly related osc/ucx failure in MPI_Win_post(). The details from that report:

Configuration

OMPI: v5.0.0a1
MOFED: MLNX_OFED_LINUX-5.4-1.0.3.0
Module: none
Test module: none
Nodes: jazz x12 (ppn=28(x12), nodelist=jazz[12-21,29-30])
 
MTT log:
http://hpcweb.lab.mtl.com/hpc/mtr_scrap/users/mtt/scratch/ucc/20210802_170250_189628_54877_jazz12.swx.labs.mlnx/html/test_stdout_Ccnq4w.txt
 
Cmd:
/hpc/mtr_scrap/users/mtt/scratch/ucc/20210802_170250_189628_54877_jazz12.swx.labs.mlnx/ompi_src/install/bin/mpirun -np 2 --display map --mca coll_ucc_enable 1 --mca coll_ucc_priority 100 -x UCC_TL_UCP_TUNE=allreduce:1 --map-by node --bind-to core /hpc/mtr_scrap/users/mtt/scratch/ucc/20210802_170250_189628_54877_jazz12.swx.labs.mlnx/installs/OSS4/tests/osu_micro_benchmark/osu-micro-benchmarks-5.6.2/mpi/one-sided/osu_put_bibw
 
Output:

# OSU MPI_Put Bi-directional Bandwidth Test v5.6.2
# Window creation: MPI_Win_allocate
# Synchronization: MPI_Win_post/start/complete/wait
# Size      Bandwidth (MB/s)
[jazz13.swx.labs.mlnx:163928] ../../../../opal/mca/common/ucx/common_ucx_wpool.h:526  Error: ucp_atomic_cswap64 failed: -1
[jazz13:00000] *** An error occurred in MPI_Win_post
[jazz13:00000] *** reported by process [1362690049,1]
[jazz13:00000] *** on win ucx window 3
[jazz13:00000] *** MPI_ERR_OTHER: known error not in list
[jazz13:00000] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[jazz13:00000] ***    and MPI will try to terminate your MPI job as well)
+ rc=16

jsquyres added this to the v5.0.0 milestone Oct 7, 2021

@jsquyres
Member

jsquyres commented Oct 7, 2021

FYI @open-mpi/ucx

@awlauria @bwbarrett Is this part of the same bucket of one-sided issues that were discussed on the call this past Tuesday?

jsquyres changed the title from "Fail in osu_put_bibw" to "UCX Fail in osu_put_bibw" Nov 26, 2021

@gpaulsen
Member

gpaulsen commented Mar 3, 2022

@janjust Is this fixed yet?

@janjust
Contributor

janjust commented Mar 3, 2022

@gpaulsen not yet - it's an issue with the MPI_Win_post/start/complete/wait synchronization.
@MamziB is on it

@janjust
Contributor

janjust commented Mar 16, 2022

Fixed with: #10126

janjust closed this as completed Mar 21, 2022