Add/remove OB1 and CUDA progress. #5073

Closed
wants to merge 2 commits into from

bosilca
Member

@bosilca bosilca commented Apr 14, 2018

Provide support for dynamically adding and removing the progress
function for OB1 and CUDA.

This will provide a fix for #4650.

Signed-off-by: George Bosilca [email protected]
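
For readers unfamiliar with the idea, here is a minimal sketch of what dynamically adding and removing a progress function can look like. It is not the code in this PR: `engine_register`, `engine_unregister`, `engine_progress`, `request_start`, and `request_complete` are hypothetical names, and the fixed-size callback table is an assumption made to keep the example standalone. The component registers its callback when its first request becomes pending and unregisters it when the last one completes, so an idle component adds no work to each poll.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a progress engine; not the Open MPI API. */
typedef int (*progress_fn_t)(void);

#define MAX_CALLBACKS 8
static progress_fn_t callbacks[MAX_CALLBACKS];
static size_t num_callbacks = 0;

/* Add a callback to the polled list. Returns 0 on success, -1 if full. */
int engine_register(progress_fn_t fn)
{
    if (num_callbacks == MAX_CALLBACKS) {
        return -1;
    }
    callbacks[num_callbacks++] = fn;
    return 0;
}

/* Remove a callback from the polled list. Returns 0 on success, -1 if absent. */
int engine_unregister(progress_fn_t fn)
{
    for (size_t i = 0; i < num_callbacks; i++) {
        if (callbacks[i] == fn) {
            /* Order does not matter here, so fill the hole with the last entry. */
            callbacks[i] = callbacks[--num_callbacks];
            return 0;
        }
    }
    return -1;
}

/* Poll every registered callback once and return the number of events. */
int engine_progress(void)
{
    int events = 0;
    for (size_t i = 0; i < num_callbacks; i++) {
        events += callbacks[i]();
    }
    return events;
}

/* Hypothetical component: it only needs polling while requests are pending. */
static int pending_requests = 0;

int component_progress(void)
{
    return 1; /* pretend one event completed per poll */
}

void request_start(void)
{
    /* 0 -> 1 transition: start being polled. */
    if (pending_requests++ == 0) {
        engine_register(component_progress);
    }
}

void request_complete(void)
{
    /* Every completion must match an earlier start. */
    assert(pending_requests > 0);
    /* 1 -> 0 transition: stop being polled. */
    if (--pending_requests == 0) {
        engine_unregister(component_progress);
    }
}
```

In Open MPI itself the engine-side calls would presumably be `opal_progress_register` and `opal_progress_unregister`; the sketch avoids them so that it compiles on its own.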

@Akshay-Venkatesh
Contributor

Hi George, apologies for the late response. I see an assertion failure with the patch. It looks like the progress counter is incremented fewer times than it is decremented.

mpirun -np 2 --hostfile /home/akvenkatesh/osu-micro-benchmarks/build-hsw/hostfile --mca btl vader,self,smcuda,openib /home/akvenkatesh/osu-micro-benchmarks/build-hsw/get_local_ompi_rank /home/akvenkatesh/osu-micro-benchmarks/build-hsw/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency D D
# OSU MPI-CUDA Latency Test v5.3
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size          Latency (us)
0                       1.73
1                      18.47
2                      18.49
4                      18.77
8                      18.45
16                     18.58
32                     18.48
64                     18.49
128                    19.80
256                    19.83
512                    19.26
1024                   20.38
2048                   21.00
4096                   22.81
8192                   24.99
osu_latency: ../../../../../ompi/mca/pml/ob1/pml_ob1_progress.c:64: mca_pml_ob1_enable_progress: Assertion `progress_count >= 0' failed.
[hsw226:07774] *** Process received signal ***
[hsw226:07774] Signal: Aborted (6)
[hsw226:07774] Signal code:  (-6)
[hsw226:07774] [ 0] /lib64/libpthread.so.0(+0xf370)[0x2aaaac28a370]
[hsw226:07774] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x2aaaac4cc1d7]
[hsw226:07774] [ 2] /lib64/libc.so.6(abort+0x148)[0x2aaaac4cd8c8]
[hsw226:07774] [ 3] /lib64/libc.so.6(+0x2e146)[0x2aaaac4c5146]
[hsw226:07774] [ 4] /lib64/libc.so.6(+0x2e1f2)[0x2aaaac4c51f2]
[hsw226:07774] [ 5] /home/akvenkatesh/openmpi/bosilca/build/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_enable_progress+0x54)[0x2aaaeb2e6322]
[hsw226:07774] [ 6] /home/akvenkatesh/openmpi/bosilca/build/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_progress+0x1b5)[0x2aaaeb2e6523]
[hsw226:07774] [ 7] /home/akvenkatesh/openmpi/bosilca/build/lib/libopen-pal.so.0(opal_progress+0x30)[0x2aaaad20f4af]
[hsw226:07774] [ 8] /home/akvenkatesh/openmpi/bosilca/build/lib/openmpi/mca_pml_ob1.so(+0xcb6a)[0x2aaaeb2deb6a]
[hsw226:07774] [ 9] /home/akvenkatesh/openmpi/bosilca/build/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv+0x362)[0x2aaaeb2dffe9]
[hsw226:07774] [10] /home/akvenkatesh/openmpi/bosilca/build/lib/libmpi.so.0(MPI_Recv+0x2c0)[0x2aaaabf8592e]
[hsw226:07774] [11] /home/akvenkatesh/osu-micro-benchmarks/build-hsw/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency[0x40155d]
[hsw226:07774] [12] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaaac4b8b35]
[hsw226:07774] [13] /home/akvenkatesh/osu-micro-benchmarks/build-hsw/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency[0x401189]
[hsw226:07774] *** End of error message ***
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 7774 on node hsw226 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
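
One way to narrow down an imbalance like this is to route every counter change through a checked helper that reports the call site instead of aborting. The sketch below is only an illustration under assumptions: `adjust_progress_checked`, `ADJUST_PROGRESS`, and `underflows` are hypothetical names, and the counter is a plain `int` named after the `progress_count` in the assertion above, not the actual variable from the patch.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical counter, named after the one in the failed assertion. */
static int progress_count = 0;
/* Number of decrements that had no matching increment. */
static int underflows = 0;

/*
 * Apply delta (+1 to enable, -1 to disable) and return the resulting count.
 * A decrement that would make the count negative is refused, counted,
 * and logged together with the caller's file and line.
 */
int adjust_progress_checked(int delta, const char *file, int line)
{
    assert(delta == 1 || delta == -1);

    int next = progress_count + delta;
    if (next < 0) {
        underflows++;
        fprintf(stderr, "%s:%d: unmatched decrement, progress_count would be %d\n",
                file, line, next);
        return progress_count;
    }
    progress_count = next;
    return progress_count;
}

/* Convenience wrapper that records the call site automatically. */
#define ADJUST_PROGRESS(delta) adjust_progress_checked((delta), __FILE__, __LINE__)
```

Replacing each increment and decrement with `ADJUST_PROGRESS(+1)` or `ADJUST_PROGRESS(-1)` would let the benchmark run to completion while the log points at the unmatched decrement.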

@ibm-ompi

The IBM CI (GNU Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/6e3054f922a9b26ba73f13daa04ffa03

@ibm-ompi

The IBM CI (XL Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/647cfc633c9868dc2823da2213996842

@ibm-ompi

The IBM CI (PGI Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/157c0bbcb101530d4cc1fe17246e433e

@jjhursey
Member

bot:ibm:retest (CI script broke, should be fixed now)

@awlauria awlauria added this to the v5.0.0 milestone Mar 19, 2020
@awlauria
Contributor

@bosilca were you able to look at the failure?

@lanl-ompi
Contributor

Can one of the admins verify this patch?

@gpaulsen
Member

gpaulsen commented Mar 2, 2021

@bosilca @jladd-mlnx - What's the fate of this PR?

@gpaulsen gpaulsen removed this from the v5.0.0 milestone Aug 26, 2021
@gpaulsen
Member

@bosilca Is this something we could revive for HAN for v5.0.0?

@bosilca
Member Author

bosilca commented Aug 30, 2022

Not for HAN. With the new accelerator framework I am not sure if this is still relevant.

@janjust
Contributor

janjust commented May 19, 2025

This seems really stale, so I am closing it. Please reopen if needed.

@janjust janjust closed this May 19, 2025
9 participants