grpcomm, can't relay #7100


Closed
lisalenorelowe opened this issue Oct 22, 2019 · 5 comments


lisalenorelowe commented Oct 22, 2019

I am having an issue that appears to be the same as #4416.

The error message is copy/pasted below the CPU info.

The version is Open MPI 4.0.0, installed with gcc 4.8.5.

It was installed like this:

 ./configure --prefix=<absolute path to location of installation> --enable-static --with-lsf=/usr/local/lsf/10.1 --with-lsf-libdir=/usr/local/lsf/10.1/linux3.10-glibc2.17-x86_64/lib
make all install

But I don't have the information about how it was downloaded.

I'm running on an HPC cluster with CentOS Linux release 7.7.1908. The error occurred on a compute node with an Intel(R) Xeon(R) CPU E5520 @ 2.27GHz and an Ethernet interconnect.

We use LSF. The code is simple MPI to bundle serial jobs: it reads in a list of commands, sends one to each processor, and each processor runs its command via a system call. The code works most of the time; this message is not typical of our runs. The code is compiled in the same environment as Open MPI and run with a normal mpirun. This run used 50 MPI tasks and spanned 7 nodes.
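For context, the dispatch pattern described (rank 0 reads a command list and farms each line out to worker ranks, which run it with system()) might look roughly like the sketch below. The file name, buffer size, and round-robin scheduling are assumptions for illustration, not the reporter's actual code.

```c
/* Hypothetical sketch of a "command bundler" MPI program.
 * Compile with: mpicc bundler.c -o bundler
 * Run with:     mpirun -np 50 ./bundler
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CMD 1024

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {
        if (rank == 0) fprintf(stderr, "Needs at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    char cmd[MAX_CMD];

    if (rank == 0) {
        /* Rank 0 reads commands (one per line) and sends them
         * round-robin to workers 1..size-1. */
        FILE *fp = fopen("commands.txt", "r");
        int dest = 1;
        while (fp && fgets(cmd, MAX_CMD, fp)) {
            cmd[strcspn(cmd, "\n")] = '\0';
            MPI_Send(cmd, MAX_CMD, MPI_CHAR, dest, 0, MPI_COMM_WORLD);
            dest = dest % (size - 1) + 1;
        }
        if (fp) fclose(fp);
        /* An empty string tells each worker to stop. */
        cmd[0] = '\0';
        for (int w = 1; w < size; w++)
            MPI_Send(cmd, MAX_CMD, MPI_CHAR, w, 0, MPI_COMM_WORLD);
    } else {
        /* Workers loop: receive a command, run it, until told to stop. */
        for (;;) {
            MPI_Recv(cmd, MAX_CMD, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (cmd[0] == '\0') break;
            system(cmd);
        }
    }

    MPI_Finalize();
    return 0;
}
```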

Error message:

[n3o3-5:11224] [[59690,0],0] grpcomm:direct:send_relay proc [[59690,0],4] not running - cannot relay: NOT ALIVE 
User defined signal 2

An internal error has occurred in ORTE:
[[59690,0],0] FORCE-TERMINATE AT Unreachable:-12 - error grpcomm_direct.c(590)
This is something that should be reported to the developers.
@jsquyres (Member)

Have you ever been able to run Open MPI successfully on this cluster?

Also, could you try the recently released v4.0.2? It contains a bunch of good bug fixes.

@lisalenorelowe (Author)

lisalenorelowe commented Oct 22, 2019 via email

@jsquyres (Member)

> We run Open MPI successfully on the cluster all the time. This is a one-time error so far (as far as I know).

Good to know.

> We can try 4.0.2, but since we have installed software, created modules, and written documentation for Open MPI 4.0, it would be very helpful to know what is going on with that version. We have a large, shared cluster, and it is not trivial to upgrade everything immediately to a newer MPI.

Gotcha. It may not be worth it, then.

For your own edification, v4.0.2 is just bug fixes compared to v4.0.0. It's also ABI-compatible with v4.0.0. Hypothetically, that means that you could swap out v4.0.2 behind the scenes and not need to recompile anything or even notify users (no documentation should need to change).

More specifically: if you care to test it, you can just put a v4.0.2 install somewhere (i.e., not fully deploy it to everyone -- but install it somewhere for your personal testing), update your PATH/LD_LIBRARY_PATH (and verify via ldd or somesuch that you're actually linking against the v4.0.2 install), and test out v4.0.2 against your existing applications.
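Concretely, such a side-by-side test might look like the following, starting from an unpacked v4.0.2 source tree. The install prefix and application name here are placeholders; the LSF paths are copied from the original configure line above.

```shell
# Build v4.0.2 into a private prefix (not the cluster-wide install).
./configure --prefix=$HOME/openmpi-4.0.2-test --enable-static \
    --with-lsf=/usr/local/lsf/10.1 \
    --with-lsf-libdir=/usr/local/lsf/10.1/linux3.10-glibc2.17-x86_64/lib
make all install

# Point this shell (only) at the test install.
export PATH=$HOME/openmpi-4.0.2-test/bin:$PATH
export LD_LIBRARY_PATH=$HOME/openmpi-4.0.2-test/lib:$LD_LIBRARY_PATH

# Verify the existing binary now resolves libmpi from the test install;
# ABI compatibility within the 4.0.x series means no recompile is needed.
ldd ./my_bundler | grep libmpi
mpirun --version   # should report 4.0.2
```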

But if this is a one-off error -- I'm afraid I don't have any specific data as to what exactly could have gone wrong here. I see you referred to #4416: I don't have any information more than what the user posted there about what could have been a "misconfiguration" on their cluster. The usual suspects are firewalls/iptables, etc. Sorry I can't do any better than that! 😦

@lisalenorelowe (Author)

lisalenorelowe commented Oct 22, 2019 via email

@jsquyres (Member)

I'll mark this as closed for now. Reply on here if the issue starts happening more frequently.
