grpcomm, can't relay #7100
Have you ever been able to run Open MPI successfully on this cluster? Also, could you try the recently released v4.0.2? It contains a bunch of good bug fixes.
We run Open MPI successfully on the cluster all the time. This is a one-time error so far (as far as I know).

We can try 4.0.2, but since we have installed software, created modules, and written documentation for using Open MPI 4.0, it would be very helpful to know what is going on with that one. We have a large, shared cluster, and it is not trivial to upgrade everything immediately to a newer MPI.
Good to know.
Gotcha. It may not be worth it, then.

For your own edification, v4.0.2 is just bug fixes compared to v4.0.0. It's also ABI-compatible with v4.0.0. In principle, that means you could swap out v4.0.2 behind the scenes and not need to recompile anything or even notify users (no documentation should need to change).

More specifically: if you care to test it, you can just put a v4.0.2 install somewhere (i.e., not fully deploy it to everyone, but install it somewhere for your personal testing), update your PATH/LD_LIBRARY_PATH (and verify via ldd or somesuch that you're actually linking against the v4.0.2 install), and test out v4.0.2 against your existing applications.

But if this is a one-off error, I'm afraid I don't have any specific data as to what exactly could have gone wrong here. I see you referred to #4416: I don't have any information beyond what the user posted there about what could have been a "misconfiguration" on their cluster. The usual suspects are firewalls/iptables, etc. Sorry I can't do any better than that! 😦
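For concreteness, that side-by-side test might look something like the sketch below. The install prefix and the application name are hypothetical placeholders, not anything taken from this cluster:

    # Build and install v4.0.2 into a private prefix (placeholder path)
    ./configure --prefix=$HOME/openmpi-4.0.2-test
    make -j 8 all install

    # Point the environment at the test install
    export PATH=$HOME/openmpi-4.0.2-test/bin:$PATH
    export LD_LIBRARY_PATH=$HOME/openmpi-4.0.2-test/lib:$LD_LIBRARY_PATH

    # Verify you are really picking up the test install
    which mpirun
    mpirun --version
    ldd ./your_mpi_app | grep libmpi   # should resolve into openmpi-4.0.2-test/lib

    # Re-run an existing application against v4.0.2
    mpirun -np 50 ./your_mpi_app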
Okay - thanks for responding, and for letting me know that switching to 4.0.2 won't break anything.
I'll mark this as closed for now. Reply here if the issue starts happening more frequently.
I am having an issue that appears to be the same as #4416. The error message is copied below the CPU info.

The version is Open MPI 4.0.0, installed with gcc 4.8.5. It was installed like this:

But I don't have the information about how it was downloaded.

I'm running on an HPC cluster with CentOS Linux release 7.7.1908, and the error occurred on a compute node with an Intel(R) Xeon(R) CPU E5520 @ 2.27GHz and Ethernet.

We use LSF. The code is simple MPI used to bundle serial jobs: it basically just reads in a list of commands and sends them to the processors, which each call a system command (see the sketch below). The code works most of the time; this message is not typical of our runs. The code is compiled with the same environment Open MPI was compiled with, and run with a normal mpirun. This run used 50 MPI tasks and spanned 7 nodes.
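For reference, here is a minimal sketch of the bundling pattern described above, assuming a hypothetical commands.txt input file and a rank-0 dispatcher; it illustrates the described approach, not the actual code from this report:

    /* Rank 0 reads one shell command per line and hands one to each
     * worker rank; workers run their command via system().
     * The file name and message tag are hypothetical. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAX_CMD 1024

    int main(int argc, char **argv)
    {
        int rank, size, dest;
        char cmd[MAX_CMD];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            FILE *fp = fopen("commands.txt", "r");
            for (dest = 1; dest < size; dest++) {
                if (fp == NULL || fgets(cmd, MAX_CMD, fp) == NULL) {
                    cmd[0] = '\0';                  /* no work left: send empty string */
                } else {
                    cmd[strcspn(cmd, "\n")] = '\0'; /* strip trailing newline */
                }
                MPI_Send(cmd, MAX_CMD, MPI_CHAR, dest, 0, MPI_COMM_WORLD);
            }
            if (fp != NULL) fclose(fp);
        } else {
            MPI_Recv(cmd, MAX_CMD, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (cmd[0] != '\0') {
                system(cmd);                        /* run the serial job */
            }
        }

        MPI_Finalize();
        return 0;
    }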
Error message: