TCP BTL wireup fails when the networking is "strange", as seen on BigRed. #57
Imported from trac issue 1505. Created by timattox on 2008-09-17T16:00:55, last modified: 2010-01-26T11:17:50
Trac comment by timattox on 2008-09-23 11:43:27: Since we (the developers) don't have other machines available to test my theory of IP aliasing as the cause of the failure, we are dropping this to just a major. Hopefully I will get a chance to walk through the code to see how it is failing before we release 1.3.
Trac comment by timattox on 2008-09-24 16:00:36: Note to whoever looks into this: opal_ifinit() in opal/util/if.c may be a place to start adding some debugging output... although it wasn't changed by r17450.
Trac comment by timattox on 2008-10-22 14:58:31: I won't have time to work on this before we release 1.3, so moving it to 1.3.1.
Trac comment by jsquyres on 2008-11-11 13:07:10: FWIW, Jon Mason at Chelsio ran on a cluster with 2 IP aliases and it all seemed to work fine. He ran IMB-MPI1 and it worked fine for him... Here's the interfaces:
|
Trac comment by timattox on 2009-02-10 16:00:15: Based on the previous comment, the title of this ticket is probably wrong. For our BigRed machine, a workaround is to use
Trac comment by bosilca on 2009-07-07 22:34:45: Now that we have the ability to define more precisely what a private IP address is (see #1821), I wonder if this cannot be fixed by providing the correct private addresses on the nodes. I don't have access to BigRed, but if somebody can test the following MCA parameter, this might allow us to close this ticket: --mca opal_net_private_ipv4 "10.1.0.0/16;10.2.0.0/16;172.16.0.0/12;192.168.0.0/16;169.254.0.0/16"
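To make the effect of that parameter value concrete, here is a minimal Python sketch of what a semicolon-separated CIDR list like the one above covers. It is not Open MPI's implementation; the helper names parse_private_nets and is_private are hypothetical, and the sample addresses are made up for illustration rather than taken from BigRed.

```python
import ipaddress

# The value suggested in the comment above for opal_net_private_ipv4.
SPEC = "10.1.0.0/16;10.2.0.0/16;172.16.0.0/12;192.168.0.0/16;169.254.0.0/16"


def parse_private_nets(spec):
    """Split a semicolon-separated CIDR list into network objects (hypothetical helper)."""
    return [ipaddress.ip_network(item.strip())
            for item in spec.split(";") if item.strip()]


def is_private(addr, nets):
    """Return True if addr falls inside any of the listed networks (hypothetical helper)."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in nets)


NETS = parse_private_nets(SPEC)

if __name__ == "__main__":
    # Illustrative addresses only; these are assumptions, not BigRed's real ones.
    for addr in ("10.1.3.4", "10.3.0.1", "172.20.1.1", "8.8.8.8"):
        print(addr, is_private(addr, NETS))
```

Note that with this list an address in 10.3.0.0/16 would not count as private, since only 10.1.0.0/16 and 10.2.0.0/16 are named rather than all of 10.0.0.0/8.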
Trac comment by jjhursey on 2010-01-26 11:17:50: Will most likely not have time to investigate this in the near term. Moving to Future so it does not get lost.
With #7134, this should no longer be a problem. Since BigRed has been dead for a decade, there's no good way to test this specific bug. I'm going to close this ticket, since we believe it's fixed and can't prove it either way.
The "new" TCP wireup code introduced in r17450 fails on BigRed (PPC64); see the MTT permalink (http://www.open-mpi.org/mtt/index.php?do_redir=846). As best I can tell, the problem is caused by the IP alias setup on BigRed's global ethernet device (eth1 and eth1:1). To run over ethernet on BigRed you need to use these MCA parameters, since the other ethernet device is only wired to other nodes within a single rack:
The above works on the 1.2 branch, and on the trunk prior to r17450. If we get an allocation within a single rack, you can successfully use -mca btl_tcp_if_include eth0 on any OMPI version. Also, things work if we use IP over Myrinet via -mca btl_tcp_if_include myri0.
Here is the output of /sbin/ifconfig on one of the compute nodes:
This is a regression from the 1.2 branch, thus I mark this as critical.