
TCP BTL wireup fails when the networking is "strange", as seen on BigRed. #57

Closed
@ompiteam

The "new" TCP wireup code introduced in r17450 fails on BigRed (PPC64); see the MTT permalink: http://www.open-mpi.org/mtt/index.php?do_redir=846. As best I can tell, the problem is caused by the IP alias setup on BigRed's global ethernet device (eth1 and eth1:1). To run over ethernet on BigRed, you need the following MCA parameters, because the other ethernet device (eth0) is only wired to other nodes within a single rack:

-mca oob_tcp_include eth1
-mca pml ob1
-mca btl tcp,self
-mca btl_tcp_if_include eth1

The above works on the 1.2 branch and on the trunk prior to r17450. With an allocation confined to a single rack, -mca btl_tcp_if_include eth0 works on any OMPI version.
IP over Myrinet, via -mca btl_tcp_if_include myri0, also works.

Here is the output of /sbin/ifconfig on one of the compute nodes:

eth0      Link encap:Ethernet  HWaddr 00:11:25:C9:23:96  
          inet addr:10.1.2.156  Bcast:10.1.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:107672277 errors:0 dropped:0 overruns:0 frame:0
          TX packets:38001239 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:12364537258 (11791.7 Mb)  TX bytes:5680957916 (5417.7 Mb)
          Interrupt:33 Memory:a0030000-a0040000 

eth1      Link encap:Ethernet  HWaddr 00:11:25:C9:23:97  
          inet addr:10.2.2.156  Bcast:10.2.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
          RX packets:263224319 errors:0 dropped:0 overruns:0 frame:0
          TX packets:164937792 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:1801179733170 (1717738.8 Mb)  TX bytes:158800164724 (151443.6 Mb)
          Interrupt:34 Memory:a0010000-a0020000 

eth1:1    Link encap:Ethernet  HWaddr 00:11:25:C9:23:97  
          inet addr:149.165.233.59  Bcast:149.165.233.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
          Interrupt:34 Memory:a0010000-a0020000 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:242182780 errors:0 dropped:0 overruns:0 frame:0
          TX packets:242182780 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:568226732655 (541903.2 Mb)  TX bytes:568226732655 (541903.2 Mb)

myri0     Link encap:Ethernet  HWaddr 00:60:DD:47:D7:1E  
          inet addr:10.4.2.156  Bcast:10.4.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
          RX packets:5693807 errors:0 dropped:0 overruns:0 frame:0
          TX packets:5775378 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:16170960666 (15421.8 Mb)  TX bytes:17390161910 (16584.5 Mb)
          Interrupt:40 
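
To make the alias setup easier to see, here is a small Python sketch that computes the subnet of each interface from the addresses and netmasks in the ifconfig output above. It is only an illustration: the helper names (`subnet_of`, `physical_device`, `addresses_on_device`) are hypothetical, and it does not reproduce how the TCP BTL actually selects or matches addresses. It only shows that the name eth1 covers two addresses on unrelated subnets.

```python
import ipaddress

# Addresses and netmasks copied from the ifconfig output above.
INTERFACES = {
    "eth0": ("10.1.2.156", "255.255.0.0"),
    "eth1": ("10.2.2.156", "255.255.0.0"),
    "eth1:1": ("149.165.233.59", "255.255.255.0"),
    "lo": ("127.0.0.1", "255.0.0.0"),
    "myri0": ("10.4.2.156", "255.255.0.0"),
}


def subnet_of(addr, mask):
    """Return the IPv4 network that an address/netmask pair belongs to."""
    return ipaddress.ip_interface(f"{addr}/{mask}").network


def physical_device(name):
    """Strip an alias suffix such as ':1' to get the underlying device name."""
    return name.split(":", 1)[0]


# Subnet of every interface entry, aliases included.
subnets = {name: subnet_of(*pair) for name, pair in INTERFACES.items()}


def addresses_on_device(device):
    """Collect every interface entry (aliases included) backed by one device."""
    return {
        name: subnets[name]
        for name in INTERFACES
        if physical_device(name) == device
    }


for name, net in addresses_on_device("eth1").items():
    print(f"{name:8s} {INTERFACES[name][0]:16s} {net}")
```

The two entries for eth1 share one hardware address but sit on 10.2.0.0/16 and 149.165.233.0/24 respectively, so code that assumes one address or one subnet per device name could plausibly pick the wrong one. Whether that is the actual failure mode in r17450 is an assumption, not something the output above confirms.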

This is a regression from the 1.2 branch, so I am marking it as critical.
