Description
The "new" TCP wireup code introduced in r17450 fails on !BigRed (PPC64); see [http://www.open-mpi.org/mtt/index.php?do_redir=846 MTT-permalink]. As best I can tell, the problem is caused by the IP alias setup on !BigRed's global ethernet device (eth1 and eth1:1). To run over ethernet on !BigRed you need to use these MCA parameters, since the other ethernet device (eth0) is only wired to other nodes within a single rack:
-mca oob_tcp_include eth1
-mca pml ob1
-mca btl tcp,self
-mca btl_tcp_if_include eth1
The above works on the 1.2 branch and on the trunk prior to r17450. If the allocation falls within a single rack, -mca btl_tcp_if_include eth0 works on any OMPI version.
Things also work if we use IP over Myrinet via -mca btl_tcp_if_include myri0.
Here is the output of /sbin/ifconfig on one of the compute nodes:
eth0 Link encap:Ethernet HWaddr 00:11:25:C9:23:96
inet addr:10.1.2.156 Bcast:10.1.255.255 Mask:255.255.0.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:107672277 errors:0 dropped:0 overruns:0 frame:0
TX packets:38001239 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:12364537258 (11791.7 Mb) TX bytes:5680957916 (5417.7 Mb)
Interrupt:33 Memory:a0030000-a0040000
eth1 Link encap:Ethernet HWaddr 00:11:25:C9:23:97
inet addr:10.2.2.156 Bcast:10.2.255.255 Mask:255.255.0.0
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:263224319 errors:0 dropped:0 overruns:0 frame:0
TX packets:164937792 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:1801179733170 (1717738.8 Mb) TX bytes:158800164724 (151443.6 Mb)
Interrupt:34 Memory:a0010000-a0020000
eth1:1 Link encap:Ethernet HWaddr 00:11:25:C9:23:97
inet addr:149.165.233.59 Bcast:149.165.233.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
Interrupt:34 Memory:a0010000-a0020000
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:242182780 errors:0 dropped:0 overruns:0 frame:0
TX packets:242182780 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:568226732655 (541903.2 Mb) TX bytes:568226732655 (541903.2 Mb)
myri0 Link encap:Ethernet HWaddr 00:60:DD:47:D7:1E
inet addr:10.4.2.156 Bcast:10.4.255.255 Mask:255.255.0.0
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:5693807 errors:0 dropped:0 overruns:0 frame:0
TX packets:5775378 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:16170960666 (15421.8 Mb) TX bytes:17390161910 (16584.5 Mb)
Interrupt:40
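To make the alias setup easier to see, here is a small Python sketch that folds each alias into its base device and lists the networks behind every device name. The addresses and netmasks are copied from the ifconfig output above; the function names (`networks_by_device`, `aliased_devices`) are hypothetical and exist only for this illustration. It does not reproduce what the OMPI wireup code does; it only shows that "eth1" is the one name on this node that maps to two different networks.

```python
import ipaddress
from collections import defaultdict

# Addresses and netmasks copied from the ifconfig output in this ticket.
INTERFACES = {
    "eth0": ("10.1.2.156", "255.255.0.0"),
    "eth1": ("10.2.2.156", "255.255.0.0"),
    "eth1:1": ("149.165.233.59", "255.255.255.0"),
    "lo": ("127.0.0.1", "255.0.0.0"),
    "myri0": ("10.4.2.156", "255.255.0.0"),
}


def networks_by_device(interfaces):
    """Group each interface's network under its base device name.

    An alias such as "eth1:1" is folded into its base device "eth1".
    (Hypothetical helper, not part of Open MPI.)
    """
    grouped = defaultdict(list)
    for name, (addr, mask) in interfaces.items():
        base = name.split(":", 1)[0]
        net = ipaddress.ip_interface(f"{addr}/{mask}").network
        grouped[base].append(str(net))
    return dict(grouped)


def aliased_devices(interfaces):
    """Return only the devices that carry more than one network."""
    return {
        dev: nets
        for dev, nets in networks_by_device(interfaces).items()
        if len(nets) > 1
    }


if __name__ == "__main__":
    for dev, nets in aliased_devices(INTERFACES).items():
        print(dev, nets)
```

On this node only eth1 is reported, with both the private 10.2.0.0/16 network and the public 149.165.233.0/24 network behind the same device name.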
This is a regression from the 1.2 branch, so I am marking it as critical.