mpirun 5.0.5 - TCP connection failure between hosts with multiple network interfaces #13155

Closed
amjal opened this issue Mar 18, 2025 · 9 comments

Comments

@amjal

amjal commented Mar 18, 2025

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

$ ompi_info --version
Open MPI v5.0.5
$ prte_info --all | head
                    PRTE: 3.0.6rc12025-03-17
      PRTE repo revision: 2025-03-17
       PRTE release date: @PMIX_RELEASE_DATE@
                    PMIx: OpenPMIx 5.0.3rc1 (PMIx Standard: 4.2, Stable ABI:
                          0.0, Provisional ABI: 0.0)

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Downloaded the source code from https://www.open-mpi.org/. Then:

$ ./configure --with-cuda=/usr/local/cuda-12.6 --with-gdrcopy
$ sudo make -j install 
$ sudo ldconfig 

Please describe the system on which you are running

  • Operating system/version: Ubuntu 22.04.4 LTS

  • Computer hardware: x86_64

  • Network type:
    I have two VM nodes on separate physical machines:

    • node 1 is named ucc-h2:

      • interface ens3, 192.168.122.15/24, is a virtual interface connected to a libvirt bridge
      • interface ens7, 192.168.1.12/24, is a host interface assigned to the node using PCI passthrough.
    • node 2 is named ucc-h5:

      • interface enp5s1, 192.168.122.195/24, is a virtual interface connected to a libvirt bridge
      • interface ens7, 192.168.3.11/24, is a host interface assigned to the node using PCI passthrough.

ens3 and enp5s1 don't ping each other. They are used for management (mostly ssh). The two ens7 interfaces are connected through a router so they can ping each other.

  • Other relevant networking stuff:
    • nc works both ways: nc -l <port num> on ucc-h2 and nc -N ucc-h2 <port num> on ucc-h5 works fine, and vice versa.
    • ssh <hostname> works without requiring password on both hosts.
    • amir@ucc-h2:~$ ip route get 192.168.3.11
         192.168.3.11 via 192.168.1.1 dev ens7 src 192.168.1.12 uid 1000
         cache
    •  amir@ucc-h5:~$ ip route get 192.168.1.12
            192.168.1.12 via 192.168.3.1 dev ens7 src 192.168.3.11 uid 1000
             cache
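
As a complement to the nc check, a per-address probe can make explicit which of the listed addresses accept TCP connections from the peer. The following is a minimal sketch and not part of Open MPI; the helper name `can_connect` is hypothetical, and a listener (for example `nc -l <port num>`) is assumed to be running on the target.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        # create_connection() resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False
```

Calling `can_connect("192.168.3.11", <port num>)` on ucc-h2 would be expected to succeed given the routes above, while the 192.168.122.x addresses would be expected to fail because those interfaces cannot reach each other.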

I think that's it, but please let me know if I'm missing something; I'll be happy to provide more info.

Details of the problem

I am trying to get mpirun -n 1 --host <hostname> hostname to work on both hosts.

amir@ucc-h2:~$ mpirun -n 1 --host ucc-h5 hostname
ucc-h5

So it's working fine on ucc-h2. But on ucc-h5:

amir@ucc-h5:~$ mpirun -n 1 --host ucc-h2 hostname
--------------------------------------------------------------------------
PRTE has lost communication with a remote daemon.

  HNP daemon   : [prterun-ucc-h5-196668@0,0] on node ucc-h5
  Remote daemon: [prterun-ucc-h5-196668@0,1] on node ucc-h2

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------

It's timing out.

I ran the commands on both ucc-h2 and ucc-h5 with high verbosity to compare them:

 mpirun --mca plm_base_verbose 100 --debug-daemons --prtemca oob_base_verbose 100 -n 1 --host <hostname> hostname

The full outputs are really long so I've included them in separate files in this gist.

But in summary...

In both cases (the successful one and the failing one), the remote daemon tries to connect to the master node over both interfaces. The attempt over the wrong interface (ens3@ucc-h2 or enp5s0@ucc-h5) times out after a couple of retries:

prte_tcp_peer_try_connect: attempting to connect to proc [prterun-ucc-h2-281015@0,0] on 192.168.122.15:36849 - 1 retries
prte_tcp_peer_try_connect: 192.168.122.15:36849 is down

It then tries the other interface (which is the right one):

prte_tcp_peer_try_connect: attempting to connect to proc [prterun-ucc-h2-281015@0,0] on 192.168.1.12:36849 - 0 retries

Here is, as far as I could tell, where things differ depending on which node is the master node, causing the asymmetric behaviour.
If the master node is ucc-h2:

[ucc-h5:196281] [prterun-ucc-h2-281015@0,1] prte_tcp_peer_try_connect: attempting to connect to proc [prterun-ucc-h2-281015@0,0] on 192.168.1.12:36849 - 0 retries
[ucc-h5:196281] [prterun-ucc-h2-281015@0,1] oob:tcp:peer creating socket to [prterun-ucc-h2-281015@0,0]
[ucc-h5:196281] [prterun-ucc-h2-281015@0,1] waiting for connect completion to [prterun-ucc-h2-281015@0,0] - activating send event
[ucc-h5:196281] [prterun-ucc-h2-281015@0,1] tcp:send_handler called to send to peer [prterun-ucc-h2-281015@0,0]
[ucc-h5:196281] [prterun-ucc-h2-281015@0,1] tcp:send_handler CONNECTING
[ucc-h5:196281] [prterun-ucc-h2-281015@0,1]:tcp:complete_connect called for peer [prterun-ucc-h2-281015@0,0] on socket 36
[ucc-h2:281015] [prterun-ucc-h2-281015@0,0] prte_oob_tcp_listen_thread: incoming connection: (40, 0) 192.168.3.11:59759
[ucc-h5:196281] [prterun-ucc-h2-281015@0,1] tcp_peer_complete_connect: sending ack to [prterun-ucc-h2-281015@0,0]

The connection is established.
If the master node is ucc-h5:

[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] prte_tcp_peer_try_connect: attempting to connect to proc [prterun-ucc-h5-195762@0,0] on 192.168.3.11:47555 - 0 retries
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] oob:tcp:peer creating socket to [prterun-ucc-h5-195762@0,0]
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] waiting for connect completion to [prterun-ucc-h5-195762@0,0] - activating send event
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
  Local host:    ucc-h2
  Remote host:   192.168.3.11
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] tcp:send_handler called to send to peer [prterun-ucc-h5-195762@0,0]
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] tcp:send_handler CONNECTING
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1]:tcp:complete_connect called for peer [prterun-ucc-h5-195762@0,0] on socket 36
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1]-[prterun-ucc-h5-195762@0,0] tcp_peer_complete_connect: connection failed: Connection timed out (110)
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] tcp_peer_close for [prterun-ucc-h5-195762@0,0] sd 36 state CONNECTING
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1]:[oob_tcp_connection.c:1066] connect to [prterun-ucc-h5-195762@0,0]
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] prte_tcp_peer_try_connect: attempting to connect to proc [prterun-ucc-h5-195762@0,0]
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] prte_tcp_peer_try_connect: attempting to connect to proc [prterun-ucc-h5-195762@0,0] on socket -1
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] prte_tcp_peer_try_connect: attempting to connect to proc [prterun-ucc-h5-195762@0,0] on 192.168.122.195:47555 - 1 retries
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] prte_tcp_peer_try_connect: 192.168.122.195:47555 is down
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] prte_tcp_peer_try_connect: attempting to connect to proc [prterun-ucc-h5-195762@0,0] on 192.168.3.11:47555 - 1 retries
[ucc-h2:280555] [prterun-ucc-h5-195762@0,1] prte_tcp_peer_try_connect: 192.168.3.11:47555 is down

It finds both addresses down.
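
Because the full verbose output is long, a small script can condense it into a per-address summary of connection attempts. The sketch below is an illustration only: the regular expressions are inferred from the log excerpts quoted in this comment, and the function name `summarize` is hypothetical.

```python
import re
from collections import defaultdict

# Patterns inferred from the oob_base_verbose excerpts above (an assumption,
# not a documented log format).
ATTEMPT = re.compile(
    r"attempting to connect to proc \S+ on (\d+\.\d+\.\d+\.\d+):(\d+) - (\d+) retries"
)
DOWN = re.compile(r"prte_tcp_peer_try_connect: (\d+\.\d+\.\d+\.\d+):(\d+) is down")


def summarize(lines):
    """Map "ip:port" to the number of connect attempts and whether it was marked down."""
    summary = defaultdict(lambda: {"attempts": 0, "down": False})
    for line in lines:
        m = ATTEMPT.search(line)
        if m:
            summary[f"{m.group(1)}:{m.group(2)}"]["attempts"] += 1
            continue
        m = DOWN.search(line)
        if m:
            summary[f"{m.group(1)}:{m.group(2)}"]["down"] = True
    return dict(summary)
```

Passing the lines of a saved log file to `summarize` yields one entry per address and port, which makes it easier to compare the successful and failing runs side by side.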

I also dug a little in network traffic and found this on ens7@ucc-h2 when ucc-h5 is the master node (the failing case):

[Image: packet capture on ens7@ucc-h2 in the failing case]

Looks like ucc-h2 is trying to talk to ucc-h5 through its ens7 interface but with the source IP of its ens3 interface! I don't have enough experience with networking to know how this could happen. This was actually a surprise to me. I don't know if this is the root cause or is a symptom of another issue.
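
One way to cross-check the capture is to ask the kernel which source address it would choose by default for a given destination, which is the same information `ip route get` reports in its `src` field. The following is a minimal sketch under that assumption; the function name `source_address_for` is hypothetical, and the result only reflects an unbound socket, so it does not by itself establish why the captured packets carry the ens3 address.

```python
import socket


def source_address_for(dest_ip: str, port: int = 9) -> str:
    """Return the local IPv4 address the kernel selects for traffic to dest_ip."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        # connect() on a UDP socket sends no packets; it only fixes the route,
        # so getsockname() then reveals the source address that was chosen.
        s.connect((dest_ip, port))
        return s.getsockname()[0]
```

On ucc-h2, `source_address_for("192.168.3.11")` would be expected to return 192.168.1.12, matching the `ip route get` output above. If the capture shows a different source address, the sending socket was most likely bound explicitly to another local address before connecting.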

I know there are existing issues about Open MPI misbehaving on hosts with multiple interfaces, such as #5818 and #12232, but I can't find my answer there.

I have tried all sorts of if_include/if_exclude flags on multiple MCAs (opal, oob, prte, etc.), using both interface names and CIDR as parameters. It's likely I have made a mistake, so please let me know how it is properly done; I'm open to suggestions. For example, I tried this, which made the most sense to me:

mpirun --mca plm_base_verbose 100 --debug-daemons --prtemca oob_base_verbose 100 --mca oob_tcp_if_exclude 192.168.122.0/24 --prtemca oob_tcp_if_exclude 192.168.122.0/24 -n 1 --host ucc-h2 hostname 

But it didn't change the outcome; it's still trying both interfaces.

This is the furthest I've been able to go. I appreciate any hints or directions for investigating this issue further. I haven't been able to reproduce/isolate it on the network side because all the tools that I know work normally. The issue only appears when using mpirun.

@rhc54
Contributor

rhc54 commented Mar 18, 2025

Try something like this:

mpirun --prtemca prte_if_exclude  192.168.122.0/24 --pmixmca ptl_base_if_exclude  192.168.122.0/24 --mca btl_tcp_if_exclude  192.168.122.0/24 ...

The pmixmca entry may not be required, but can't hurt. The PRRTE param name isn't the same as the older OMPI versions, so you may have picked up a stale one.

@amjal
Author

amjal commented Mar 18, 2025

Thanks for answering so quickly!

I tried these two:

mpirun --prtemca prte_if_exclude  192.168.122.0/24 --pmixmca ptl_base_if_exclude 192.168.122.0/24 --mca btl_tcp_if_exclude  192.168.122.0/24 --mca plm_base_verbose 100 --debug-daemons --mca oob_base_verbose 100 -n 1 --host ucc-h2 hostname
 mpirun --prtemca prte_if_exclude  192.168.122.0/24 --pmixmca ptl_base_if_exclude 192.168.122.0/24 --mca btl_tcp_if_exclude  192.168.122.0/24 --mca oob_tcp_if_exclude 192.168.122.0/24 --prtemca oob_tcp_if_exclude 192.168.122.0/24 --mca plm_base_verbose 100 --debug-daemons --mca oob_base_verbose 100 -n 1 --host ucc-h2 hostname

In both, lines like these appear:

[ucc-h5:200668] [prte-ucc-h5-200668@0,0] oob:tcp:init adding 192.168.122.195 to our list of V4 connections
[ucc-h5:200668] [prte-ucc-h5-200668@0,0] oob:tcp:init adding 192.168.3.11 to our list of V4 connections

[ucc-h2:285546] [prte-ucc-h5-200668@0,1] oob:tcp:init adding 192.168.122.15 to our list of V4 connections
[ucc-h2:285546] [prte-ucc-h5-200668@0,1] oob:tcp:init adding 192.168.1.12 to our list of V4 connections
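
For comparison, the outcome one would expect from excluding 192.168.122.0/24 can be written down directly. This is only an illustration of the intended filtering using Python's standard `ipaddress` module, not how PRRTE implements it; the function name `filter_excluded` is hypothetical.

```python
import ipaddress


def filter_excluded(addresses, excluded_cidrs):
    """Return the addresses that fall outside every excluded CIDR block."""
    nets = [ipaddress.ip_network(cidr) for cidr in excluded_cidrs]
    return [
        addr
        for addr in addresses
        if not any(ipaddress.ip_address(addr) in net for net in nets)
    ]
```

With the addresses from the log lines above, `filter_excluded(["192.168.122.195", "192.168.3.11"], ["192.168.122.0/24"])` keeps only 192.168.3.11, so a working exclude should not report 192.168.122.195 or 192.168.122.15 being added to the connection list.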

@rhc54
Contributor

rhc54 commented Mar 18, 2025

I'm looking into it - there definitely is a bug in the include/exclude code path, but it's taking me a bit to track it down.

@rhc54
Contributor

rhc54 commented Mar 18, 2025

Okay, I tracked it down - fix is going into upstream master branch. I'll backport it to the release branch - should be in next OMPI release. If you need it sooner, you could look at openpmix/prrte#2170. If you take that diff, you can apply it to the "3rd-party/prrte" directory in the OMPI release. Should fix the problem (did for my test case).

@rhc54
Contributor

rhc54 commented Mar 18, 2025

Actually, that patch won't apply. Hang on until I backport it and then that patch will. I'll post it here when ready.

@rhc54
Contributor

rhc54 commented Mar 19, 2025

Here is the diff you want - apply it to the "3rd-party/prrte" directory:

diff --git a/src/mca/oob/tcp/oob_tcp_component.c b/src/mca/oob/tcp/oob_tcp_component.c
index e915198f95..b497b3819b 100644
--- a/src/mca/oob/tcp/oob_tcp_component.c
+++ b/src/mca/oob/tcp/oob_tcp_component.c
@@ -21,7 +21,7 @@
  * Copyright (c) 2017      IBM Corporation.  All rights reserved.
  * Copyright (c) 2020      Amazon.com, Inc. or its affiliates.  All Rights
  *                         reserved.
- * Copyright (c) 2021-2024 Nanook Consulting  All rights reserved.
+ * Copyright (c) 2021-2025 Nanook Consulting  All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -386,7 +386,8 @@ static int tcp_component_register(void)
     return PRTE_SUCCESS;
 }

-static char **split_and_resolve(char **orig_str, char *name);
+static void split_and_resolve(char **orig_str, char *name,
+                              char ***interfaces);

 static int component_available(void)
 {
@@ -408,12 +409,12 @@ static int component_available(void)
      * subnet+mask
      */
     if (NULL != prte_if_include) {
-        interfaces = split_and_resolve(&prte_if_include,
-                                       "include");
+        split_and_resolve(&prte_if_include,
+                  "include", &interfaces);
         including = true;
     } else if (NULL != prte_if_exclude) {
-        interfaces = split_and_resolve(&prte_if_exclude,
-                                       "exclude");
+        split_and_resolve(&prte_if_exclude,
+                          "exclude", &interfaces);
     }

     /* if we are the master, then check the interfaces for loopbacks
@@ -504,7 +505,7 @@ static int component_available(void)
                                 pmix_net_get_hostname((struct sockaddr *) &my_ss),
                                 (AF_INET == my_ss.ss_family) ? "V4" : "V6");
             PMIX_ARGV_APPEND_NOSIZE_COMPAT(&prte_mca_oob_tcp_component.ipv4conns,
-                                    pmix_net_get_hostname((struct sockaddr *) &my_ss));
+                                           pmix_net_get_hostname((struct sockaddr *) &my_ss));
         } else if (AF_INET6 == my_ss.ss_family) {
 #if PRTE_ENABLE_IPV6
             pmix_output_verbose(10, prte_oob_base_framework.framework_output,
@@ -513,7 +514,7 @@ static int component_available(void)
                                 pmix_net_get_hostname((struct sockaddr *) &my_ss),
                                 (AF_INET == my_ss.ss_family) ? "V4" : "V6");
             PMIX_ARGV_APPEND_NOSIZE_COMPAT(&prte_mca_oob_tcp_component.ipv6conns,
-                                    pmix_net_get_hostname((struct sockaddr *) &my_ss));
+                                           pmix_net_get_hostname((struct sockaddr *) &my_ss));
 #endif // PRTE_ENABLE_IPV6
         } else {
             pmix_output_verbose(10, prte_oob_base_framework.framework_output,
@@ -547,6 +548,9 @@ static int component_available(void)
         PMIX_ARGV_APPEND_NOSIZE_COMPAT(&prte_mca_oob_tcp_component.if_masks, string);
         pmix_list_append(&prte_mca_oob_tcp_component.local_ifs, &(copied_interface->super));
     }
+    if (NULL != interfaces) {
+        PMIX_ARGV_FREE_COMPAT(interfaces);
+    }

     if (0 == PMIX_ARGV_COUNT_COMPAT(prte_mca_oob_tcp_component.ipv4conns)
 #if PRTE_ENABLE_IPV6
@@ -1091,40 +1095,43 @@ void prte_mca_oob_tcp_component_failed_to_connect(int fd, short args, void *cbda
  * (a.b.c.d/e), resolve them to an interface name (Currently only
  * supporting IPv4).  If unresolvable, warn and remove.
  */
-static char **split_and_resolve(char **orig_str, char *name)
+static void split_and_resolve(char **orig_str, char *name,
+                              char ***interfaces)
 {
     pmix_pif_t *selected_interface;
-    int i, n, ret, match_count, interface_count;
-    char **argv, **interfaces, *str, *tmp;
+    int i, n, ret, match_count;
+    bool found;
+    char **argv, *str, *tmp;
     char if_name[IF_NAMESIZE];
     struct sockaddr_storage argv_inaddr, if_inaddr;
     uint32_t argv_prefix;

     /* Sanity check */
     if (NULL == orig_str || NULL == *orig_str) {
-        return NULL;
+        return;
     }

     argv = PMIX_ARGV_SPLIT_COMPAT(*orig_str, ',');
     if (NULL == argv) {
-        return NULL;
+        return;
     }
-    interface_count = 0;
-    interfaces = NULL;
     for (i = 0; NULL != argv[i]; ++i) {
         if (isalpha(argv[i][0])) {
             /* This is an interface name. If not already in the interfaces array, add it */
-            for (n = 0; n < interface_count; n++) {
-                if (0 == strcmp(argv[i], interfaces[n])) {
-                    break;
+            found = false;
+            if (NULL != interfaces) {
+                for (n = 0; NULL != interfaces[n]; n++) {
+                    if (0 == strcmp(argv[i], *interfaces[n])) {
+                        found = true;
+                        break;
+                    }
                 }
             }
-            if (n == interface_count) {
+            if (!found) {
                 pmix_output_verbose(20,
                                     prte_oob_base_framework.framework_output,
                                     "oob:tcp: Using interface: %s ", argv[i]);
-                PMIX_ARGV_APPEND_NOSIZE_COMPAT(&interfaces, argv[i]);
-                ++interface_count;
+                PMIX_ARGV_APPEND_NOSIZE_COMPAT(interfaces, argv[i]);
             }
             continue;
         }
@@ -1168,29 +1175,33 @@ static char **split_and_resolve(char **orig_str, char *name)
         /* Go through all interfaces and see if we can find a match */
         match_count = 0;
         PMIX_LIST_FOREACH(selected_interface, &pmix_if_list, pmix_pif_t) {
-            pmix_ifindextoaddr(selected_interface->if_kernel_index,
-                               (struct sockaddr*) &if_inaddr,
-                               sizeof(if_inaddr));
-            if (pmix_net_samenetwork((struct sockaddr_storage*) &argv_inaddr,
+            ret = pmix_ifkindextoaddr(selected_interface->if_kernel_index,
+                                     (struct sockaddr*) &if_inaddr,
+                                     sizeof(if_inaddr));
+            if (PMIX_SUCCESS == ret &&
+                pmix_net_samenetwork((struct sockaddr_storage*) &argv_inaddr,
                                      (struct sockaddr_storage*) &if_inaddr,
                                      argv_prefix)) {
                 /* We found a match. If it's not already in the interfaces array,
                    add it. If it's already in the array, treat it as a match */
                 match_count = match_count + 1;
-                pmix_ifindextoname(selected_interface->if_kernel_index, if_name, sizeof(if_name));
-                for (n = 0; n < interface_count; n++) {
-                    if (0 == strcmp(if_name, interfaces[n])) {
-                        break;
+                pmix_ifkindextoname(selected_interface->if_kernel_index, if_name, sizeof(if_name));
+                found = false;
+                if (NULL != interfaces) {
+                    for (n = 0; NULL != interfaces[n]; n++) {
+                        if (0 == strcmp(if_name, *interfaces[n])) {
+                            found = true;
+                            break;
+                        }
                     }
                 }
-                if (n == interface_count) {
+                if (!found) {
                     pmix_output_verbose(20,
                                         prte_oob_base_framework.framework_output,
                                         "oob:tcp: Found match: %s (%s)",
                                         pmix_net_get_hostname((struct sockaddr*) &if_inaddr),
                                         if_name);
-                    PMIX_ARGV_APPEND_NOSIZE_COMPAT(&interfaces, if_name);
-                    ++interface_count;
+                    PMIX_ARGV_APPEND_NOSIZE_COMPAT(interfaces, if_name);
                 }
             }
         }
@@ -1206,14 +1217,15 @@ static char **split_and_resolve(char **orig_str, char *name)
         free(tmp);
     }

-    /* Mark the end of the interface name array with NULL */
-    if (NULL != interfaces) {
-        interfaces[interface_count] = NULL;
-    }
+    // cleanup and construct output string
     free(argv);
     free(*orig_str);
-    *orig_str = PMIX_ARGV_JOIN_COMPAT(interfaces, ',');
-    return interfaces;
+    if (NULL != interfaces) {
+        *orig_str = PMIX_ARGV_JOIN_COMPAT(*interfaces, ',');
+    } else {
+        *orig_str = NULL;
+    }
+    return;
 }

 /* OOB TCP Class instances */
diff --git a/src/runtime/prte_mca_params.c b/src/runtime/prte_mca_params.c
index 69477eefaa..085fc141cb 100644
--- a/src/runtime/prte_mca_params.c
+++ b/src/runtime/prte_mca_params.c
@@ -17,7 +17,7 @@
  * Copyright (c) 2014-2018 Research Organization for Information Science
  *                         and Technology (RIST).  All rights reserved.
  * Copyright (c) 2017      IBM Corporation.  All rights reserved.
- * Copyright (c) 2021-2024 Nanook Consulting.  All rights reserved.
+ * Copyright (c) 2021-2025 Nanook Consulting  All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -198,7 +198,7 @@ int prte_register_params(void)
          "open" failing is not printed */
         pmix_show_help("help-oob-tcp.txt", "include-exclude", true,
                        prte_if_include, prte_if_exclude);
-        return PRTE_ERR_NOT_AVAILABLE;
+        return PRTE_ERR_SILENT;
     }

     prte_set_max_sys_limits = NULL;

@amjal
Author

amjal commented Mar 19, 2025

Thanks!
I think this patch is against 3.0.8, but the version I have is 3.0.6. Therefore, I am getting "patch does not apply" errors. Is it safe to just download and build 3.0.8 in 3rd-party?

@rhc54
Contributor

rhc54 commented Mar 19, 2025

Maybe? Honestly don't know as I've never tried it. However, your only other options are to (a) build/use an external copy of PRRTE (a little more involved, since you'd need copies of hwloc, libevent, and PMIx available for it), or (b) wait for the next OMPI release.

@jsquyres
Member

Looks like this patch will be included in the upcoming Open MPI v5.0.8 release. Thanks @rhc54!
