mpirun 5.0.5 - TCP connection failure between hosts with multiple network interfaces #13155
Comments
Try something like this:
mpirun --prtemca prte_if_exclude 192.168.122.0/24 --pmixmca ptl_base_if_exclude 192.168.122.0/24 --mca btl_tcp_if_exclude 192.168.122.0/24 ...
The pmixmca entry may not be required, but can't hurt. The PRRTE param name isn't the same as in the older OMPI versions, so you may have picked up a stale one.
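To make the CIDR form of the exclude value concrete, here is a minimal sketch of what "interface address falls inside 192.168.122.0/24" means for IPv4. The function name `ipv4_in_subnet` is hypothetical and is not part of Open MPI, PRRTE, or PMIx; it only illustrates the mask-and-compare idea and makes no claim about how those projects implement it.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not an Open MPI/PRRTE API): returns true if the IPv4
 * address `addr` lies inside the network `net`/`prefix`. Returns false if
 * either string is not a valid dotted-quad address. */
bool ipv4_in_subnet(const char *addr, const char *net, unsigned prefix)
{
    struct in_addr a, n;
    uint32_t mask;

    assert(prefix <= 32);
    if (1 != inet_pton(AF_INET, addr, &a) || 1 != inet_pton(AF_INET, net, &n)) {
        return false;
    }
    /* Shifting a 32-bit value by 32 is undefined, so /0 is handled separately. */
    mask = (0 == prefix) ? 0 : (0xFFFFFFFFu << (32 - prefix));
    /* Compare the network parts in host byte order. */
    return (ntohl(a.s_addr) & mask) == (ntohl(n.s_addr) & mask);
}
```

With the addresses from this thread, `ipv4_in_subnet("192.168.1.12", "192.168.122.0", 24)` is false, so an exclude of 192.168.122.0/24 would not be expected to remove the ens7 address on ucc-h2.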
Thanks for answering so quickly! I tried these two:
mpirun --prtemca prte_if_exclude 192.168.122.0/24 --pmixmca ptl_base_if_exclude 192.168.122.0/24 --mca btl_tcp_if_exclude 192.168.122.0/24 --mca plm_base_verbose 100 --debug-daemons --mca oob_base_verbose 100 -n 1 --host ucc-h2 hostname
mpirun --prtemca prte_if_exclude 192.168.122.0/24 --pmixmca ptl_base_if_exclude 192.168.122.0/24 --mca btl_tcp_if_exclude 192.168.122.0/24 --mca oob_tcp_if_exclude 192.168.122.0/24 --prtemca oob_tcp_if_exclude 192.168.122.0/24 --mca plm_base_verbose 100 --debug-daemons --mca oob_base_verbose 100 -n 1 --host ucc-h2 hostname
In both, lines like these appear:
I'm looking into it - there definitely is a bug in the include/exclude code path, but it's taking me a bit to track it down.
Okay, I tracked it down - the fix is going into the upstream master branch. I'll backport it to the release branch, so it should be in the next OMPI release. If you need it sooner, you could look at openpmix/prrte#2170. If you take that diff, you can apply it to the "3rd-party/prrte" directory in the OMPI release. Should fix the problem (did for my test case).
Actually, that patch won't apply. Hang on until I backport it and then that patch will. I'll post it here when ready.
Here is the diff you want - apply it to the "3rd-party/prrte" directory:
diff --git a/src/mca/oob/tcp/oob_tcp_component.c b/src/mca/oob/tcp/oob_tcp_component.c
index e915198f95..b497b3819b 100644
--- a/src/mca/oob/tcp/oob_tcp_component.c
+++ b/src/mca/oob/tcp/oob_tcp_component.c
@@ -21,7 +21,7 @@
* Copyright (c) 2017 IBM Corporation. All rights reserved.
* Copyright (c) 2020 Amazon.com, Inc. or its affiliates. All Rights
* reserved.
- * Copyright (c) 2021-2024 Nanook Consulting All rights reserved.
+ * Copyright (c) 2021-2025 Nanook Consulting All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
@@ -386,7 +386,8 @@ static int tcp_component_register(void)
return PRTE_SUCCESS;
}
-static char **split_and_resolve(char **orig_str, char *name);
+static void split_and_resolve(char **orig_str, char *name,
+ char ***interfaces);
static int component_available(void)
{
@@ -408,12 +409,12 @@ static int component_available(void)
* subnet+mask
*/
if (NULL != prte_if_include) {
- interfaces = split_and_resolve(&prte_if_include,
- "include");
+ split_and_resolve(&prte_if_include,
+ "include", &interfaces);
including = true;
} else if (NULL != prte_if_exclude) {
- interfaces = split_and_resolve(&prte_if_exclude,
- "exclude");
+ split_and_resolve(&prte_if_exclude,
+ "exclude", &interfaces);
}
/* if we are the master, then check the interfaces for loopbacks
@@ -504,7 +505,7 @@ static int component_available(void)
pmix_net_get_hostname((struct sockaddr *) &my_ss),
(AF_INET == my_ss.ss_family) ? "V4" : "V6");
PMIX_ARGV_APPEND_NOSIZE_COMPAT(&prte_mca_oob_tcp_component.ipv4conns,
- pmix_net_get_hostname((struct sockaddr *) &my_ss));
+ pmix_net_get_hostname((struct sockaddr *) &my_ss));
} else if (AF_INET6 == my_ss.ss_family) {
#if PRTE_ENABLE_IPV6
pmix_output_verbose(10, prte_oob_base_framework.framework_output,
@@ -513,7 +514,7 @@ static int component_available(void)
pmix_net_get_hostname((struct sockaddr *) &my_ss),
(AF_INET == my_ss.ss_family) ? "V4" : "V6");
PMIX_ARGV_APPEND_NOSIZE_COMPAT(&prte_mca_oob_tcp_component.ipv6conns,
- pmix_net_get_hostname((struct sockaddr *) &my_ss));
+ pmix_net_get_hostname((struct sockaddr *) &my_ss));
#endif // PRTE_ENABLE_IPV6
} else {
pmix_output_verbose(10, prte_oob_base_framework.framework_output,
@@ -547,6 +548,9 @@ static int component_available(void)
PMIX_ARGV_APPEND_NOSIZE_COMPAT(&prte_mca_oob_tcp_component.if_masks, string);
pmix_list_append(&prte_mca_oob_tcp_component.local_ifs, &(copied_interface->super));
}
+ if (NULL != interfaces) {
+ PMIX_ARGV_FREE_COMPAT(interfaces);
+ }
if (0 == PMIX_ARGV_COUNT_COMPAT(prte_mca_oob_tcp_component.ipv4conns)
#if PRTE_ENABLE_IPV6
@@ -1091,40 +1095,43 @@ void prte_mca_oob_tcp_component_failed_to_connect(int fd, short args, void *cbda
* (a.b.c.d/e), resolve them to an interface name (Currently only
* supporting IPv4). If unresolvable, warn and remove.
*/
-static char **split_and_resolve(char **orig_str, char *name)
+static void split_and_resolve(char **orig_str, char *name,
+ char ***interfaces)
{
pmix_pif_t *selected_interface;
- int i, n, ret, match_count, interface_count;
- char **argv, **interfaces, *str, *tmp;
+ int i, n, ret, match_count;
+ bool found;
+ char **argv, *str, *tmp;
char if_name[IF_NAMESIZE];
struct sockaddr_storage argv_inaddr, if_inaddr;
uint32_t argv_prefix;
/* Sanity check */
if (NULL == orig_str || NULL == *orig_str) {
- return NULL;
+ return;
}
argv = PMIX_ARGV_SPLIT_COMPAT(*orig_str, ',');
if (NULL == argv) {
- return NULL;
+ return;
}
- interface_count = 0;
- interfaces = NULL;
for (i = 0; NULL != argv[i]; ++i) {
if (isalpha(argv[i][0])) {
/* This is an interface name. If not already in the interfaces array, add it */
- for (n = 0; n < interface_count; n++) {
- if (0 == strcmp(argv[i], interfaces[n])) {
- break;
+ found = false;
+ if (NULL != interfaces) {
+ for (n = 0; NULL != interfaces[n]; n++) {
+ if (0 == strcmp(argv[i], *interfaces[n])) {
+ found = true;
+ break;
+ }
}
}
- if (n == interface_count) {
+ if (!found) {
pmix_output_verbose(20,
prte_oob_base_framework.framework_output,
"oob:tcp: Using interface: %s ", argv[i]);
- PMIX_ARGV_APPEND_NOSIZE_COMPAT(&interfaces, argv[i]);
- ++interface_count;
+ PMIX_ARGV_APPEND_NOSIZE_COMPAT(interfaces, argv[i]);
}
continue;
}
@@ -1168,29 +1175,33 @@ static char **split_and_resolve(char **orig_str, char *name)
/* Go through all interfaces and see if we can find a match */
match_count = 0;
PMIX_LIST_FOREACH(selected_interface, &pmix_if_list, pmix_pif_t) {
- pmix_ifindextoaddr(selected_interface->if_kernel_index,
- (struct sockaddr*) &if_inaddr,
- sizeof(if_inaddr));
- if (pmix_net_samenetwork((struct sockaddr_storage*) &argv_inaddr,
+ ret = pmix_ifkindextoaddr(selected_interface->if_kernel_index,
+ (struct sockaddr*) &if_inaddr,
+ sizeof(if_inaddr));
+ if (PMIX_SUCCESS == ret &&
+ pmix_net_samenetwork((struct sockaddr_storage*) &argv_inaddr,
(struct sockaddr_storage*) &if_inaddr,
argv_prefix)) {
/* We found a match. If it's not already in the interfaces array,
add it. If it's already in the array, treat it as a match */
match_count = match_count + 1;
- pmix_ifindextoname(selected_interface->if_kernel_index, if_name, sizeof(if_name));
- for (n = 0; n < interface_count; n++) {
- if (0 == strcmp(if_name, interfaces[n])) {
- break;
+ pmix_ifkindextoname(selected_interface->if_kernel_index, if_name, sizeof(if_name));
+ found = false;
+ if (NULL != interfaces) {
+ for (n = 0; NULL != interfaces[n]; n++) {
+ if (0 == strcmp(if_name, *interfaces[n])) {
+ found = true;
+ break;
+ }
}
}
- if (n == interface_count) {
+ if (!found) {
pmix_output_verbose(20,
prte_oob_base_framework.framework_output,
"oob:tcp: Found match: %s (%s)",
pmix_net_get_hostname((struct sockaddr*) &if_inaddr),
if_name);
- PMIX_ARGV_APPEND_NOSIZE_COMPAT(&interfaces, if_name);
- ++interface_count;
+ PMIX_ARGV_APPEND_NOSIZE_COMPAT(interfaces, if_name);
}
}
}
@@ -1206,14 +1217,15 @@ static char **split_and_resolve(char **orig_str, char *name)
free(tmp);
}
- /* Mark the end of the interface name array with NULL */
- if (NULL != interfaces) {
- interfaces[interface_count] = NULL;
- }
+ // cleanup and construct output string
free(argv);
free(*orig_str);
- *orig_str = PMIX_ARGV_JOIN_COMPAT(interfaces, ',');
- return interfaces;
+ if (NULL != interfaces) {
+ *orig_str = PMIX_ARGV_JOIN_COMPAT(*interfaces, ',');
+ } else {
+ *orig_str = NULL;
+ }
+ return;
}
/* OOB TCP Class instances */
diff --git a/src/runtime/prte_mca_params.c b/src/runtime/prte_mca_params.c
index 69477eefaa..085fc141cb 100644
--- a/src/runtime/prte_mca_params.c
+++ b/src/runtime/prte_mca_params.c
@@ -17,7 +17,7 @@
* Copyright (c) 2014-2018 Research Organization for Information Science
* and Technology (RIST). All rights reserved.
* Copyright (c) 2017 IBM Corporation. All rights reserved.
- * Copyright (c) 2021-2024 Nanook Consulting. All rights reserved.
+ * Copyright (c) 2021-2025 Nanook Consulting All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
@@ -198,7 +198,7 @@ int prte_register_params(void)
"open" failing is not printed */
pmix_show_help("help-oob-tcp.txt", "include-exclude", true,
prte_if_include, prte_if_exclude);
- return PRTE_ERR_NOT_AVAILABLE;
+ return PRTE_ERR_SILENT;
}
prte_set_max_sys_limits = NULL;
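The diff above changes split_and_resolve from returning the interface list to filling a caller-owned `char ***` out-parameter. A minimal sketch of that pattern follows, using plain malloc/realloc instead of the PMIx argv macros. The helper names `list_contains` and `list_append_unique` are hypothetical and do not exist in PRRTE; this is an illustration of the pattern under those assumptions, not the patched code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: true if `name` is already in the NULL-terminated
 * string array that `list` points to. */
bool list_contains(char ***list, const char *name)
{
    if (NULL == list || NULL == *list) {
        return false;
    }
    /* Written as (*list)[n]: in C, *list[n] parses as *(list[n]) because
     * indexing binds tighter than dereference. */
    for (size_t n = 0; NULL != (*list)[n]; n++) {
        if (0 == strcmp(name, (*list)[n])) {
            return true;
        }
    }
    return false;
}

/* Hypothetical helper: append a copy of `name` unless it is already present.
 * Returns 0 on success (including the already-present case), -1 if an
 * allocation fails. The caller owns and eventually frees the array. */
int list_append_unique(char ***list, const char *name)
{
    size_t count = 0, len;
    char **grown, *copy;

    assert(NULL != list && NULL != name);
    if (list_contains(list, name)) {
        return 0;
    }
    if (NULL != *list) {
        while (NULL != (*list)[count]) {
            count++;
        }
    }
    len = strlen(name) + 1;
    copy = malloc(len);
    if (NULL == copy) {
        return -1;
    }
    memcpy(copy, name, len);
    /* Room for the existing entries, the new one, and the NULL terminator. */
    grown = realloc(*list, (count + 2) * sizeof(char *));
    if (NULL == grown) {
        free(copy);
        return -1;
    }
    grown[count] = copy;
    grown[count + 1] = NULL;
    *list = grown;
    return 0;
}
```

A caller would start with `char **ifs = NULL;` and pass `&ifs`, which mirrors how the diff passes `&interfaces` from component_available.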
Thanks!
Maybe? Honestly don't know as I've never tried it. However, your only other options are to (a) build/use an external copy of PRRTE, which is a little more involved since you'd need copies of hwloc, libevent, and PMIx available for it, or (b) wait for the next OMPI release.
Looks like this patch will be included in the upcoming Open MPI v5.0.8 release. Thanks @rhc54!
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
$ prte_info --all | head
PRTE: 3.0.6rc12025-03-17
PRTE repo revision: 2025-03-17
PRTE release date: @PMIX_RELEASE_DATE@
PMIx: OpenPMIx 5.0.3rc1 (PMIx Standard: 4.2, Stable ABI: 0.0, Provisional ABI: 0.0)
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Downloaded the source code from https://www.open-mpi.org/. Then:
Please describe the system on which you are running
Operating system/version: Ubuntu 22.04.4 LTS
Computer hardware: x86_64
Network type:
I have two VM nodes on separate physical machines:
node 1 is named ucc-h2:
node 2 is named ucc-h5:
ens3 and enp5s1 don't ping each other. They are used for management (mostly ssh). The two ens7 interfaces are connected through a router so they can ping each other.
nc works both ways: nc -l <port num> on ucc-h2 and nc -N ucc-h2 <port num> on ucc-h5 works fine, and vice versa.
ssh <hostname> works without requiring a password on both hosts.
amir@ucc-h2:~$ ip route get 192.168.3.11
192.168.3.11 via 192.168.1.1 dev ens7 src 192.168.1.12 uid 1000
    cache
amir@ucc-h5:~$ ip route get 192.168.1.12
192.168.1.12 via 192.168.3.1 dev ens7 src 192.168.3.11 uid 1000
    cache
Yeah, I think that's it. Please let me know if I'm missing something; I'll be happy to provide more info.
Details of the problem
I am trying to get mpirun -n 1 --host <hostname> hostname to work on both hosts.
amir@ucc-h2:~$ mpirun -n 1 --host ucc-h5 hostname
ucc-h5
So it's working fine on ucc-h2. But on ucc-h5:
It's timing out.
I ran the commands on both ucc-h2 and ucc-h5 with high verbosity to compare them:
The full outputs are really long so I've included them in separate files in this gist.
But in summary...
In both cases (the successful one and the failing one) the remote daemon tries to establish a connection to the master node using both interfaces. The connection using the wrong interface (ens3@ucc-h2 or enp5s0@ucc-h5) times out after a couple of retries:
Then it tries the other interface (which is the right one):
Here is, as far as I could tell, where things differ depending on which node is the master node, causing the asymmetric behaviour.
If the master node is ucc-h2:
The connection is established.
If the master node is ucc-h5:
It finds both addresses down.
I also dug a little in network traffic and found this on ens7@ucc-h2 when ucc-h5 is the master node (the failing case):
Looks like ucc-h2 is trying to talk to ucc-h5 through its ens7 interface but with the source IP of its ens3 interface! I don't have enough experience with networking to know how this could happen. This was actually a surprise to me. I don't know if this is the root cause or is a symptom of another issue.
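One general way a packet can leave through one interface while carrying another interface's source IP is when the application binds the socket to a specific local address before connecting: the source address is then fixed by bind(), while the kernel still picks the outgoing interface from the routing table by destination. Whether that is what happens in this case is an assumption, not a diagnosis. The sketch below uses hypothetical helper names (`tcp_socket_from`, `local_addr_of`) and only standard POSIX socket calls to show the mechanism.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create a TCP socket whose source address is pinned to
 * `src_ip` (ephemeral port). Returns the fd, or -1 on failure. A later
 * connect() keeps this source address regardless of the egress interface
 * the routing table selects. */
int tcp_socket_from(const char *src_ip)
{
    struct sockaddr_in src;
    int fd;

    assert(NULL != src_ip);
    memset(&src, 0, sizeof(src));
    src.sin_family = AF_INET;
    src.sin_port = 0; /* let the kernel choose the port */
    if (1 != inet_pton(AF_INET, src_ip, &src.sin_addr)) {
        return -1;
    }
    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }
    if (0 != bind(fd, (struct sockaddr *) &src, sizeof(src))) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: write the local (source) address of `fd` into `buf`.
 * Returns 0 on success, -1 on failure. */
int local_addr_of(int fd, char *buf, size_t len)
{
    struct sockaddr_in local;
    socklen_t sl = sizeof(local);

    if (0 != getsockname(fd, (struct sockaddr *) &local, &sl)) {
        return -1;
    }
    return (NULL != inet_ntop(AF_INET, &local.sin_addr, buf, (socklen_t) len)) ? 0 : -1;
}
```

Comparing the address reported by `local_addr_of` with the `src` field of `ip route get <destination>` is one way to check whether a socket's source address agrees with the route the kernel would use.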
I know that there are issues about Open MPI behaving oddly when a host has multiple interfaces, such as #5818 and #12232, but I can't find my answer there.
I have tried all sorts of if_include/if_exclude flags on multiple MCAs (opal, oob, prte, etc.), using both interface names and CIDR notation as parameters. It's likely I have made a mistake, so please let me know how it is properly done; I'm open to suggestions. For example, I tried this, which made the most sense to me:
But it didn't change the outcome; it's still trying both interfaces.
This is the furthest I've been able to go. I appreciate any hints or directions for investigating this issue further. I haven't been able to reproduce/isolate it on the network side because all the tools that I know work normally. The issue only appears when using mpirun.