Skip to content

coll/han: Fix issue in reduce leading to invalid rbuf pointer #12434

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 27, 2024

Conversation

gkatev
Copy link
Contributor

@gkatev gkatev commented Mar 25, 2024

This adjustment fixes the problem outlined in #11650. I looked through the other collectives in HAN and this seems to be the only spot where this particular discrepancy appears.


For non-node-leader ranks, t->rbuf will not point to valid memory (as far as we know). In the current code, however it's incremented (for non-leader ranks in the same node as the global root) for each task, as if valid.

This has not been a problem until now because this pointer is not actually utilized anywhere, but it will be when #11552 is merged. The reason is that components downstream from HAN, like XHC, are allowed to rely on the passed rbuf pointer being either NULL or pointing to a fully valid buffer.

Fixes #11650.
cc @bosilca.

@gkatev
Copy link
Contributor Author

gkatev commented Mar 25, 2024

Alternative fix:

diff --git a/ompi/mca/coll/han/coll_han_reduce.c b/ompi/mca/coll/han/coll_han_reduce.c
index a208b41d78..d4facdf26c 100644
--- a/ompi/mca/coll/han/coll_han_reduce.c
+++ b/ompi/mca/coll/han/coll_han_reduce.c
@@ -176,7 +176,7 @@ mca_coll_han_reduce_intra(const void *sbuf,
             t->sbuf = (char *) t->sbuf + extent * t->seg_count;
         }
 
-        if (up_rank == root_up_rank) {
+        if (up_rank == root_up_rank && low_rank == root_low_rank) {
             t->rbuf = (char *) t->rbuf + extent * t->seg_count;
         }
         t->cur_seg = t->cur_seg + 1;

@@ -176,7 +176,7 @@ mca_coll_han_reduce_intra(const void *sbuf,
t->sbuf = (char *) t->sbuf + extent * t->seg_count;
}

if (up_rank == root_up_rank) {
if (up_rank == root_up_rank && NULL != t->rbuf) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm should this be if (root_low_rank == low_rank) instead?

Reading L142-L148, only the root and node leaders should have valid rbuf.

Copy link
Contributor Author

@gkatev gkatev Mar 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say no, because the root has a fully sized rbuf, while non-root node leaders only have a temporary buffer that fits two segments (L122, L146). But what we could do is if (up_rank == root_up_rank && low_rank == root_low_rank). Or if (w_rank == root); should also be equivalent. These might indeed be more in tune with the rest of the code (opinions welcome).

Let me take another look at the code, perhaps one or two touchups in related spots would be a good idea, to make everything a bit clearer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

up_rank == root_up_rank is true only for the root of the original reduce call, so the only process that has a user provided output buffer. Thus, it makes sense to move to base pointer as we advance in the segments.

For all leaders, aka. processes that become root for the local reductions, with the exception of the original root we only allocated a temporary buffer with size 2*extent*seg_count, so there is no room to move the pointer around (more than the double buffering in mca_coll_han_reduce_t1_task).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

up_rank == root_up_rank is true only for the root of the original reduce call

-> For the global root, and for the ranks in the same node as it, right?

@wenduwan wenduwan self-requested a review March 25, 2024 21:17
@gkatev
Copy link
Contributor Author

gkatev commented Mar 26, 2024

Alright all done with the second look. I'm warming up to adjusting the PR in favor of this:

@@ -176,7 +182,7 @@ mca_coll_han_reduce_intra(const void *sbuf,
             t->sbuf = (char *) t->sbuf + extent * t->seg_count;
         }
 
-        if (up_rank == root_up_rank) {
+        if (low_rank == root_low_rank && up_rank == root_up_rank) {
             t->rbuf = (char *) t->rbuf + extent * t->seg_count;
         }
         t->cur_seg = t->cur_seg + 1;

Also, RE possible clearing up of related spots, here is one suggestion (I don't feel too strongly about it):

@@ -139,12 +139,18 @@ mca_coll_han_reduce_intra(const void *sbuf,
                          "[%d]: root_low_rank %d root_up_rank %d\n", w_rank, root_low_rank,
                          root_up_rank));
 
-    void *tmp_rbuf = rbuf;
+    void *tmp_rbuf = NULL;
     void *tmp_rbuf_to_free = NULL;
-    if (low_rank == root_low_rank && root_up_rank != up_rank) {
-        /* allocate 2 segments on node leaders that are not the global root */
-        tmp_rbuf = malloc(2*extent*seg_count);
-        tmp_rbuf_to_free = tmp_rbuf;
+    /* node leaders require a buffer to store reduction results */
+    if(low_rank == root_low_rank) {
+        if(up_rank == root_up_rank) {
+            /* the global root already has one */
+            tmp_rbuf = rbuf;
+        } else {
+            /* allocate 2 temporary segments on node leaders that are not the global root */
+            tmp_rbuf = malloc(2*extent*seg_count);
+            tmp_rbuf_to_free = tmp_rbuf;
+        }
     }
 
     /* Create t0 tasks for the first segment */

This is supposed to:

  1. Create a stronger association between the initialization of tmp_rbuf and the discussed fix (if (low_rank == root_low_rank && up_rank == root_up_rank)).
  2. Clearly initialize tmp_rbuf to NULL for non-node-leaders, making L252-L256 more robust.
    if (t->is_tmp_rbuf) {
    tmp_rbuf = (char*)t->rbuf + (next_seg % 2)*(extent * t->seg_count);
    } else if (NULL != t->rbuf) {
    tmp_rbuf = (char*)t->rbuf + extent * t->seg_count;
    }

Please let me know what you think of the above, and I'll adjust the PR (or leave it as is).

@wenduwan
Copy link
Contributor

Thanks for looking closely. Just 2 small suggestions:

  1. In my opinion if (root == w_rank) is more readable than if (low_rank == root_low_rank && up_rank == root_up_rank)
  2. We need a NULL check after tmp_rbuf = malloc(2*extent*seg_count); to ensure success allocation.

Improve handling of the rbuf pointer. Previously, it was possible for the
pointer to become non-NULL, even though it does not point to valid memory.
This becomes a problem for the assurances that open-mpi#11552 tries to introduce.

Signed-off-by: George Katevenis <[email protected]>
@gkatev gkatev force-pushed the han_root_node_rbuf_fix branch from 62f7a8c to 0f0727e Compare March 26, 2024 20:00
@gkatev
Copy link
Contributor Author

gkatev commented Mar 26, 2024

I agree on w_rank being cleaner. I updated the PR to use this approach. I also adjusted tmp_rbuf's initialization to match this, and added error checking for the allocation. Please re-review.

@wenduwan
Copy link
Contributor

Running AWS CI

Copy link
Contributor

@wenduwan wenduwan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change passed AWS CI.

@bosilca bosilca merged commit b5328be into open-mpi:main Mar 27, 2024
@gkatev gkatev deleted the han_root_node_rbuf_fix branch March 30, 2024 19:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Problem with rbuf pointer in HAN's reduce
4 participants