Skip to content

Problem with rbuf pointer in HAN's reduce #11650

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
gkatev opened this issue May 5, 2023 · 0 comments · Fixed by #12434
Closed

Problem with rbuf pointer in HAN's reduce #11650

gkatev opened this issue May 5, 2023 · 0 comments · Fixed by #12434
Assignees

Comments

@gkatev
Copy link
Contributor

gkatev commented May 5, 2023

Hi, while testing the new HAN+XHC integration method, I came upon a bug/issue in HAN's reduce, in this part of the code:

if (up_rank == root_up_rank) {
t->rbuf = (char *) t->rbuf + extent * t->seg_count;
}

This condition will be true for non-root ranks in the same node as the root. But, for these ranks, rbuf has been previously initialized to NULL. Thus, t->rbuf will be set to something non-NULL, but won't point to valid memory. Later on, this influences the rbuf parameter to coll_reduce:

if (t->is_tmp_rbuf) {
tmp_rbuf = (char*)t->rbuf + (next_seg % 2)*(extent * t->seg_count);
} else if (NULL != t->rbuf) {
tmp_rbuf = (char*)t->rbuf + extent * t->seg_count;
}
t->low_comm->c_coll->coll_reduce((char *) t->sbuf + extent * t->seg_count,
(char *) tmp_rbuf, tmp_count,
t->dtype, t->op, t->root_low_rank, t->low_comm,
t->low_comm->c_coll->coll_reduce_module);

This is a problem when trying to detect (in XHC) if rbuf is valid or not, as this is done by checking if the pointer is NULL.
Related: #11552 and the discussion in #11418

I'm posting this as an issue instead of a PR, with the hopes that someone will take it through the last mile, as I'm not fully sure what the desired fix for this would be, or if there are similar occurrences in other HAN collectives that should also be adjusted. Something like this does fix it:

diff --git a/ompi/mca/coll/han/coll_han_reduce.c b/ompi/mca/coll/han/coll_han_reduce.c
index aae17a21fc..cce681ae2e 100644
--- a/ompi/mca/coll/han/coll_han_reduce.c
+++ b/ompi/mca/coll/han/coll_han_reduce.c
@@ -173,7 +173,7 @@ mca_coll_han_reduce_intra(const void *sbuf,
         /* Setup up t_next_seg task arguments */
         t->cur_task = t_next_seg;
         t->sbuf = (char *) t->sbuf + extent * t->seg_count;
-        if (up_rank == root_up_rank) {
+        if (up_rank == root_up_rank && NULL != t->rbuf) {
             t->rbuf = (char *) t->rbuf + extent * t->seg_count;
         }
         t->cur_seg = t->cur_seg + 1;
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants