Skip to content

OPAL/MCA/COMMON/UCX: #11632 bugfix - mca string variables registration #11640

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

roiedanino
Copy link
Contributor

@roiedanino roiedanino commented May 3, 2023

Fixing #11632 segfault

The root cause for the segfault:

ompi_info was hitting a segfault while trying to deregister the UCX tls mca-variable after UCX components were already unloaded - resulting in illegal memory access.

Suggested solution

Cherry-pick from a commit that already fixed this issue in v4.1.x: d79d5e8

@github-actions
Copy link

github-actions bot commented May 3, 2023

Hello! The Git Commit Checker CI bot found a few problems with this PR:

c319ccb: OPAL/MCA/COMMON/UCX: changed tls var registration,...

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@roiedanino roiedanino force-pushed the bugfix/opal-ucx-variable-segfault branch from c319ccb to 831c5b0 Compare May 3, 2023 14:05
@janjust
Copy link
Contributor

janjust commented May 5, 2023

@roiedanino could you update this PR?

@janjust
Copy link
Contributor

janjust commented May 5, 2023

And when merged, mind opening up a v5.0 PR please

@janjust janjust requested a review from yosefe May 5, 2023 13:40
@roiedanino roiedanino marked this pull request as draft May 7, 2023 13:17
@roiedanino roiedanino force-pushed the bugfix/opal-ucx-variable-segfault branch 3 times, most recently from beb2894 to 9104fc1 Compare May 7, 2023 15:16
@roiedanino
Copy link
Contributor Author

I've cherry-picked from d79d5e8
@jjhursey

@roiedanino roiedanino marked this pull request as ready for review May 7, 2023 15:21
@roiedanino roiedanino changed the title OPAL/MCA/COMMON/UCX: #11632 bugfix changed tls var registration to mca_base_component_var_register OPAL/MCA/COMMON/UCX: #11632 bugfix - mca string variables registration May 8, 2023
@janjust janjust force-pushed the bugfix/opal-ucx-variable-segfault branch from 9104fc1 to dfdc992 Compare May 9, 2023 20:16
@janjust
Copy link
Contributor

janjust commented May 10, 2023

@yosefe is this good to go?

@@ -5,6 +5,7 @@
* and Technology (RIST). All rights reserved.
* Copyright (c) 2019-2020 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2022 IBM Corporation. All rights reserved.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think we can add this..

@roiedanino roiedanino force-pushed the bugfix/opal-ucx-variable-segfault branch from e42bfc1 to 0ca009b Compare May 15, 2023 08:54
Copy link
Member

@jsquyres jsquyres left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the future, please separate whitespace / style changes from code changes (i.e., have them in different commits). This helps understand the git history easier.

At a minimum, please squash the 2 commits in this PR down to a single commit. I.e., don't have a wrong commit (with the IBM copyright) followed by a "fixup" commit to fix that errant copyright, because then that wrong commit will forever be in Open MPI's git history.

Thanks.

@roiedanino roiedanino force-pushed the bugfix/opal-ucx-variable-segfault branch from 0ca009b to ba0ee36 Compare May 15, 2023 14:32
@jsquyres jsquyres dismissed their stale review May 15, 2023 21:21

The 2 commits were squashed.

@jsquyres jsquyres marked this pull request as draft May 16, 2023 15:10
@jsquyres
Copy link
Member

We discussed this PR on the Tuesday call today. Conclusions:

  1. If doing what @hjelmn suggested is hard / a lot of work, then let's go ahead and merge this PR (since it's like how other places in the code base solve the problem), and defer the real fix to Clean up MCA var de-registration of common components #11680, which doesn't need to block v5.0.0.
  2. If what @hjelmn is easy, sure, let's go ahead and fold it into this PR, merge it, and cherry pick to v5.0.x in time for v5.0.0.

@roiedanino roiedanino marked this pull request as ready for review May 23, 2023 21:37
@janjust janjust force-pushed the bugfix/opal-ucx-variable-segfault branch from ba0ee36 to b210f19 Compare May 25, 2023 15:01
Copy link
Member

@jsquyres jsquyres left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does the commit message say the commit was cherry picked from commit d79d5e8? This commit looks wholly unrelated to d79d5e8 (in the git sense), other than it uses the same technique as d79d5e8. But it's not a git cherry pick.

Please do not use a misleading commit message like that.

@roiedanino
Copy link
Contributor Author

Why does the commit message say the commit was cherry-picked from commit d79d5e8? This commit looks wholly unrelated to d79d5e8 (in the git sense), other than it uses the same technique as d79d5e8. But it's not a git cherry-pick.

Please do not use a misleading commit message like that.

Sorry, this commit was originally cherry-picked from d79d5e8, then @janjust updated it; I'll update the commit message.
@jsquyres

@roiedanino roiedanino force-pushed the bugfix/opal-ucx-variable-segfault branch from b210f19 to 4c9776d Compare May 29, 2023 08:45
@jsquyres jsquyres dismissed their stale review May 30, 2023 14:13

Addressed my objection

@jsquyres
Copy link
Member

This nominally looks correct to me, but I have no good way of testing it. Someone in the UCX community should validate this.

@roiedanino roiedanino force-pushed the bugfix/opal-ucx-variable-segfault branch from 4c9776d to 62f4933 Compare June 1, 2023 16:30
@roiedanino roiedanino force-pushed the bugfix/opal-ucx-variable-segfault branch from 62f4933 to 58fecda Compare June 5, 2023 08:24
@roiedanino roiedanino force-pushed the bugfix/opal-ucx-variable-segfault branch from 58fecda to 32aba0b Compare June 5, 2023 08:29
@janjust
Copy link
Contributor

janjust commented Jun 6, 2023

@roiedanino please open up v5.0 PR

@roiedanino
Copy link
Contributor Author

Porting to v5.0.x is at: #11738
@janjust

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants