SM OSC component initialization crash #9833

Closed
bwbarrett opened this issue Jan 5, 2022 · 4 comments

Comments

@bwbarrett
Member

In both master and v5.0.x, there's a crash if the SM OSC component is selected. The end result is a corrupted stack, but tracing through the code, there are multiple issues that need to be fixed:

  1. In SM OSC's component_select(), there are numerous places that goto error without ensuring that ret is set to something other than OMPI_SUCCESS. The error handling frees the module, the window initialization code then calls into the freed module (which is now random memory), and it all goes south from there. We need to audit the function to ensure that every error path sets a proper error code (see the sketch after this list).

  2. We're experiencing an error in component_select() because the call to opal_info_get_bool(... "alloc_shared_noncontig", ...) fails. I haven't looked at why.
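
A minimal sketch of the failure mode in item 1 (hypothetical code, not the actual osc/sm source; allocate_module() and something_failed() are stand-ins):

static int component_select_sketch(void)
{
    int ret = OMPI_SUCCESS;
    void *module = allocate_module();   /* hypothetical allocation */

    if (something_failed()) {
        /* The bug: paths that jump to error without setting ret return
           OMPI_SUCCESS from a failure. The fix is to set ret first. */
        ret = OMPI_ERR_OUT_OF_RESOURCE;
        goto error;
    }

    return OMPI_SUCCESS;

error:
    free(module);   /* module is freed here, so the caller must see an
                       error, or window setup dereferences freed memory */
    return ret;
}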

This generally doesn't pop up because the RDMA OSC component has higher priority than the SM OSC component. I'm not sure if this is really the right choice or not; we need to discuss that as well.
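
Since selection normally picks osc/rdma, the crash only reproduces when osc/sm is forced via the MCA selection parameter (./win_test is a placeholder for any one-sided test program):

mpirun --mca osc sm -n 2 ./win_test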

Marking this as a blocker while we sort out what we should do.

@hjelmn
Member

hjelmn commented Jan 5, 2022

Hmm, as far as performance goes, osc/rdma and osc/sm should be about the same for an all-on-node communicator. Not sure which one is preferable. osc/rdma does not win in the allocate shared case, though, since it doesn't support the function to get the pointers.
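
For context, the "function to get the pointers" is presumably MPI_Win_shared_query, which returns the local address of another rank's segment of a window created with MPI_Win_allocate_shared; only a shared-memory-capable OSC component can service it for direct load/store access. A minimal sketch:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Communicator of ranks that share a node */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    /* Each rank contributes one int to a shared-memory window */
    int *base;
    MPI_Win win;
    MPI_Win_allocate_shared(sizeof(int), sizeof(int), MPI_INFO_NULL,
                            node_comm, &base, &win);

    /* The call in question: get a pointer to rank 0's segment */
    MPI_Aint size;
    int disp_unit;
    int *rank0_ptr;
    MPI_Win_shared_query(win, 0, &size, &disp_unit, &rank0_ptr);

    MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}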

As for the error handling, yes, it is bad. I had been meaning to rework it but didn't have time. I can probably help look tomorrow if you want.

@bwbarrett
Member Author

> Hmm, as far as performance goes, osc/rdma and osc/sm should be about the same for an all-on-node communicator. Not sure which one is preferable. osc/rdma does not win in the allocate shared case, though, since it doesn't support the function to get the pointers.
>
> As for the error handling, yes, it is bad. I had been meaning to rework it but didn't have time. I can probably help look tomorrow if you want.

Yeah, something's also broken in allocate shared. I haven't figured out what.

If you want to start disagreeing with #9835, that would be a better use of time; I think I know how to fix these issues.

@albandil

albandil commented Feb 5, 2022

> 2. We're experiencing an error in component_select() because the call to opal_info_get_bool(... "alloc_shared_noncontig", ...) fails. I haven't looked at why.

The error is due to a bug in opal_str_to_bool_impl (see the snippet from current master below). For a string to evaluate as false, it would have to match both "no" and "false", which is impossible. The "alloc_shared_noncontig" key in question is normally "false", so in the current state of the code it falls through to OPAL_ERR_BAD_PARAM.

static int opal_str_to_bool_impl(const char *string, bool *interp)
{
    const char *ptr = string;

    /* Trim leading whitespace */
    while (isspace(*ptr)) {
        ++ptr;
    }

    if ('\0' != *ptr) {
        if (isdigit(*ptr)) {
            *interp = (bool) atoi(ptr);
        } else if (0 == strncasecmp(ptr, "yes", 3) || 0 == strncasecmp(ptr, "true", 4)) {
            *interp = true;
        } else if (0 == strncasecmp(ptr, "no", 2) && 0 == strncasecmp(ptr, "false", 5)) { /* <-- BUG HERE: condition can never be true */
            *interp = false;
        } else {
            *interp = false;
            return OPAL_ERR_BAD_PARAM;
        }
    }
    return OPAL_SUCCESS;
}
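
Presumably the fix is to mirror the "yes"/"true" branch and use || instead of &&:

        } else if (0 == strncasecmp(ptr, "no", 2) || 0 == strncasecmp(ptr, "false", 5)) {
            *interp = false;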

@awlauria
Contributor

awlauria commented Mar 31, 2022

PRs merged to master and v5.0.x. Can we close this?

awlauria reopened this Mar 31, 2022
awlauria closed this as completed Apr 1, 2022