-
Notifications
You must be signed in to change notification settings - Fork 900
SM OSC component initialization crash #9833
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Hmm, as far as performance goes osc/rdma and osc/sm should be about the same for an all on-node communicator. Not sure which one is preferable though. osc/rdma does not win in the allocate shared case though (since it doesn't support the function to get the pointers). As for the error handing, yes, it is bad. I had been meaning to rework it then I didn't have time for it. I can probably help look tomorrow if you want. |
Yeah, something's also broken in allocate shared. I haven't figure out what. If you want to start disagreeing with #9835, that would be a better use of time; I think I know how to fix these issues. |
The error is due to wrong code in static int opal_str_to_bool_impl(const char *string, bool *interp)
{
const char *ptr = string;
/* Trim leading whitespace */
while (isspace(*ptr)) {
++ptr;
}
if ('\0' != *ptr) {
if (isdigit(*ptr)) {
*interp = (bool) atoi(ptr);
} else if (0 == strncasecmp(ptr, "yes", 3) || 0 == strncasecmp(ptr, "true", 4)) {
*interp = true;
} else if (0 == strncasecmp(ptr, "no", 2) && 0 == strncasecmp(ptr, "false", 5)) { <--- BUG HERE !
*interp = false;
} else {
*interp = false;
return OPAL_ERR_BAD_PARAM;
}
}
return OPAL_SUCCESS;
} |
Pr's merged to master and v5.0.x. Can we close this? |
In both master and v5.0.x, there's a crash if the SM OSC component is selected. The end result is a messed up stack, but tracking through the code, there are multiple issues that need to be fixed:
in SM OSC's
component_select()
, there are numerous places thatgoto error
without ensuring thatret
is set to something other thanOMPI_SUCCESS
. The error handling frees the module and the window initialization code then calls into the module (which is random memory) and it all goes south from there. We need to audit to ensure that all error paths in the function properly return an error code.We're experiencing an error in
component_select()
because the call toopal_info_get_bool(... "alloc_shared_noncontig", ...)
fails. I haven't looked at why.This generally doesn't pop up because the RDMA OSC component has higher priority than the SM OSC component. I'm not sure if this is really the right choice or not; we need to discuss that as well.
Marking this as a blocker while we sort out what we should do.
The text was updated successfully, but these errors were encountered: