-
Notifications
You must be signed in to change notification settings - Fork 901
ops: embiggen the ops framework #13015
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ops: embiggen the ops framework #13015
Conversation
to take a count argument of type size_t. related to open-mpi#12226 related to open-mpi#9194 Signed-off-by: Howard Pritchard <[email protected]>
@@ -588,15 +573,23 @@ static inline void ompi_op_reduce(ompi_op_t * op, const void *source, | |||
op->o_func.java_data.object); | |||
return; | |||
} | |||
op->o_func.c_fn(source, target, &count, &dtype); | |||
if (0 == (op->o_flags & OMPI_OP_FLAGS_BIGCOUNT)) { | |||
op->o_func.c_fn(source, target, &count, &dtype); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this code be moved here? We still need to handle larger counts even if the user hasn't provided a big-count callback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand the comment. What is the this code link referring to? My understanding of the standard is that for user defined reduction functions that if they are expecting to need to handle big count elements they need to define a bigcount aware reduction op.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You removed the blob of code that deals with larger-than INT_MAX elements here: https://github.com/open-mpi/ompi/pull/13015/files#diff-dfd595860dc4d0c9c73114bd8563c0786d063327f6e30955a3e361c3f6098e3dL510-L539
Should this be moved into the code path where we deal with user reduction operators that are not big-count ready? Otherwise we might have overflows if the we want to reduce 4M elements with a "legacy" user reduction operator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no that was stuff josh added to handle the fact that the ompi_op_reduce
was already using a size_t full_count arg for some reason before this PR, but our internal op table methods could only handle int.
ompi/op/op.h
Outdated
@@ -507,36 +518,6 @@ static inline void ompi_op_reduce(ompi_op_t * op, const void *source, | |||
MPI_Fint f_dtype, f_count; | |||
int count = full_count; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This smells like UB and I'm worried the compiler might some day use it to remove any checks for INT_MAX
. Maybe count
needs to be size_t
as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'll move count
just to the section of code invoking user op_reduce methods that are using the little count interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm.. actually the verbiage in the 4.1 standard in section 6.9.5 makes me think we do need to keep that loop thing but only for user "small count" ops. so you're right! good catch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@devreal check now. check if full_count bigger than INT_MAX and user defined operator is of small count variety, if yes, loop.
Signed-off-by: Howard Pritchard <[email protected]>
} | ||
return; | ||
} | ||
int count = (int)full_count; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is still flaky. If the compiler sees this assignment it can assume (because of UB) that it does not cause an overflow (and thus full_count <= INT_MAX
) so this jeopardizes the check below. I'm not even sure we still need count
here. When we do need to pass an int
instead of size_t
we should have a local variable count
and the assignment there, only if we know for sure that we don't overflow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two ways to "address" the big count issue in the op. One is the industrious effort, partially correct, proposed in this PR. The other, is to take advantage of the fact that the developers of op framework were forward looking and decided to handle everything not with a direct call into the op module, but by an intermediary call ompi_op_reduce
(and the 3 arg variant) that already take a size_t.
We can indeed have the same level of functionality with minimal changes to the op creation (to add the size_t user defined op), and then a simple change to the ompi_op_reduce
to handle it. No need to change the version of of the component, no need to touch a large portion of other people's code, no negative performance impact, and a solution in less than 100 lines of code.
Hmmm... okay I will refactor this to not make any(hopefully) changes to the ops framework. |
closing this and replacing with much simpler PR #13030 |
to take a count argument of type size_t.
related to #12226
related to #9194