Skip to content

ops: embiggen the ops framework #13015

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

hppritcha
Copy link
Member

to take a count argument of type size_t.

related to #12226

related to #9194

to take a count argument of type size_t.

related to open-mpi#12226

related to open-mpi#9194

Signed-off-by: Howard Pritchard <[email protected]>
@hppritcha hppritcha marked this pull request as draft January 3, 2025 17:35
@hppritcha hppritcha marked this pull request as ready for review January 5, 2025 20:54
@hppritcha hppritcha requested review from ggouaillardet and devreal and removed request for ggouaillardet January 5, 2025 20:54
@@ -588,15 +573,23 @@ static inline void ompi_op_reduce(ompi_op_t * op, const void *source,
op->o_func.java_data.object);
return;
}
op->o_func.c_fn(source, target, &count, &dtype);
if (0 == (op->o_flags & OMPI_OP_FLAGS_BIGCOUNT)) {
op->o_func.c_fn(source, target, &count, &dtype);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this code be moved here? We still need to handle larger counts even if the user hasn't provided a big-count callback.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand the comment. What is the this code link referring to? My understanding of the standard is that for user defined reduction functions that if they are expecting to need to handle big count elements they need to define a bigcount aware reduction op.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You removed the blob of code that deals with larger-than INT_MAX elements here: https://github.com/open-mpi/ompi/pull/13015/files#diff-dfd595860dc4d0c9c73114bd8563c0786d063327f6e30955a3e361c3f6098e3dL510-L539

Should this be moved into the code path where we deal with user reduction operators that are not big-count ready? Otherwise we might have overflows if the we want to reduce 4M elements with a "legacy" user reduction operator.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no that was stuff josh added to handle the fact that the ompi_op_reduce was already using a size_t full_count arg for some reason before this PR, but our internal op table methods could only handle int.

ompi/op/op.h Outdated
@@ -507,36 +518,6 @@ static inline void ompi_op_reduce(ompi_op_t * op, const void *source,
MPI_Fint f_dtype, f_count;
int count = full_count;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This smells like UB and I'm worried the compiler might some day use it to remove any checks for INT_MAX. Maybe count needs to be size_t as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'll move count just to the section of code invoking user op_reduce methods that are using the little count interface.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm.. actually the verbiage in the 4.1 standard in section 6.9.5 makes me think we do need to keep that loop thing but only for user "small count" ops. so you're right! good catch.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@devreal check now. check if full_count bigger than INT_MAX and user defined operator is of small count variety, if yes, loop.

Signed-off-by: Howard Pritchard <[email protected]>
}
return;
}
int count = (int)full_count;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is still flaky. If the compiler sees this assignment it can assume (because of UB) that it does not cause an overflow (and thus full_count <= INT_MAX) so this jeopardizes the check below. I'm not even sure we still need count here. When we do need to pass an int instead of size_t we should have a local variable count and the assignment there, only if we know for sure that we don't overflow.

Copy link
Member

@bosilca bosilca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two ways to "address" the big count issue in the op. One is the industrious effort, partially correct, proposed in this PR. The other, is to take advantage of the fact that the developers of op framework were forward looking and decided to handle everything not with a direct call into the op module, but by an intermediary call ompi_op_reduce (and the 3 arg variant) that already take a size_t.

We can indeed have the same level of functionality with minimal changes to the op creation (to add the size_t user defined op), and then a simple change to the ompi_op_reduce to handle it. No need to change the version of of the component, no need to touch a large portion of other people's code, no negative performance impact, and a solution in less than 100 lines of code.

@hppritcha
Copy link
Member Author

Hmmm... okay I will refactor this to not make any(hopefully) changes to the ops framework.

@hppritcha
Copy link
Member Author

closing this and replacing with much simpler PR #13030

@hppritcha hppritcha closed this Jan 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants