[OpenMP Dialect] Add omp.canonical_loop operation. #65380
This patch continues the work of D147658. It adds the `omp.canonical_loop` operation as the basic building block for everything loop-related in OpenMP, such as worksharing-loop, distribute, loop transformations, etc. In contrast to the current `omp.wsloop` approach:

* Loop-related semantics need to be implemented only once.
* It is composable with OpenMP loop transformations such as unrolling and tiling.
* It is supposed to eventually support non-rectangular loops.
* It supports expressing non-perfectly nested loops.

This patch only adds the MLIR representation; to do something useful, I still have to implement lowering from Flang with at least the DO construct, and lowering to LLVM-IR using the OpenMPIRBuilder.

The pretty syntax currently is

```
omp.canonical_loop $iv in [0, %tripcount) {
  ...
}
```

where `[0, %tripcount)` represents the half-open integer range of an OpenMP logical iteration space. The unbalanced parentheses/brackets and the `0` keyword might not be universally liked. I could think of alternatives such as

```
omp.canonical_loop $iv = range(%tripcount) {
  ...
}
```

Differential Revision: https://reviews.llvm.org/D155765
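For instance, the support for non-perfectly nested loops could be expressed by putting code between two canonical loops. This is a hypothetical sketch using the proposed syntax; the trip counts and body contents are placeholders, not part of this patch:

```mlir
// Hypothetical sketch: a non-perfectly nested 2d loop nest.
omp.canonical_loop %i in [0, %tc1) {
  // code here, before the inner loop, makes the nest non-perfect
  omp.canonical_loop %j in [0, %tc2) {
    // inner loop body
  }
}
```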
Moved this here from https://reviews.llvm.org/D155765, because Phabricator has been unresponsive lately (my reply won't go through for some reason). Previous discussion is in the patch itself.
I don't have more information than what is there in the TR. FWIU, it is @Meinersbur who is driving this work, and he will be best placed to provide clarification on what will be in the OpenMP 6 standard and what the future changes will be.
Maybe a structured block that is not a loop can be represented by a loop with a trip count of 1?
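Under that idea, a plain structured block could be wrapped as a degenerate loop. A hypothetical sketch, assuming the proposed syntax and a constant trip count of 1:

```mlir
// Hypothetical: represent a non-loop structured block as a single-iteration loop.
%c1 = arith.constant 1 : i32
omp.canonical_loop %iv in [0, %c1) {
  // the structured block, executed exactly once
}
```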
So, are we planning to modify a loop in-place when
changes to
If this change is happening in-place on calling the operation
Sure, I will wait for his response about this.
I have not followed your syntax fully, but there is both a partial unroll and a full unroll. In the case of a partial unroll, what is returned will be a
I agree that Here is one possible translation of
In this case, during compilation, after parsing the second line, A second possible translation would be
In this case, after the second line, we have two tiled loops. I personally like the second way of doing things, because it keeps the IR clean and easy to understand, but wanted to know if there were other opinions about this or if I had misunderstood something.
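To illustrate the two alternatives being discussed, here is a hypothetical sketch; the `%cli` handle, the `omp.tile` spelling, and its result handles are assumptions for illustration, not ops from this patch:

```mlir
// Variant 1: the transformation consumes %cli and rewrites the loop
// in place; no new handles are produced.
%cli = omp.canonical_loop %iv in [0, %tc) {
  // loop body
}
omp.tile (%cli) sizes(4)

// Variant 2: the transformation instead returns handles for the
// generated floor and tile loops, so later transformations can refer
// to them while the IR stays explicit.
%floor, %tile = omp.tile (%cli) sizes(4)
```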
Unfortunately, the problem with using nesting instead of the %cli object like this:
is that it does not work with the
where after tiling, the inner loop is unrolled. The MLIR representation would be like this:
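A sketch of what such a handle-based representation could look like; the handle results and clause spellings are assumptions for illustration:

```mlir
// Tile the loop by 4, then fully unroll the generated inner (tile) loop.
%cli = omp.canonical_loop %iv in [0, %tc) {
  // loop body
}
%floor, %tile = omp.tile (%cli) sizes(4)
omp.unroll_full (%tile)
```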
A more useful example than the above (which is just equivalent to partial unrolling by 4) would be a 2d-tiling followed with a

Without the %cli reference, one would need to introduce a language that identifies the loop to be transformed, a la XPath, e.g.

```
omp.unroll_full { // something that describes the second loop in the loop nest
                  // `loop(1)` in https://reviews.llvm.org/D155765#4638411
  omp.tile sizes(4) {
    omp.canonical_loop %iv = [0, %tc) {
      ..
    }
  }
}
```

This scheme is fragile, as the nested code could be transformed itself, e.g. one loop is empty or consists of only one iteration and is optimized away. An existing reference to a

In the OpenMP spec, my original proposal was to allow the user to give those generated loops names (an idea stolen from the xlc compiler) to avoid extensive nesting with chains of transformations. E.g.:
After big discussions on what the namespace of the loopids would be, we settled on the

For loop fission, we are going to add a new notion of "canonical loop sequence" to the specification, analogous to "canonical loop nest". The loop nests built out of
Like with loop nests, you should be allowed to apply another transformation on only a subset of the generated loops. For instance with loop fission:
Syntactically, this would be represented as
I believe the former is the original proposal from @Meinersbur, but Michael can confirm it.
[I previously unintentionally edited https://github.com//pull/65380#issuecomment-1709199229 instead of quoting it -- didn't know that I could even edit other people's comments]
I had previously proposed (and discussed with @shraiysh) an If I understand correctly, the
If we encode the input (and maybe output) loop structures inside the omp ops, it would be easier to see what the ops actually encode as far as loop structure, e.g. loop(loop(loop())) if there are three nested loops, and seq(loop(),loop()) for two sequential loops like in the example later on. The nested version would be:
So this works fine even if the inner loop is unrolled, since the omp.tile would transform the structure of the loop nest.
I don't think it is any more fragile. Recomputing the loop structure in the nesting case should catch this.
Yes, if sequences are allowed, then it becomes more complicated; instead of a list it will be a tree. This tree will still have to be reconstructed by following the use-def chains for doing code generation. There are error conditions that are possible with names that are not possible with nesting, e.g. using the same name in two different ops, missing yield ops, ordering, etc.
If there were loops before or after the omp.fissure (or both), would the implicit ones be listed first and then the ones with omp.yield? It seems a bit complicated to keep track of the various ids and where they come from. FWIW, it is still somewhat hard to see with nesting. If there are two inner loops:
I think both schemes would work if we are limited to trees of loop nests. It seems more a matter of what is more convenient and practical.
Would the omp.canonical_loop, omp.tile etc. live throughout the compilation, or would they be lowered early on in HLFIR/FIR? I am a bit suspicious that if these are present in the IR when running various MLIR transforms, things could easily go wrong. Would it be possible to discover just the canonical loops late, if they are needed for codegen, when going from MLIR->LLVM-IR?
They will live through the entire flow and be lowered by the OpenMP IRBuilder. The OpenMP IRBuilder already supports lowering of canonical loops and some transformations on them. We already use it for collapse and simd. If you could elaborate on the issues you suspect, that would be great.
Thank you for the discussion in the call @jsjodin @kiranchandramohan @Meinersbur! This patch is open for review now.
This is a good point. I don't think there will be a case like this, so we can assume that these operations modify the loops in-place.
Potentially, reordering might be an issue. Transforms might think it would be okay to move code like this:
to:
I'm not sure if this is really an issue depending on where the loop should be generated. Is it at the unroll, or the canonical loop location? Maybe there are ways around this, and maybe there are more complicated cases with multiple loops that are harder to resolve?
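The kind of reordering in question might look like this; a hypothetical sketch where the `omp.unroll` handle form and the side-effecting call are assumptions for illustration:

```mlir
// Before: a side-effecting op sits between the loop definition and the
// transformation that consumes its handle.
%cli = omp.canonical_loop %iv in [0, %tc) {
  // loop body
}
func.call @side_effect() : () -> ()
omp.unroll (%cli)

// After: a transform that only follows the use-def chain might consider
// it safe to sink the call below the unroll. Whether this is correct
// depends on where the loop is actually materialized: at the
// canonical_loop or at the unroll.
%cli2 = omp.canonical_loop %iv in [0, %tc) {
  // loop body
}
omp.unroll (%cli2)
func.call @side_effect() : () -> ()
```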
Thanks @jsjodin, I thought about this and mentioned a comment on the Specifically, this point -
I think this check would be required as a part of the verifier.
If an optimization makes a change that violates the invariant, will the compiler assert? I also had another question. Is legal OpenMP?
Would this be represented as a canonical loop, and what would the MLIR for it look like?
Yes, that will be a compilation failure. @Meinersbur, would it be okay if we do loop transformations within MLIR at the very beginning to avoid such optimization issues? I understand that it is not in line with the aim of using the OpenMPIRBuilder for common OpenMP functionality, but it would ensure correctness.
Not sure if it is legal, but if it is, then the transformed MLIR would be
It is not clear which transformation/optimization would cause this issue. Can this be fixed by:

In general, choosing a flow other than the OpenMPIRBuilder should be the last resort. It will require a discussion outside this pull request involving the LLVM OpenMP folks (who pushed for this flow) and the MLIR team (who agreed to this flow).
Could an optimization that handles loads/stores potentially transform:
into:
My concern is that there is an implicit loop that is not visible; any kind of mem2reg type of transform could potentially be wrong. There is a similar discussion regarding CSE, which could also be an issue here, meaning the add could be hoisted or folded out of the canonical_loop. To add these kinds of operations, where the regions change the execution mode in some way (e.g. changing target, multi-threading, implicit loops, etc.), we have to go through the analyses and optimizations and improve/add interfaces so that these ops are handled correctly. I know this problem might exist for other dialects as well, but I'm wondering if those dialects get lowered before we get to a point where these kinds of scalar transformations become an issue?
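The store/load pattern in question might look like the following; a hypothetical sketch in which the memref ops and constants are assumptions for illustration:

```mlir
// A scalar transform that ignores the implicit loop semantics might
// forward %c0 to the load and fold the increment, which would be wrong
// because the body executes %tc times, not once.
%buf = memref.alloca() : memref<i32>
%c0 = arith.constant 0 : i32
%c1 = arith.constant 1 : i32
memref.store %c0, %buf[] : memref<i32>
omp.canonical_loop %iv in [0, %tc) {
  %v = memref.load %buf[] : memref<i32>
  %s = arith.addi %v, %c1 : i32
  memref.store %s, %buf[] : memref<i32>
}
```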
Regarding the implicit nature of the loop, this is no different from the existing OpenMP worksharing loop operation. We probably have to add the

When a mem2reg transformation happens, the loop-carried values will be propagated and the initial value will be an operand of the loop (this is modelled as

I think it might be good to move this particular discussion as a question to MLIR Discourse, so we can get some input from the experts as well.
This seems like a reasonable approach, since the loop ops don't really execute.
The problem I see with this approach is that ops that might be inserted between the loop transform ops might not have any side effects.
I think it can be made to work. I am just trying to think of potential issues so that we have more confidence that things will work. Reducing the number of invariants is also a goal, because it makes it more difficult to reason about the correctness of code when there are a bunch of extra rules that need to be kept track of.
I'm waiting to see what happens with the CSE discussions. I think it will be fine to use/add/extend interfaces to capture what is going on, as long as the interfaces capture the semantics and we don't use existing interfaces that just happen to have the effect that we want.
The conclusion for CSE (and other optimizations) is to make omp.target IsolatedFromAbove. This seems to be the most practical approach. This is something that may have to be done for other ops as well, if optimizations cause problems because of omp op semantics.
The IsolatedFromAbove trait also makes sense for the omp.canonical_loop operation. We just need to resolve the issue with branching in structured regions with the yields. I have elaborated on this issue here
More discussions here: #67720
Is this still relevant?