[OpenMP Dialect] Add omp.canonical_loop operation. #65380


Closed

Conversation

@shraiysh (Member) commented Sep 5, 2023

This patch continues the work of D147658. It adds the `omp.canonical_loop` operation as the basic building block for everything loop-related in OpenMP, such as worksharing-loop, distribute, loop transformation, etc.

In contrast to the current `omp.wsloop` approach, `omp.canonical_loop`:
 * requires loop-related semantics to be implemented only once
 * is composable with OpenMP loop transformations such as unrolling and tiling
 * is intended to eventually support non-rectangular loops
 * supports expressing non-perfectly nested loops

This patch only adds the MLIR representation; to make it useful, I still have to implement lowering from Flang of at least the DO construct, and lowering to LLVM-IR using the OpenMPIRBuilder.

The pretty syntax currently is
```
omp.canonical_loop $iv in [0, %tripcount) { ... }
```
where `[0, %tripcount)` represents the half-open integer range of an OpenMP logical iteration space. The unbalanced parentheses/brackets and the `0` keyword might not be universally liked. Alternatives I can think of include
```
omp.canonical_loop $iv = range(%tripcount) { ... }
```
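For a concrete picture, a simple unit-step loop `for (int i = 0; i < n; ++i)` could be expressed roughly as follows (a sketch; the `: i32` annotation and the `omp.yield` terminator are assumptions rather than syntax fixed by this patch):
```
%cli = omp.canonical_loop %iv : i32 in [0, %n) {
  // %iv is the logical iteration number; for a unit-step loop starting
  // at 0 it coincides with the user's induction variable i
  omp.yield
}
```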

Differential Revision: https://reviews.llvm.org/D155765
@shraiysh requested a review from a team as a code owner, September 5, 2023 16:45
github-actions bot added labels mlir:core (MLIR Core Infrastructure) and mlir, Sep 5, 2023
@shraiysh (Member, Author) commented Sep 5, 2023

Moved this from https://reviews.llvm.org/D155765 here, because Phabricator has been unresponsive lately (my reply won't go through for some reason). The previous discussion is in the patch itself.

@kiranchandramohan (Contributor)

In D155765#4638411, @shraiysh wrote:
There are two possible approaches for the unrolled operation here -

%unrolled1 = omp.unroll loops(%tiled) at(1)
%unrolled2 = omp.unroll loops(%tiled#1)
I realized that we cannot really do the first approach, because unroll could be present without nested loops. The nesting of OpenMP constructs could help with this. It might require complex indexing though. @kiranchandramohan do we have the details on loop-fission or the loop-modifiers for apply (like intratile)? I checked the technical report document here but it doesn't give much detail about these modifiers.

I don't have more information than what is there in the TR. FWIU, it is @Meinersbur who is driving this work, and will be best placed to provide clarification on what will be in the OpenMP 6 standard and what will be the future changes.

There are two possible approaches for the unrolled operation here -

%unrolled1 = omp.unroll loops(%tiled) at(1)
%unrolled2 = omp.unroll loops(%tiled#1)
The good thing with the first approach, with at(1), is that because we have both loops as input, we transform the inner loop and get the new single loop as output, which can directly be used in other constructs. Note that this is similar to Jan's suggestion about nesting and maintaining indices.
The problem with the second approach is that the input is only the inner loop and the output (%unrolled2) should represent a structured block (not a loop) which should be captured within the loop %tiled#0 somehow. (maybe an omp.merge_canonical_loop_with_body operation?).

Maybe the structured block that is not a loop can be represented by a loop with a trip count of 1?

omp.canonical_loop $iv in [0, %cst_1) { ... }
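For instance, the fully unrolled body could then still be returned as a CLI wrapping a single iteration (a sketch; the constant op and `omp.yield` are assumptions):
```
%cst_1 = arith.constant 1 : i32
%unrolled = omp.canonical_loop %iv in [0, %cst_1) {
  // the fully unrolled body; %iv is always 0
  omp.yield
}
```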

@shraiysh (Member, Author) commented Sep 5, 2023

So, are we planning to modify a loop in-place when omp.unroll is called on that loop, or is this a new canonical loop object that would be used to create a loop nest? That is, while running the operation omp.unroll, do we also make the following change:

[[in memory representation of canonical loop object]]
%tiled#0 = omp.canonical_loop [0, %tc) step = 4
|-[child] %tiled#1 = omp.canonical_loop [%i, %i+4) step = 1

changes to

[[in memory representation of canonical loop object]]
%tiled#0 = omp.canonical_loop [0, %tc) step=4
|-[child] %unrolled = omp.canonical_loop [0, 1) step=1

If this change happens in-place when the operation omp.unroll is called, then we do not need a result on the LHS of omp.unroll. If it does not happen in-place, then we need another operation to generate the second nested loop above. Is this correct, or am I missing something?

@shraiysh (Member, Author) commented Sep 5, 2023

FWIU, it is @Meinersbur who is driving this work, and will be best placed to provide clarification on what will be in the OpenMP 6 standard and what will be the future changes.

Sure, I will wait for his response about this.

@kiranchandramohan (Contributor)

So, are we planning to modify a loop in-place when omp.unroll is called on that loop, or is this a new canonical loop object that would be used to create a loop nest? That is, while running the operation omp.unroll, do we also make the following change:

[[in memory representation of canonical loop object]]
%tiled#0 = omp.canonical_loop [0, %tc) step = 4
|-[child] %tiled#1 = omp.canonical_loop [%i, %i+4) step = 1

changes to

[[in memory representation of canonical loop object]]
%tiled#0 = omp.canonical_loop [0, %tc) step=4
|-[child] %unrolled = omp.canonical_loop [0, 1) step=1

If this change happens in-place when the operation omp.unroll is called, then we do not need a result on the LHS of omp.unroll. If it does not happen in-place, then we need another operation to generate the second nested loop above. Is this correct, or am I missing something?

I have not followed your syntax fully. But there is both a partial unroll and a full unroll. In the case of a partial unroll, what is returned will be a canonical_loop. To keep it consistent, wouldn't we want to return a canonical_loop for the full unroll too?
I was suggesting, in general, for operations (like unroll, tile, interchange) to take in a canonical_loop and also to return a canonical_loop as the result.

@shraiysh (Member, Author) commented Sep 6, 2023

I agree that omp.unroll should result in an omp.canonical_loop object. My concern was how this canonical loop object fits in with the existing loop objects. I will try to rephrase and elaborate my point.

Here is one possible translation of #pragma omp for

%cli = omp.canonical_loop [0, %tc) {...}
%tiled = omp.tiled(%cli) { tile_size = 4 } -> omp.canonical_loop, omp.canonical_loop
omp.unroll(%tiled#1)
omp.wsloop(%tiled#0)

In this case, during compilation, after parsing the second line, %tiled#1 is a child loop of %tiled#0. After the third statement omp.unroll(%tiled#1), the unrolled loop replaces %tiled#1 in-place as a child of %tiled#0. So, when omp.wsloop(%tiled#0) is called, worksharing is applied to the outer tiled loop (%tiled#0) and code is generated for the outer loop with the unrolled loop in the body. In this case, we do not need the omp.unroll operation to return a result, because it modifies the canonical loop info in-place.

A second possible translation would be #pragma omp for

%cli = omp.canonical_loop [0, %tc) {...}
%tiled = omp.tiled(%cli) { tile_size = 4 } -> omp.canonical_loop, omp.canonical_loop
%unrolled = omp.unroll(%tiled#0) -> omp.canonical_loop
%merged_loop = omp.replace(parent=%tiled#0, old_value=%tiled#1, new_value=%unrolled)
omp.wsloop(%merged_loop)

In this case, after the second line, we have two tiled loops. %tiled#0 is the parent of %tiled#1. After the third line, we have an independent loop %unrolled. The next line then creates a nested loop, where it replaces %tiled#1 under %tiled#0 with %unrolled. Now, this %merged_loop is the one that worksharing is applied on. In this case, we need a new operation to somehow compose the CanonicalLoopInfo objects to create new objects.

I personally like the second way of doing things, because it keeps the IR clean and easy to understand, but I wanted to know if there were other opinions about this or if I had misunderstood something.

@Meinersbur (Member) commented Sep 6, 2023

Unfortunately, the problem with nesting instead of the %cli object, like this:

omp.unroll loop(1) {
  omp.tile loop (0) { tile_sizes=[4] } {
    omp.canonical_loop for (%c0) to (%c64) step (%c1) {
      ..
    }
  }
}

is that it does not work with the apply clause. For instance:

#pragma omp tile sizes(4) apply(intratile:unroll full)
for (int i = 0; i < n; ++i) {
   ...
}

where after tiling, the inner loop is unrolled. The MLIR representation would be like this:

%cli = omp.canonical_loop %iv = [0, %tc) {
      ..
}
%grid, %intratile = omp.tile sizes(4) (%cli)
omp.unroll_full (%intratile)

A more useful example than the above (which is just equivalent to partial unrolling by 4), would be a 2d-tiling followed with a simd-ization of one (or both) of the inner loops. Since they are constant-sized, it makes them an ideal target for vectorization.
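For illustration, a sketch of that 2d case in the %cli scheme (how a 2d nest is passed to omp.tile, and an omp.simd that consumes a CLI, are assumptions rather than ops defined by this patch):
```
%cli_i = omp.canonical_loop %i = [0, %tc_i) {
  omp.canonical_loop %j = [0, %tc_j) {
    ..
  }
}
%grid_i, %grid_j, %intra_i, %intra_j = omp.tile sizes(4, 4) (%cli_i)
// the intratile loops have constant trip count 4, ideal for vectorization
omp.simd (%intra_j)
```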

Without the %cli reference, one would need to introduce a language that identifies the loop to be transformed à la XPath, e.g.

omp.unroll_full { something that describes the second loop in the loop nest } // `loop(1)` in https://reviews.llvm.org/D155765#4638411
omp.tile sizes(4) { 
  omp.canonical_loop %iv = [0, %tc) {
        ..
  }
}

This scheme is fragile, as the nested code could itself be transformed, e.g. one loop is empty or consists of only one iteration and is optimized away. An existing reference to a %cli would make such a thing a hard error.
With the addition of loop sequences (see below), it is not just the nest depth, but also which of the loops in a loop sequence. E.g. "the second nested loop in the third loop of the loop sequence which is nested inside another loop".

In the OpenMP spec, my original proposal was to allow the user to give those generated loops names (idea stolen from the xlc compiler) to avoid extensive nesting with chains of transformations. E.g.:

#pragma omp unroll on(mytiledloop)
#pragma omp tile on(myoriginalloop) sizes(4) generates(intratile:mytiledloop)

#pragma omp loopid(myoriginalloop)
for (int i = 0; i < n; ++i) {
   ...
}

After big discussions on what the namespace of the loopids would be, we settled on the apply clause. They are equally powerful, but I think the loopids would make it easier to use, e.g. applying different transformations for each target architecture.


For loop fission, we are going to add a new notion of "canonical loop sequence" to the specification, analogous to "canonical loop nest". The loop nests built out of omp.canonical_loop are implicit, defined by which loops are combined in transformations such as collapse. We can do the same with sequences:

%cli1 = omp.canonical_loop %iv1 = [0, %tc) {
      ..
}
%cli2 = omp.canonical_loop %iv2 = [0, %tc) {
      ..
}
%fused = omp.fuse (%cli1, %cli2)

Like with loop nests, you should be allowed to apply another transformation on only a subset of the generated loops. For instance with loop fission:

%cli1, %cli2 = omp.canonical_loop %iv = [0, %tc) {
      ...
      omp.fissure()
      ...
}
omp.unroll (%cli2)

Syntactically, this would be represented as

#pragma omp fission apply(nothing,unroll)
for (int i = 0; i < n; ++i) {
  ...
  #pragma omp fissure
  ...
}

@kiranchandramohan (Contributor) commented Sep 6, 2023

%cli = omp.canonical_loop [0, %tc) {...}
%tiled = omp.tiled(%cli) { tile_size = 4 } -> omp.canonical_loop, omp.canonical_loop
%unrolled = omp.unroll(%tiled#0) -> omp.canonical_loop
%merged_loop = omp.replace(parent=%tiled#0, old_value=%tiled#1, new_value=%unrolled)
omp.wsloop(%merged_loop)

I believe the former is the original proposal from @Meinersbur. But Michael can confirm it.

@Meinersbur (Member) commented Sep 6, 2023

[I previously unintentionally edited https://github.com//pull/65380#issuecomment-1709199229 instead of quoting it -- didn't know that I could even edit other people's comments]

%cli = omp.canonical_loop [0, %tc) {...}
%tiled = omp.tiled(%cli) { tile_size = 4 } -> omp.canonical_loop, omp.canonical_loop
%unrolled = omp.unroll(%tiled#0) -> omp.canonical_loop
%merged_loop = omp.replace(parent=%tiled#0, old_value=%tiled#1, new_value=%unrolled)
omp.wsloop(%merged_loop)

I believe the former is the original proposal from @Meinersbur. But Michael can confirm it.

I had previously proposed (and discussed with @shraiysh) an omp.execute. omp.canonical_loop would be asynchronous, i.e. like a lambda. To execute the loop, one would need to pass the %cli (or the %cli that is returned after a transformation) to omp.execute, and only then would it execute with the transformations in the chain. I think the asynchronicity would make everything more complex than necessary.

If I understand correctly, the omp.replace operation seems related, but it only delays the application of a transformation until it replaces another loop, not the omp.canonical_loop itself. I don't think the additional complexity is useful; in what cases would we want to apply a transformation but not replace the original loop with it?

@jsjodin (Contributor) commented Sep 7, 2023

Unfortunately, the problem with nesting instead of the %cli object, like this:

omp.unroll loop(1) {
  omp.tile loop (0) { tile_sizes=[4] } {
    omp.canonical_loop for (%c0) to (%c64) step (%c1) {
      ..
    }
  }
}

is that it does not work with the apply clause. For instance:

#pragma omp tile sizes(4) apply(intratile:unroll full)
for (int i = 0; i < n; ++i) {
   ...
}

where after tiling, the inner loop is unrolled.
The MLIR representation would be like this:

%cli = omp.canonical_loop %iv = [0, %tc) {
      ..
}
%grid, %intratile = omp.tile sizes(4) (%cli)
omp.unroll_full (%intratile)

We could encode the input (and maybe output) loop structures inside the omp ops to make it easier to see what the ops actually encode as far as loop structure goes, e.g. loop(loop(loop())) if there are three nested loops, and seq(loop(), loop()) for two sequential loops like in the example later on. The nested version would be:

omp.unroll_full {in_loop_structure = loop(loop()), transform_loop = 1} { // 1 means inner loop, and is just an index since there are no sequences yet.
  omp.tile sizes(4) {in_loop_structure = loop(), transform_loop = 0} {
    omp.canonical_loop {in_loop_structure = () } {
     ....
    }
  }
}

So this works fine even if the inner loop is unrolled, since the omp.tile would transform the structure of the loop nest.
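Making that recomputation explicit with the (hypothetical) out_loop_structure annotation used further below:
```
omp.unroll_full {in_loop_structure = loop(loop()), transform_loop = 1,
                 out_loop_structure = loop()} {   // full unroll removes the inner loop
  omp.tile sizes(4) {in_loop_structure = loop(), transform_loop = 0,
                     out_loop_structure = loop(loop())} { // tiling creates the nest
    omp.canonical_loop {in_loop_structure = ()} {
      ....
    }
  }
}
```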

A more useful example than the above (which is just equivalent to partial unrolling by 4), would be a 2d-tiling followed with a simd-ization of one (or both) of the inner loops. Since they are constant-sized, it makes them an ideal target for vectorization.

Without the %cli reference, one would need to introduce a language that identifies the loop to be transformed à la XPath, e.g.

omp.unroll_full { something that describes the second loop in the loop nest } // `loop(1)` in https://reviews.llvm.org/D155765#4638411
omp.tile sizes(4) { 
  omp.canonical_loop %iv = [0, %tc) {
        ..
  }
}

This scheme is fragile, as the nested code could itself be transformed, e.g. one loop is empty or consists of only one iteration and is optimized away. An existing reference to a %cli would make such a thing a hard error.

I don't think it is any more fragile. Recomputing the loop structure in the nesting case should catch this.

With the addition of loop sequences (see below), it is not just the nest depth, but also which of the loops in a loop sequence. E.g. "the second nested loop in the third loop of the loop sequence which is nested inside another loop".

Yes, if sequences are allowed, then it becomes more complicated; instead of a list it will be a tree. This tree will still have to be reconstructed by following the use-def chains for doing code generation. There are error conditions that are possible with names that are not possible with nesting, e.g. using the same name in two different ops, missing yield ops, ordering, etc.

In the OpenMP spec, my original proposal was to allow the user to give those generated loops names (idea stolen from the xlc compiler) to avoid extensive nesting with chains of transformations. E.g.:

#pragma omp unroll on(mytiledloop)
#pragma omp tile on(myoriginalloop) sizes(4) generates(intratile:mytiledloop)

#pragma omp loopid(myoriginalloop)
for (int i = 0; i < n; ++i) {
   ...
}

After big discussions on what the namespace of the loopids would be, we settled on the apply clause. They are equally powerful, but I think the loopids would make it easier to use, e.g. applying different transformations for each target architecture.

For loop fission, we are going to add a new notion of "canonical loop sequence" to the specification, analogous to "canonical loop nest". The loop nests built out of omp.canonical_loop are implicit, defined by which loops are combined in transformations such as collapse. We can do the same with sequences:

%cli1 = omp.canonical_loop %iv1 = [0, %tc) {
      ..
}
%cli2 = omp.canonical_loop %iv2 = [0, %tc) {
      ..
}
%fused = omp.fuse (%cli1, %cli2)

Like with loop nests, you should be allowed to apply another transformation on only a subset of the generated loops. For instance with loop fission:

%cli1, %cli2 = omp.canonical_loop %iv = [0, %tc) {
      ...
      omp.fissure()
      ...
}
omp.unroll (%cli2)

If there were loops before or after the omp.fissure (or both), would the implicit ones be listed first and then the ones with omp.yield? It seems a bit complicated to keep track of the various ids and where they come from. FWIW, it is still somewhat hard to see with nesting. If there are two inner loops:

omp.canonical_loop { in_loop_structure = seq(loop(), loop()), out_loop_structure = seq(loop(loop()), loop(loop())) } {
  omp.canonical_loop { in_loop_structure =(), out_loop_structure = loop() } {
  }
  omp.fissure()
  omp.canonical_loop {in_loop_structure = (), out_loop_structure = loop() } {
  }
}

I think both schemes would work if we are limited to trees of loop nests. It seems more a matter of what is more convenient and practical.

Syntactically, this would be represented as

#pragma omp fission apply(nothing,unroll)
for (int i = 0; i < n; ++i) {
  ...
  #pragma omp fissure
  ...
}

@jsjodin (Contributor) commented Sep 7, 2023

Would the omp.canonical_loop, omp.tile etc. live throughout the compilation, or would they be lowered early on in HLFIR/FIR? I am suspicious that if these are present in the IR when running various MLIR transforms, things could easily go wrong. Would it be possible to discover just the canonical loops late, if they are needed for codegen, when going from MLIR->LLVM-IR?

@kiranchandramohan (Contributor)

Would the omp.canonical_loop, omp.tile etc. live throughout the compilation, or would they be lowered early on in HLFIR/FIR? I am suspicious that if these are present in the IR when running various MLIR transforms, things could easily go wrong. Would it be possible to discover just the canonical loops late, if they are needed for codegen, when going from MLIR->LLVM-IR?

They will live through the entire flow and be lowered by the OpenMP IRBuilder. The OpenMP IRBuilder already supports lowering of canonical loops and some transforms on it. We already use it for collapse and simd.

If you could elaborate on the issues you suspect, that would be great.

@shraiysh (Member, Author) commented Sep 7, 2023

Thank you for the discussion in the call @jsjodin @kiranchandramohan @Meinersbur! This patch is open for review now.

in what cases would we want to apply a transformation, but not replace the original loop with it?

This is a good point. I don't think there will be a case like this, so we can assume that these operations modify the loops in-place.

@jsjodin (Contributor) commented Sep 7, 2023

Potentially reordering might be an issue. Transforms might think it would be okay to move code like this:

%cli = omp.canonical_loop %iv1 : i32 in [0, %tripcount) {
   store 42, array[%iv1]
}
omp.unroll %cli
%t = load array[10]

to:

%cli = omp.canonical_loop %iv1 : i32 in [0, %tripcount) {
   store 42, array[%iv1]
}
%t = load array[10]
omp.unroll %cli

I'm not sure if this is really an issue; it depends on where the loop should be generated. Is it at the unroll, or at the canonical loop location? Maybe there are ways around this, and maybe there are more complicated cases with multiple loops that are harder to resolve?

@shraiysh (Member, Author) commented Sep 7, 2023

Thanks @jsjodin, I thought about this and mentioned it in a comment on the omp.structured_region RFC here - https://discourse.llvm.org/t/rfc-openmp-adding-omp-structured-region/73228/3

Specifically, this point -

  • All the transforms for an omp.canonical_loop must be declared before another omp.structured_region or any other operation.

I think this check would be required as a part of the verifier.
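A sketch of the ordering such a verifier would enforce (illustrative only; the loads stand in for arbitrary unrelated operations):
```
// accepted: all transforms of %cli precede any unrelated operation
%cli = omp.canonical_loop %iv in [0, %tc) { ... }
omp.unroll(%cli)
%t = llvm.load %p -> i32

// rejected by the verifier: an unrelated operation separates the
// canonical loop from its transform
%cli2 = omp.canonical_loop %iv2 in [0, %tc) { ... }
%t2 = llvm.load %p -> i32
omp.unroll(%cli2)
```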

@jsjodin (Contributor) commented Sep 8, 2023

Thanks @jsjodin, I thought about this and mentioned it in a comment on the omp.structured_region RFC here - https://discourse.llvm.org/t/rfc-openmp-adding-omp-structured-region/73228/3

Specifically, this point -

  • All the transforms for an omp.canonical_loop must be declared before another omp.structured_region or any other operation.

I think this check would be required as a part of the verifier.

If an optimization makes a change that violates the invariant, will the compiler assert?

I also had another question. Is this legal OpenMP?

int x = 5;
#pragma omp unroll full
for (int i = 0; i < 10; i++) {
  x = x + 1;
}
return x;

Would this be represented as a canonical loop, and what would the MLIR for it look like?

@shraiysh (Member, Author)

If an optimization makes a change that violates the invariant, will the compiler assert?

Yes, that will be a compilation failure. @Meinersbur would it be okay if we do loop transforms within MLIR in the very beginning to avoid such optimization issues? I understand that it is not in line with the aim of using OpenMPIRBuilder for common OpenMP stuff, but it would ensure correctness.

I also had another question. Is this legal OpenMP?

#pragma omp unroll full
for (int i = 0; i < 10; i++) {
  x = x + 1;
}
return x;

Would this be represented as a canonical loop, and what would the MLIR for it look like?

Not sure if it is legal, but if it is, then the transformed MLIR would be

%x = llvm.alloca i32
llvm.store i32 5, %x
%cli = omp.canonical_loop {
  omp.structured_region {
    %x1 = llvm.load %x -> i32
    %add1 = arith.add %x1, 1
    llvm.store %add1, %x
    omp.yield
  }
  omp.yield
}
omp.unroll(%cli) {"unroll_type"="full"}

@kiranchandramohan (Contributor)

If an optimization makes a change that violates the invariant, will the compiler assert?

Yes, that will be a compilation failure. @Meinersbur would it be okay if we do loop transforms within MLIR in the very beginning to avoid such optimization issues? I understand that it is not in line with the aim of using OpenMPIRBuilder for common OpenMP stuff, but it would ensure correctness.

It is not clear which transformation/optimization would cause this issue. Can this be fixed by:
-> Performing the transformation at the location of the canonical loop?
-> Adding appropriate side-effects to ensure operations do not come in between the transform operations and the canonical loop?

In general, choosing a flow other than the OpenMPIRBuilder should be the last resort. It will require a discussion outside this Pull Request involving the llvm OpenMP folks (who pushed for this flow) and the MLIR team (who agreed to this flow).

@jsjodin (Contributor) commented Sep 11, 2023

Could an optimization that handles loads/stores potentially transform:

%x = llvm.alloca i32
llvm.store i32 5, %x
%cli = omp.canonical_loop {
  omp.structured_region {
    %x1 = llvm.load %x -> i32
    %add1 = arith.add %x1, 1
    llvm.store %add1, %x
    omp.yield
  }
  omp.yield
}
omp.unroll(%cli) {"unroll_type"="full"}
%x2 = llvm.load %x -> i32
func.return %x2

into:

%x = llvm.alloca i32
%cli = omp.canonical_loop {
  omp.structured_region {
    llvm.store 6, %x
    omp.yield
  }
  omp.yield
}
omp.unroll(%cli) {"unroll_type"="full"}
%x2 = llvm.load %x -> i32
func.return %x2

My concern is that there is an implicit loop that is not visible; any kind of mem2reg type of transform could potentially be wrong. There is a similar discussion regarding CSE, which could also be an issue here, meaning the add could be hoisted or folded out of the canonical_loop. To add these kinds of operations, where the regions change the execution mode in some way (e.g. changing target, multi-threading, implicit loops, etc.), we have to go through the analyses and optimizations and improve/add interfaces so that these ops are handled correctly.

I know this problem might exist for other dialects as well, but I'm wondering if those dialects get lowered before we get to a point where these kinds of scalar transformations become an issue?

@kiranchandramohan (Contributor) commented Sep 11, 2023

My concern is that there is an implicit loop that is not visible; any kind of mem2reg type of transform could potentially be wrong.

Regarding the implicit nature of the loop, this is no different from the existing OpenMP worksharing loop operation. We probably have to add the LoopLikeInterface. But my understanding is that, in general, side-effecting operations are not moved into or out of a region.

When a mem2reg transformation happens, the loop-carried values will be propagated and the initial value will be an operand of the loop (this is modelled as initArg for scf.for and iterArg for the fir.do_loop).
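For reference, a sketch of how scf.for models such a promoted loop-carried value (standard MLIR; %n and %init are assumed to be defined earlier):
```
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%c1_i32 = arith.constant 1 : i32
%sum = scf.for %iv = %c0 to %n step %c1 iter_args(%acc = %init) -> (i32) {
  // %acc carries the promoted value across iterations; %init seeds it
  %next = arith.addi %acc, %c1_i32 : i32
  scf.yield %next : i32
}
```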

I think it might be good to move this particular discussion as a question to MLIR discourse, so we can get some input from the experts as well.

@jsjodin (Contributor) commented Sep 11, 2023

If an optimization makes a change that violates the invariant, will the compiler assert?

Yes, that will be a compilation failure. @Meinersbur would it be okay if we do loop transforms within MLIR in the very beginning to avoid such optimization issues? I understand that it is not in line with the aim of using OpenMPIRBuilder for common OpenMP stuff, but it would ensure correctness.

It is not clear which transformation/optimization would cause this issue. Can this be fixed by:
-> Performing the transformation at the location of the canonical loop?

This seems like a reasonable approach, since the loop ops don't really execute.

-> Adding appropriate side-effects to ensure operations do not come in between the transform operations and the canonical loop?

The problem I see with this approach is that ops that might be inserted between the loop transform ops might not have any side effects.

In general, choosing a flow other than the OpenMPIRBuilder should be the last resort. It will require a discussion outside this Pull Request involving the llvm OpenMP folks (who pushed for this flow) and the MLIR team (who agreed to this flow).

I think it can be made to work. I am just trying to think of potential issues so that we have more confidence that things will work. Reducing the number of invariants is also a goal, because extra rules that need to be kept track of make it more difficult to reason about the correctness of code.

@jsjodin (Contributor) commented Sep 11, 2023

My concern is that there is an implicit loop that is not visible; any kind of mem2reg type of transform could potentially be wrong.
Regarding the implicit nature of the loop, this is no different from the existing OpenMP worksharing loop operation. We probably have to add the LoopLikeInterface. But my understanding is that, in general, side-effecting operations are not moved into or out of a region.

When a mem2reg transformation happens, the loop-carried values will be propagated and the initial value will be an operand of the loop (this is modelled as initArg for scf.for and iterArg for the fir.do_loop).

I think it might be good to move this particular discussion as a question to MLIR discourse, so we can get some input from the experts as well.

I'm waiting to see what happens with the CSE discussions. I think it will be fine to use/add/extend interfaces to capture what is going on as long as the interfaces capture the semantics, and we don't use existing interfaces that just happen to have the effect that we want.

@jsjodin (Contributor) commented Sep 12, 2023

My concern is that there is an implicit loop that is not visible; any kind of mem2reg type of transform could potentially be wrong.
Regarding the implicit nature of the loop, this is no different from the existing OpenMP worksharing loop operation. We probably have to add the LoopLikeInterface. But my understanding is that, in general, side-effecting operations are not moved into or out of a region.

When a mem2reg transformation happens, the loop-carried values will be propagated and the initial value will be an operand of the loop (this is modelled as initArg for scf.for and iterArg for the fir.do_loop).
I think it might be good to move this particular discussion as a question to MLIR discourse, so we can get some input from the experts as well.

I'm waiting to see what happens with the CSE discussions. I think it will be fine to use/add/extend interfaces to capture what is going on as long as the interfaces capture the semantics, and we don't use existing interfaces that just happen to have the effect that we want.

The conclusion for CSE (and other optimizations) is to make omp.target IsolatedFromAbove. This seems to be the most practical approach. This is something that may have to be done for other ops as well if optimizations cause problems because of omp op semantics.
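For illustration, with IsolatedFromAbove the region cannot implicitly capture SSA values from the enclosing scope; everything must enter through explicit operands and block arguments, which is what blocks CSE/hoisting across the boundary. A sketch in the informal style used above (the map_entries syntax is an assumption):
```
omp.target map_entries(%x -> %arg0 : !llvm.ptr) {
  // only block arguments like %arg0 are visible inside the region;
  // nothing from the outer scope leaks in
  %v = llvm.load %arg0 -> i32
  omp.terminator
}
```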

@shraiysh (Member, Author)

The IsolatedFromAbove trait also makes sense for the omp.canonical_loop operation. We just need to resolve the issue with branching in the structured region with the yields. I have elaborated on this issue here

@jsjodin (Contributor) commented Oct 2, 2023

More discussions here: #67720

@arsenm (Contributor) commented Feb 6, 2024

Is this still relevant?

@shraiysh closed this Jun 2, 2024