[MLIR][OpenMP] Add omp.simd operation #79843

Closed

Conversation

skatrak
Member

@skatrak skatrak commented Jan 29, 2024

This patch introduces the omp.simd operation. In contrast to the existing omp.simdloop operation, it is intended to hold SIMD information within worksharing loops, rather than representing a SIMD-only loop. Some examples of such loops are "omp do/for simd", "omp distribute simd", "omp target teams distribute parallel do/for simd", etc. For more context on this work, refer to PR #79559.

This operation must always be nested within an omp.wsloop operation as its only non-terminator child. It follows the same approach as the omp.distribute operation, by serving as a simple wrapper operation holding clause information.
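As a quick sketch of the intended nesting (the clause values here are placeholders, not part of the patch), the IR introduced by this change is expected to look roughly like this:

```mlir
omp.wsloop for (%i) : index = (%c0) to (%c10) step (%c1) {
  omp.simd simdlen(4) safelen(8) {
    // loop body operations
    omp.yield
  }
  omp.yield
}
```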

@llvmbot
Member

llvmbot commented Jan 29, 2024

@llvm/pr-subscribers-flang-openmp

@llvm/pr-subscribers-mlir

Author: Sergio Afonso (skatrak)

Changes

This patch introduces the omp.simd operation. In contrast to the existing omp.simdloop operation, it is intended to hold SIMD information within worksharing loops, rather than representing a SIMD-only loop. Some examples of such loops are "omp do/for simd", "omp distribute simd", "omp target teams distribute parallel do/for simd", etc. For more context on this work, refer to PR #79559.

This operation must always be nested within an omp.wsloop operation as its only non-terminator child. It follows the same approach as the omp.distribute operation, by serving as a simple wrapper operation holding clause information.


Patch is 28.95 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/79843.diff

4 Files Affected:

  • (modified) mlir/include/mlir/Dialect/OpenMP/OpenMPOps.td (+77-1)
  • (modified) mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp (+43-11)
  • (modified) mlir/test/Dialect/OpenMP/invalid.mlir (+214)
  • (modified) mlir/test/Dialect/OpenMP/ops.mlir (+181)
diff --git a/mlir/include/mlir/Dialect/OpenMP/OpenMPOps.td b/mlir/include/mlir/Dialect/OpenMP/OpenMPOps.td
index 96c15e775a3024b..50b316a21554e0b 100644
--- a/mlir/include/mlir/Dialect/OpenMP/OpenMPOps.td
+++ b/mlir/include/mlir/Dialect/OpenMP/OpenMPOps.td
@@ -505,6 +505,9 @@ def WsLoopOp : OpenMP_Op<"wsloop", [AttrSizedOperandSegments,
 
     /// Returns the number of reduction variables.
     unsigned getNumReductionVars() { return getReductionVars().size(); }
+
+    /// Returns its nested 'omp.simd' operation, if present.
+    SimdOp getNestedSimd();
   }];
   let hasCustomAssemblyFormat = 1;
   let assemblyFormat = [{
@@ -617,11 +620,84 @@ def SimdLoopOp : OpenMP_Op<"simdloop", [AttrSizedOperandSegments,
   let hasVerifier = 1;
 }
 
+def SimdOp : OpenMP_Op<"simd",
+    [AttrSizedOperandSegments, MemoryEffects<[MemWrite]>,
+     HasParent<"WsLoopOp">]> {
+ let summary = "simd construct";
+  let description = [{
+    The simd construct can be applied to a loop to indicate that the loop can be
+    transformed into a SIMD loop (that is, multiple iterations of the loop can
+    be executed concurrently using SIMD instructions).
+    
+    This operation is intended to hold SIMD information for a worksharing loop
+    (i.e. "omp for simd"), so it must always be nested inside of a parent
+    "omp.wsloop" operation as its only child. For SIMD loops not combined with a
+    worksharing loop (i.e. "omp simd"), the "omp.simdloop" is used instead.
+
+    The body region can contain any number of blocks. The region is terminated
+    by "omp.yield" instruction without operands.
+
+    The `alignment_values` attribute additionally specifies alignment of each
+    corresponding aligned operand. Note that `aligned_vars` and
+    `alignment_values` should contain the same number of elements.
+
+    When an if clause is present and evaluates to false, the preferred number of
+    iterations to be executed concurrently is one, regardless of whether
+    a simdlen clause is specified.
+
+    The optional `nontemporal` attribute specifies variables which have low
+    temporal locality across the iterations where they are accessed.
+
+    The optional `order` attribute specifies which order the iterations of the
+    associate loops are executed in. Currently the only option for this
+    attribute is "concurrent".
+
+    When a simdlen clause is present, the preferred number of iterations to be
+    executed concurrently is the value provided to the simdlen clause.
+
+    The safelen clause specifies that no two concurrent iterations within a
+    SIMD chunk can have a distance in the logical iteration space that is
+    greater than or equal to the value given in the clause.
+    ```
+    omp.wsloop for (%i) : index = (%c0) to (%c10) step (%c1) {
+      omp.simd <clauses> {
+        // block operations
+        omp.yield
+      }
+      omp.yield
+    ```
+  }];
+
+  // TODO: Add other clauses
+  let arguments = (ins Variadic<OpenMP_PointerLikeType>:$aligned_vars,
+             OptionalAttr<I64ArrayAttr>:$alignment_values,
+             Optional<I1>:$if_expr,
+             Variadic<OpenMP_PointerLikeType>:$nontemporal_vars,
+             OptionalAttr<OrderKindAttr>:$order_val,
+             ConfinedAttr<OptionalAttr<I64Attr>, [IntPositive]>:$simdlen,
+             ConfinedAttr<OptionalAttr<I64Attr>, [IntPositive]>:$safelen
+     );
+
+  let regions = (region AnyRegion:$region);
+  let assemblyFormat = [{
+    oilist(`aligned` `(`
+              custom<AlignedClause>($aligned_vars, type($aligned_vars),
+                                   $alignment_values) `)`
+          |`if` `(` $if_expr `)`
+          |`nontemporal` `(`  $nontemporal_vars `:` type($nontemporal_vars) `)`
+          |`order` `(` custom<ClauseAttr>($order_val) `)`
+          |`simdlen` `(` $simdlen  `)`
+          |`safelen` `(` $safelen  `)`
+    ) $region attr-dict
+  }];
+
+  let hasVerifier = 1;
+}
 
 def YieldOp : OpenMP_Op<"yield",
     [Pure, ReturnLike, Terminator,
      ParentOneOf<["WsLoopOp", "ReductionDeclareOp",
-     "AtomicUpdateOp", "SimdLoopOp"]>]> {
+     "AtomicUpdateOp", "SimdLoopOp", "SimdOp"]>]> {
   let summary = "loop yield and termination operation";
   let description = [{
     "omp.yield" yields SSA values from the OpenMP dialect op region and
diff --git a/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp b/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
index 13cc16125a2733e..caa888d030f7019 100644
--- a/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
+++ b/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
@@ -1131,28 +1131,33 @@ void printLoopControl(OpAsmPrinter &p, Operation *op, Region &region,
 }
 
 //===----------------------------------------------------------------------===//
-// Verifier for Simd construct [2.9.3.1]
+// Verifier for Simd constructs [2.9.3.1]
 //===----------------------------------------------------------------------===//
 
-LogicalResult SimdLoopOp::verify() {
-  if (this->getLowerBound().empty()) {
-    return emitOpError() << "empty lowerbound for simd loop operation";
-  }
-  if (this->getSimdlen().has_value() && this->getSafelen().has_value() &&
-      this->getSimdlen().value() > this->getSafelen().value()) {
-    return emitOpError()
+template <typename OpTy>
+static LogicalResult verifySimdOp(OpTy op) {
+  if (op.getSimdlen().has_value() && op.getSafelen().has_value() &&
+      op.getSimdlen().value() > op.getSafelen().value()) {
+    return op.emitOpError()
            << "simdlen clause and safelen clause are both present, but the "
               "simdlen value is not less than or equal to safelen value";
   }
-  if (verifyAlignedClause(*this, this->getAlignmentValues(),
-                          this->getAlignedVars())
+  if (verifyAlignedClause(op, op.getAlignmentValues(), op.getAlignedVars())
           .failed())
     return failure();
-  if (verifyNontemporalClause(*this, this->getNontemporalVars()).failed())
+  if (verifyNontemporalClause(op, op.getNontemporalVars()).failed())
     return failure();
   return success();
 }
 
+LogicalResult SimdLoopOp::verify() {
+  if (this->getLowerBound().empty())
+    return emitOpError() << "empty lowerbound for simd loop operation";
+  return verifySimdOp(*this);
+}
+
+LogicalResult SimdOp::verify() { return verifySimdOp(*this); }
+
 //===----------------------------------------------------------------------===//
 // Verifier for Distribute construct [2.9.4.1]
 //===----------------------------------------------------------------------===//
@@ -1329,7 +1334,34 @@ void WsLoopOp::build(OpBuilder &builder, OperationState &state,
   state.addAttributes(attributes);
 }
 
+SimdOp WsLoopOp::getNestedSimd() {
+  auto ops = this->getOps<SimdOp>();
+  assert(std::distance(ops.begin(), ops.end()) <= 1 &&
+         "There can only be a single omp.simd child at most");
+  return ops.empty() ? SimdOp() : *ops.begin();
+}
+
 LogicalResult WsLoopOp::verify() {
+  // Check that, if it has an omp.simd child, it must be the only one.
+  bool hasSimd = false, hasOther = false;
+  for (auto &op : this->getOps()) {
+    if (isa<SimdOp>(op)) {
+      if (hasSimd)
+        return emitOpError() << "cannot have multiple 'omp.simd' child ops";
+      hasSimd = true;
+
+      if (hasOther)
+        break;
+    } else if (!op.hasTrait<OpTrait::IsTerminator>()) {
+      hasOther = true;
+      if (hasSimd)
+        break;
+    }
+  }
+  if (hasSimd && hasOther)
+    return emitOpError() << "if 'omp.simd' is a child, it must be the only "
+                            "non-terminator child op";
+
   return verifyReductionVarList(*this, getReductions(), getReductionVars());
 }
 
diff --git a/mlir/test/Dialect/OpenMP/invalid.mlir b/mlir/test/Dialect/OpenMP/invalid.mlir
index 812b79e35595f04..29a6a078fad8344 100644
--- a/mlir/test/Dialect/OpenMP/invalid.mlir
+++ b/mlir/test/Dialect/OpenMP/invalid.mlir
@@ -192,6 +192,50 @@ llvm.func @test_omp_wsloop_dynamic_wrong_modifier3(%lb : i64, %ub : i64, %step :
 
 // -----
 
+llvm.func @test_omp_wsloop_simd_multiple(%lb : i64, %ub : i64, %step : i64) -> () {
+  // expected-error @+1 {{op cannot have multiple 'omp.simd' child ops}}
+  omp.wsloop for (%iv) : i64 = (%lb) to (%ub) step (%step) {
+    omp.simd {
+      omp.yield
+    }
+    omp.simd {
+      omp.yield
+    }
+    omp.yield
+  }
+  llvm.return
+}
+
+// -----
+
+llvm.func @test_omp_wsloop_simd_invalid_before(%lb : i64, %ub : i64, %step : i64) -> () {
+  // expected-error @+1 {{op if 'omp.simd' is a child, it must be the only non-terminator child op}}
+  omp.wsloop for (%iv) : i64 = (%lb) to (%ub) step (%step) {
+    %c1 = arith.constant 1 : i32
+    omp.simd {
+      omp.yield
+    }
+    omp.yield
+  }
+  llvm.return
+}
+
+// -----
+
+llvm.func @test_omp_wsloop_simd_invalid_after(%lb : i64, %ub : i64, %step : i64) -> () {
+  // expected-error @+1 {{op if 'omp.simd' is a child, it must be the only non-terminator child op}}
+  omp.wsloop for (%iv) : i64 = (%lb) to (%ub) step (%step) {
+    omp.simd {
+      omp.yield
+    }
+    %c1 = arith.constant 1 : i32
+    omp.yield
+  }
+  llvm.return
+}
+
+// -----
+
 func.func @omp_simdloop(%lb : index, %ub : index, %step : i32) -> () {
   // expected-error @below {{op failed to verify that all of {lowerBound, upperBound, step} have same type}}
   "omp.simdloop" (%lb, %ub, %step) ({
@@ -205,6 +249,18 @@ func.func @omp_simdloop(%lb : index, %ub : index, %step : i32) -> () {
 
 // -----
 
+func.func @omp_simd(%lb : index, %ub : index, %step : i32) -> () {
+  // expected-error @below {{'omp.simd' op expects parent op 'omp.wsloop'}}
+  "omp.simd" () ({
+    ^bb0(%iv: index):
+      omp.yield
+  }) {operandSegmentSizes = array<i32: 0, 0, 0>} : () -> ()
+
+  return
+}
+
+// -----
+
 func.func @omp_simdloop_pretty_aligned(%lb : index, %ub : index, %step : index,
                                        %data_var : memref<i32>) -> () {
   //  expected-error @below {{expected '->'}}
@@ -217,6 +273,20 @@ func.func @omp_simdloop_pretty_aligned(%lb : index, %ub : index, %step : index,
 
 // -----
 
+func.func @omp_simd_pretty_aligned(%lb : index, %ub : index, %step : index,
+                                   %data_var : memref<i32>) -> () {
+  omp.wsloop for (%iv) : index = (%lb) to (%ub) step (%step) {
+    //  expected-error @below {{expected '->'}}
+    omp.simd aligned(%data_var : memref<i32>) {
+      omp.yield
+    }
+    omp.yield
+  }
+  return
+}
+
+// -----
+
 func.func @omp_simdloop_aligned_mismatch(%arg0 : index, %arg1 : index,
                                          %arg2 : index, %arg3 : memref<i32>,
                                          %arg4 : memref<i32>) -> () {
@@ -231,6 +301,22 @@ func.func @omp_simdloop_aligned_mismatch(%arg0 : index, %arg1 : index,
 
 // -----
 
+func.func @omp_simd_aligned_mismatch(%arg0 : index, %arg1 : index,
+                                     %arg2 : index, %arg3 : memref<i32>,
+                                     %arg4 : memref<i32>) -> () {
+  omp.wsloop for (%arg5) : index = (%arg0) to (%arg1) step (%arg2) {
+    //  expected-error @below {{op expected as many alignment values as aligned variables}}
+    "omp.simd"(%arg3, %arg4) ({
+      "omp.yield"() : () -> ()
+    }) {alignment_values = [128],
+        operandSegmentSizes = array<i32: 2, 0, 0>} : (memref<i32>, memref<i32>) -> ()
+    omp.yield
+  }
+  return
+}
+
+// -----
+
 func.func @omp_simdloop_aligned_negative(%arg0 : index, %arg1 : index,
                                          %arg2 : index, %arg3 : memref<i32>,
                                          %arg4 : memref<i32>) -> () {
@@ -244,6 +330,21 @@ func.func @omp_simdloop_aligned_negative(%arg0 : index, %arg1 : index,
 
 // -----
 
+func.func @omp_simd_aligned_negative(%arg0 : index, %arg1 : index,
+                                     %arg2 : index, %arg3 : memref<i32>,
+                                     %arg4 : memref<i32>) -> () {
+  omp.wsloop for (%arg5) : index = (%arg0) to (%arg1) step (%arg2) {
+    //  expected-error @below {{op alignment should be greater than 0}}
+    "omp.simd"(%arg3, %arg4) ({
+      "omp.yield"() : () -> ()
+    }) {alignment_values = [-1, 128], operandSegmentSizes = array<i32: 2, 0, 0>} : (memref<i32>, memref<i32>) -> ()
+    omp.yield
+  }
+  return
+}
+
+// -----
+
 func.func @omp_simdloop_unexpected_alignment(%arg0 : index, %arg1 : index,
                                              %arg2 : index, %arg3 : memref<i32>,
                                              %arg4 : memref<i32>) -> () {
@@ -257,6 +358,21 @@ func.func @omp_simdloop_unexpected_alignment(%arg0 : index, %arg1 : index,
 
 // -----
 
+func.func @omp_simd_unexpected_alignment(%arg0 : index, %arg1 : index,
+                                             %arg2 : index, %arg3 : memref<i32>,
+                                             %arg4 : memref<i32>) -> () {
+  omp.wsloop for (%arg5) : index = (%arg0) to (%arg1) step (%arg2) {
+    //  expected-error @below {{unexpected alignment values attribute}}
+    "omp.simd"() ({
+      "omp.yield"() : () -> ()
+    }) {alignment_values = [1, 128], operandSegmentSizes = array<i32: 0, 0, 0>} : () -> ()
+    omp.yield
+  }
+  return
+}
+
+// -----
+
 func.func @omp_simdloop_aligned_float(%arg0 : index, %arg1 : index,
                                       %arg2 : index, %arg3 : memref<i32>,
                                       %arg4 : memref<i32>) -> () {
@@ -270,6 +386,21 @@ func.func @omp_simdloop_aligned_float(%arg0 : index, %arg1 : index,
 
 // -----
 
+func.func @omp_simd_aligned_float(%arg0 : index, %arg1 : index,
+                                  %arg2 : index, %arg3 : memref<i32>,
+                                  %arg4 : memref<i32>) -> () {
+  omp.wsloop for (%arg5) : index = (%arg0) to (%arg1) step (%arg2) {
+    //  expected-error @below {{failed to satisfy constraint: 64-bit integer array attribute}}
+    "omp.simd"(%arg3, %arg4) ({
+      "omp.yield"() : () -> ()
+    }) {alignment_values = [1.5, 128], operandSegmentSizes = array<i32: 2, 0, 0>} : (memref<i32>, memref<i32>) -> ()
+    omp.yield
+  }
+  return
+}
+
+// -----
+
 func.func @omp_simdloop_aligned_the_same_var(%arg0 : index, %arg1 : index,
                                              %arg2 : index, %arg3 : memref<i32>,
                                              %arg4 : memref<i32>) -> () {
@@ -283,6 +414,21 @@ func.func @omp_simdloop_aligned_the_same_var(%arg0 : index, %arg1 : index,
 
 // -----
 
+func.func @omp_simd_aligned_the_same_var(%arg0 : index, %arg1 : index,
+                                         %arg2 : index, %arg3 : memref<i32>,
+                                         %arg4 : memref<i32>) -> () {
+  omp.wsloop for (%arg5) : index = (%arg0) to (%arg1) step (%arg2) {
+    //  expected-error @below {{aligned variable used more than once}}
+    "omp.simd"(%arg3, %arg3) ({
+      "omp.yield"() : () -> ()
+    }) {alignment_values = [1, 128], operandSegmentSizes = array<i32: 2, 0, 0>} : (memref<i32>, memref<i32>) -> ()
+    omp.yield
+  }
+  return
+}
+
+// -----
+
 func.func @omp_simdloop_nontemporal_the_same_var(%arg0 : index,
                                                  %arg1 : index,
                                                  %arg2 : index,
@@ -297,6 +443,22 @@ func.func @omp_simdloop_nontemporal_the_same_var(%arg0 : index,
 
 // -----
 
+func.func @omp_simd_nontemporal_the_same_var(%arg0 : index,
+                                             %arg1 : index,
+                                             %arg2 : index,
+                                             %arg3 : memref<i32>) -> () {
+  omp.wsloop for (%arg5) : index = (%arg0) to (%arg1) step (%arg2) {
+    //  expected-error @below {{nontemporal variable used more than once}}
+    "omp.simd"(%arg3, %arg3) ({
+      "omp.yield"() : () -> ()
+    }) {operandSegmentSizes = array<i32: 0, 0, 2>} : (memref<i32>, memref<i32>) -> ()
+    omp.yield
+  }
+  return
+}
+
+// -----
+
 func.func @omp_simdloop_order_value(%lb : index, %ub : index, %step : index) {
   // expected-error @below {{invalid clause value: 'default'}}
   omp.simdloop order(default) for (%iv): index = (%lb) to (%ub) step (%step) {
@@ -307,6 +469,19 @@ func.func @omp_simdloop_order_value(%lb : index, %ub : index, %step : index) {
 
 // -----
 
+func.func @omp_simd_order_value(%lb : index, %ub : index, %step : index) {
+  omp.wsloop for (%iv) : index = (%lb) to (%ub) step (%step) {
+    // expected-error @below {{invalid clause value: 'default'}}
+    omp.simd order(default) {
+      omp.yield
+    }
+    omp.yield
+  }
+  return
+}
+
+// -----
+
 func.func @omp_simdloop_pretty_simdlen(%lb : index, %ub : index, %step : index) -> () {
   // expected-error @below {{op attribute 'simdlen' failed to satisfy constraint: 64-bit signless integer attribute whose value is positive}}
   omp.simdloop simdlen(0) for (%iv): index = (%lb) to (%ub) step (%step) {
@@ -317,6 +492,19 @@ func.func @omp_simdloop_pretty_simdlen(%lb : index, %ub : index, %step : index)
 
 // -----
 
+func.func @omp_simd_pretty_simdlen(%lb : index, %ub : index, %step : index) -> () {
+  omp.wsloop for (%iv) : index = (%lb) to (%ub) step (%step) {
+    // expected-error @below {{op attribute 'simdlen' failed to satisfy constraint: 64-bit signless integer attribute whose value is positive}}
+    omp.simd simdlen(0) {
+      omp.yield
+    }
+    omp.yield
+  }
+  return
+}
+
+// -----
+
 func.func @omp_simdloop_pretty_safelen(%lb : index, %ub : index, %step : index) -> () {
   // expected-error @below {{op attribute 'safelen' failed to satisfy constraint: 64-bit signless integer attribute whose value is positive}}
   omp.simdloop safelen(0) for (%iv): index = (%lb) to (%ub) step (%step) {
@@ -327,6 +515,19 @@ func.func @omp_simdloop_pretty_safelen(%lb : index, %ub : index, %step : index)
 
 // -----
 
+func.func @omp_simd_pretty_safelen(%lb : index, %ub : index, %step : index) -> () {
+  omp.wsloop for (%iv) : index = (%lb) to (%ub) step (%step) {
+    // expected-error @below {{op attribute 'safelen' failed to satisfy constraint: 64-bit signless integer attribute whose value is positive}}
+    omp.simd safelen(0) {
+      omp.yield
+    }
+    omp.yield
+  }
+  return
+}
+
+// -----
+
 func.func @omp_simdloop_pretty_simdlen_safelen(%lb : index, %ub : index, %step : index) -> () {
   // expected-error @below {{'omp.simdloop' op simdlen clause and safelen clause are both present, but the simdlen value is not less than or equal to safelen value}}
   omp.simdloop simdlen(2) safelen(1) for (%iv): index = (%lb) to (%ub) step (%step) {
@@ -337,6 +538,19 @@ func.func @omp_simdloop_pretty_simdlen_safelen(%lb : index, %ub : index, %step :
 
 // -----
 
+func.func @omp_simd_pretty_simdlen_safelen(%lb : index, %ub : index, %step : index) -> () {
+  omp.wsloop for (%iv) : index = (%lb) to (%ub) step (%step) {
+    // expected-error @below {{'omp.simd' op simdlen clause and safelen clause are both present, but the simdlen value is not less than or equal to safelen value}}
+    omp.simd simdlen(2) safelen(1) {
+      omp.yield
+    }
+    omp.yield
+  }
+  return
+}
+
+// -----
+
 // expected-error @below {{op expects initializer region with one argument of the reduction type}}
 omp.reduction.declare @add_f32 : f64
 init {
diff --git a/mlir/test/Dialect/OpenMP/ops.mlir b/mlir/test/Dialect/OpenMP/ops.mlir
index ccf72ae31d439ed..2487f0eb5654e13 100644
--- a/mlir/test/Dialect/OpenMP/ops.mlir
+++ b/mlir/test/Dialect/OpenMP/ops.mlir
@@ -176,6 +176,21 @@ func.func @omp_wsloop(%lb : index, %ub : index, %step : index, %data_var : memre
   }) {operandSegmentSizes = array<i32: 1,1,1,0,0,0,0>, nowait, schedule_val = #omp<schedulekind auto>} :
     (index, index, index) -> ()
 
+  // CHECK: omp.wsloop
+  // CHECK-SAME: for (%{{.*}}) : index = (%{{.*}}) to (%{{.*}}) step (%{{.*}})
+  // CHECK-NEXT: omp.simd
+  // CHECK-NEXT: omp.yield
+  // CHECK-NEXT: }
+  // CHECK-NEXT: omp.yield
+  "omp.wsloop" (%lb, %ub, %step) ({
+    ^bb0(%iv: index):
+      "omp.simd" () ({
+        omp.yield
+      }) : () -> ()
+      omp.yield
+  }) {operandSegmentSizes = array<i32: 1,1,1,0,0,0,0>} :
+    (index, index, index) -> ()
+
   return
 }
 
@@ -339,6 +354,19 @@ func.func @omp_simdloop(%lb : index, %ub : index, %step : index) -> () {
   return
 }
 
+// CHECK-LABEL: omp_simd
+func.func @omp_simd(%lb : index, %ub : index, %step : index) -> () {
+  omp.wsloop for (%iv) : index = (%lb) to (%ub) step (%step) {
+    // CHECK: omp.simd {
+    "omp.simd" () ({
+      omp.yield
+    }) {operandSeg...
[truncated]

@llvmbot
Member

llvmbot commented Jan 29, 2024

@llvm/pr-subscribers-mlir-openmp


Contributor

@DominikAdamski DominikAdamski left a comment


The OpenMP standard (https://www.openmp.org/spec-html/5.0/openmpsu42.html) adds some restrictions on the ordered, linear and firstprivate clauses for omp do simd.
I see no tests for these restrictions. Could you add them to the verifier? ordered, linear and firstprivate relate to omp do: https://www.openmp.org/spec-html/5.0/openmpsu41.html
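As an illustration only, a test along these lines could exercise the ordered restriction. The omp.wsloop ordered clause syntax follows the existing dialect, but the diagnostic text is hypothetical, since this patch does not add such a verifier check:

```mlir
llvm.func @test_omp_wsloop_simd_ordered(%lb : i64, %ub : i64, %step : i64) -> () {
  // Hypothetical diagnostic; this check is not part of the patch.
  // expected-error @+1 {{'ordered' clause with a parameter is not allowed with a nested 'omp.simd'}}
  omp.wsloop ordered(1) for (%iv) : i64 = (%lb) to (%ub) step (%step) {
    omp.simd {
      omp.yield
    }
    omp.yield
  }
  llvm.return
}
```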

@kiranchandramohan
Copy link
Contributor

kiranchandramohan commented Jan 29, 2024

  1. Please point to the relevant section in the standard.
  2. Would simd as an attribute be better?
  3. Or could worksharing-loop simd be a separate operation? Worksharing-loop simd is a composite construct so this might make sense here.
  4. How did you arrive at the set of clauses supported?

@@ -1131,28 +1131,33 @@ void printLoopControl(OpAsmPrinter &p, Operation *op, Region &region,
}

//===----------------------------------------------------------------------===//
-// Verifier for Simd construct [2.9.3.1]
+// Verifier for Simd constructs [2.9.3.1]
Contributor


Nit: 2.9.3.1 refers to the omp simd construct. Could you add a reference to omp do simd as well? That one is 2.9.3.2.

@skatrak
Member Author

skatrak commented Jan 30, 2024

Thanks Kiran for the feedback, I'll try to answer your concerns below.

  1. Please point to the relevant section in the standard.

The section describing this in the 5.0 spec is 2.9.3.2, as @DominikAdamski pointed out. I'll add references to it to the relevant places.

  2. Would simd as an attribute be better?
  3. Or could worksharing-loop simd be a separate operation? Worksharing-loop simd is a composite construct so this might make sense here.

I think these are some of the options that we were considering, so it's a matter of choosing what seems more reasonable to everyone. The representation proposed here is something like the following:

omp.wsloop for (%i) : index = (%lb) to (%up) step (%step) <workshare-loop-specific clauses> {
  omp.simd <simd-specific clauses> {
    ...
    omp.yield
  }
  omp.yield
}

The idea behind that is to follow the same approach as the omp.distribute operation. It works as a wrapper that is associated with some loop. In this case it is nested inside, holding the loop body, rather than wrapping around the loop, reflecting the order in which the various subdivisions of work happen.

One other thing that could be done is to extend the omp.wsloop operation to accept the various attributes to represent the clauses that would be applied to the SIMD construct as well. I think this is what your second point refers to. We could have it looking something like this:

omp.wsloop for simd (%i) : index = (%lb) to (%ub) step (%step) <workshare-loop + simd clauses> {
  ...
  omp.yield
}

The addition of "simd" after "for" is a way to convey there being an MLIR UnitAttr, boolean or something like this. Then, SIMD-specific arguments and attributes would be rejected by the verifier if they appear on a non-simd variant of the operation. The obvious problem with this is that it pollutes omp.wsloop and sort of also make it represent different things. We moved away from this in the omp.distribute discussions early on and I think it's the same concerns here.

I'm not against using ops to represent composite constructs, since they have their own unique behavior. So, in this case, my understanding is that it should look similar to this:

omp.wsloop_simd for (%i) : index = (%lb) to (%ub) step (%step) <workshare-loop-simd-specific clauses> {
  ...
  omp.yield
}

I guess the main reason not to go this route is that it is not done for any other composite construct. For this proposal I went with the alternative for which there is some precedent in the OpenMP dialect. Though it may make sense to do this for all composite constructs, assuming there aren't too many of them and it doesn't make the dialect unnecessarily complicated/redundant.

4. How did you arrive at the set of clauses supported?

These are just the clauses that were supported by omp.simdloop that weren't related to the loop range and step.

Thinking about it a bit more, it would be possible to make some sort of 2-level representation system for composite constructs. The first level would be to just represent each allowed composite construct as its own MLIR operation, which is what the frontend would produce. Then, there would be an OpenMP dialect MLIR pass to split them up according to their semantics. In this case, from the omp.wsloop_simd example above, we would produce something like this (assuming the worksharing loop schedule doesn't prevent doing it like this):

omp.wsloop (%ii) : index = (%lb) to (%ub) step (%block_size) <worksharing-loop-specific clauses> {
  <calculate bounds of SIMD loop>
  omp.simdloop (%i): index = (%block_lb) to (%block_ub) step (%block_step) <simd-specific clauses> {
    ...
    omp.yield
  }
  omp.yield
}

The good thing about something like this is that, when reaching the MLIR to LLVM IR translation, we would get the same input MLIR for equivalent source code regardless of whether app developers used a set of composite/combined constructs or spelled out each of the constructs over explicitly nested loops. The remaining problem would then be to recognize these patterns again in order to target the OpenMP runtime, where certain construct combinations can already be targeted independently.

@mjklemm
Contributor

mjklemm commented Jan 30, 2024

Thinking about it a bit more, it would be possible to make some sort of 2-level representation system for composite constructs. The first level would be to just represent each allowed composite construct as its own MLIR operation, which is what the frontend would produce.

While it's not strictly related to this PR, since this is a short-term implementation that will likely be replaced by the final implementation, I'd be careful with this in the long run. The OpenMP API has several composite constructs and the list might grow with 6.0. If each of them needs its own MLIR operation, we will end up with many MLIR operations. So we need to find a scalable representation that can accommodate all possible combinations without polluting the MLIR operation space too much.

@kiranchandramohan
Contributor

Thinking about it a bit more, it would be possible to make some sort of 2-level representation system for composite constructs. The first level would be to just represent each allowed composite construct as its own MLIR operation, which is what the frontend would produce. Then, there would be an OpenMP dialect MLIR pass to split them up according to their semantics.

I think the representation of composite constructs in the dialect would be necessary. Whether we can split in the MLIR layer without inserting runtime calls, I am not 100% sure. If we have to insert runtime calls then that has to be delayed till we interface with OpenMPIRBuilder. Also, the existing lowering of worksharing-loops in the OpenMPIRBuilder/Translation might already be forming the inner loops. So this might not directly apply. We also have other transformations like collapse that can affect the structure of the loop.

@skatrak
Member Author

skatrak commented Feb 1, 2024

I've gone through all the legal OpenMP combined and composite constructs (I hope I didn't miss any) to see which additional operations would be needed in order to represent composite constructs independently, and also to compare how they would look in MLIR next to "wrapper" ops similar to what's been proposed for omp.distribute.

The table below lists all these combined/composite constructs as they would be represented using composite ops or wrapper ops (taking some liberties with ops that aren't in the dialect yet). In summary, what I found is that as of OpenMP 5.2 we'd only need to add 5 of these operations: omp.wssimdloop, omp.distsimdloop, omp.distparwsloop, omp.distparwssimdloop and omp.tasksimdloop (plus converting omp.distribute into omp.distloop, which would represent the loop itself, to follow the same pattern).

The main problem I see with these is redundancy. The set of accepted arguments for each of these operations would be the union of all accepted arguments of its leaf constructs, and they also all turn out to represent loops (with indices, ranges, etc.). I suppose the second concern could be addressed by keeping all loop-related information in the upcoming omp.canonical_loop and having these ops take the corresponding CLI as an argument or something similar. Maybe there are ways to avoid repeating the same lists of arguments in TableGen if my first concern turns out to be a real maintainability issue.

So I think it might indeed be a good idea to create these composite MLIR operations, since they represent distinct behaviors and their shortcomings seem solvable in principle. With regards to the options of decomposing operations in an MLIR pass or leaving that to the MLIR to LLVM IR translation stage, I guess we could figure that out over time. The second approach would probably be the easiest to start with.

It'd be interesting to hear if @mjklemm, @jsjodin, @DominikAdamski, @kparzysz or @agozillon have any comments on this.

P.S. Sorry for the massive table, couldn't make it smaller 😅.

For each construct listed below, the first snippet shows its representation using composite ops and the second its representation using wrapper ops.

!$omp do simd

omp.wssimdloop %i... {
  BODY
}
omp.wsloop %i... {
  omp.simd {
    BODY
  }
}

!$omp distribute simd

omp.distsimdloop %i... {
  BODY
}
omp.distribute {
  omp.simdloop %i... {
    BODY
  }
}

!$omp distribute parallel do

omp.distparwsloop %i... {
  BODY
}
omp.distribute {
  omp.parallel {
    omp.wsloop %i... {
      BODY
    }
  }
}

!$omp distribute parallel do simd

omp.distparwssimdloop %i... {
  BODY
}
omp.distribute {
  omp.parallel {
    omp.wsloop %i... {
      omp.simd {
        BODY
      }
    }
  }
}

!$omp taskloop simd

omp.tasksimdloop %i... {
  BODY
}
omp.taskloop %i... {
  omp.simd {
    BODY
  }
}

!$omp parallel do

omp.parallel {
  omp.wsloop %i... {
    BODY
  }
}
omp.parallel {
  omp.wsloop %i... {
    BODY
  }
}

!$omp parallel loop

omp.parallel {
  omp.loop %i... {
    BODY
  }
}
omp.parallel {
  omp.loop %i... {
    BODY
  }
}

!$omp parallel sections

omp.parallel {
  omp.sections {
    BODY
  }
}
omp.parallel {
  omp.sections {
    BODY
  }
}

!$omp parallel workshare

omp.parallel {
  omp.workshare {
    BODY
  }
}
omp.parallel {
  omp.workshare {
    BODY
  }
}

!$omp parallel do simd

omp.parallel {
  omp.wssimdloop %i... {
    BODY
  }
}
omp.parallel {
  omp.wsloop %i... {
    omp.simd {
      BODY
    }
  }
}

!$omp parallel masked

omp.parallel {
  omp.masked {
    BODY
  }
}
omp.parallel {
  omp.masked {
    BODY
  }
}

!$omp masked taskloop

omp.masked {
  omp.taskloop %i... {
    BODY
  }
}
omp.masked {
  omp.taskloop %i... {
    BODY
  }
}

!$omp masked taskloop simd

omp.masked {
  omp.tasksimdloop %i... {
    BODY
  }
}
omp.masked {
  omp.taskloop %i... {
    omp.simd {
      BODY
    }
  }
}

!$omp parallel masked taskloop

omp.parallel {
  omp.masked {
    omp.taskloop %i... {
      BODY
    }
  }
}
omp.parallel {
  omp.masked {
    omp.taskloop %i... {
      BODY
    }
  }
}

!$omp parallel masked taskloop simd

omp.parallel {
  omp.masked {
    omp.tasksimdloop %i... {
      BODY
    }
  }
}
omp.parallel {
  omp.masked {
    omp.taskloop %i... {
      omp.simd {
        BODY
      }
    }
  }
}

!$omp teams distribute

omp.teams {
  omp.distloop %i... {
    BODY
  }
}
omp.teams {
  omp.distribute {
    omp.wsloop %i... {
      BODY
    }
  }
}

!$omp teams distribute simd

omp.teams {
  omp.distsimdloop %i... {
    BODY
  }
}
omp.teams {
  omp.distribute {
    omp.simdloop %i... {
      BODY
    }
  }
}

!$omp teams distribute parallel do

omp.teams {
  omp.distparwsloop %i... {
    BODY
  }
}
omp.teams {
  omp.distribute {
    omp.parallel {
      omp.wsloop %i... {
        BODY
      }
    }
  }
}

!$omp teams distribute parallel do simd

omp.teams {
  omp.distparwssimdloop %i... {
    BODY
  }
}
omp.teams {
  omp.distribute {
    omp.parallel {
      omp.wsloop %i... {
        omp.simd {
          BODY
        }
      }
    }
  }
}

!$omp teams loop

omp.teams {
  omp.loop %i... {
    BODY
  }
}
omp.teams {
  omp.loop %i... {
    BODY
  }
}

!$omp target parallel

omp.target {
  omp.parallel {
    BODY
  }
}
omp.target {
  omp.parallel {
    BODY
  }
}

!$omp target parallel do

omp.target {
  omp.parallel {
    omp.wsloop %i... {
      BODY
    }
  }
}
omp.target {
  omp.parallel {
    omp.wsloop %i... {
      BODY
    }
  }
}

!$omp target parallel do simd

omp.target {
  omp.parallel {
    omp.wssimdloop %i... {
      BODY
    }
  }
}
omp.target {
  omp.parallel {
    omp.wsloop %i... {
      omp.simd {
        BODY
      }
    }
  }
}

!$omp target parallel loop

omp.target {
  omp.parallel {
    omp.loop %i... {
      BODY
    }
  }
}
omp.target {
  omp.parallel {
    omp.loop %i... {
      BODY
    }
  }
}

!$omp target simd

omp.target {
  omp.simdloop %i... {
    BODY
  }
}
omp.target {
  omp.simdloop %i... {
    BODY
  }
}

!$omp target teams

omp.target {
  omp.teams {
    BODY
  }
}
omp.target {
  omp.teams {
    BODY
  }
}

!$omp target teams distribute

omp.target {
  omp.teams {
    omp.distloop %i... {
      BODY
    }
  }
}
omp.target {
  omp.teams {
    omp.distribute {
      omp.wsloop %i... {
        BODY
      }
    }
  }
}

!$omp target teams distribute simd

omp.target {
  omp.teams {
    omp.distsimdloop %i... {
      BODY
    }
  }
}
omp.target {
  omp.teams {
    omp.distribute {
      omp.simdloop %i... {
        BODY
      }
    }
  }
}

!$omp target teams loop

omp.target {
  omp.teams {
    omp.loop %i... {
      BODY
    }
  }
}
omp.target {
  omp.teams {
    omp.loop %i... {
      BODY
    }
  }
}

!$omp target teams distribute parallel do

omp.target {
  omp.teams {
    omp.distparwsloop %i... {
      BODY
    }
  }
}
omp.target {
  omp.teams {
    omp.distribute {
      omp.parallel {
        omp.wsloop %i... {
          BODY
        }
      }
    }
  }
}

!$omp target teams distribute parallel do simd

omp.target {
  omp.teams {
    omp.distparwssimdloop %i... {
      BODY
    }
  }
}
omp.target {
  omp.teams {
    omp.distribute {
      omp.parallel {
        omp.wsloop %i... {
          omp.simd {
            BODY
          }
        }
      }
    }
  }
}

@mjklemm
Copy link
Contributor

mjklemm commented Feb 1, 2024

Wow! That's impressive, thanks for compiling this. I have two comments:

  • I'd recommend splitting combined constructs and composite constructs into distinct tables.
  • OpenMP 6.0 will greatly increase the number of these constructs, so a general solution will be desired in the long run.

@DominikAdamski
Copy link
Contributor

@skatrak Thanks for the great work summarizing the OpenMP constructs.

Let me express my thoughts:

  1. Lowering of composite operations will be harder because we will need to combine several lowering steps into one operation (for example, omp.distparwsloop will require generating two runtime calls on the device: one for enabling parallel execution -> kmpc_parallel_51 and one for the worksharing loop). Wrapper operations are aligned with the current code generation scheme.
  2. Some composite operations can denote the same thing. For example, omp.wssimdloop and omp.wsloop are exactly the same if simd length = 1. Having said that, the MLIR optimization opportunities can be fewer for composite operations.
  3. Maybe we have to split the OpenMP dialect into two sub-dialects. The high-level dialect will contain composite operations; the lower one will reflect the LLVM IR code structure. An MLIR lowering pass between them could then simplify the OpenMPIRBuilder logic.
  4. I don't know how reductions will interact with composite operations.

@skatrak
Copy link
Member Author

skatrak commented Feb 2, 2024

Wow! That's impressive, thanks for compiling this. I have two comments:

* I'd recommend splitting combined constructs and composite constructs into distinct tables.

* OpenMP 6.0 will greatly increase the number of these constructs, so a general solution will be desired in the long run.

Thanks for giving it a look. With regards to the first point, the 5 composite constructs that I was able to identify are in the first 5 rows of the table. The rest are combined constructs, which in some cases contain one of these composite constructs inside. This can be seen in the composite-ops column, where nesting of one operation inside another represents 'combination', whereas 'composition' is represented by an individual operation whose name includes multiple leaf constructs. Maybe the table can be split after the first 5 rows, but it seems misleading to me to say that e.g. teams distribute simd is combined because it represents TEAMS with a single DISTRIBUTE SIMD (composite) nested inside. That's why I put everything together in a single table.

Concerning the increase in the number of these constructs, what are these additions related to? Are they loop transformations or are there going to be significant additions to parallelism generation/control and work distribution constructs? Combined constructs can already be represented in a scalable way, through nesting of ops, so I was thinking that we'd only have to represent composite constructs related to the last two categories, because loop transformations would be handled independently as well. I don't know whether there are caveats to that, but my thinking was that after introducing omp.canonical_loop we could have something like the following:

%cli = omp.canonical_loop %i... {
  BODY
}

// loop transformations (e.g. %1 = omp.tile %cli...) go here, before execution, resulting in a single loop nest stored in %loop

// !$omp do simd (composite construct)
omp.wssimdloop %loop <do,simd-clauses>

// !$omp parallel do (combined construct)
omp.parallel <parallel-clauses> {
  omp.wsloop %loop <do-clauses>
}

// !$omp teams distribute parallel do (combined construct with composite construct inside)
omp.target <target-clauses> {
  omp.teams <teams-clauses> {
    omp.distparwsloop %loop <distribute,parallel,do-clauses>
  }
}

In that case, we hopefully shouldn't have to add many new operations: only the set of 5 above and those related to new parallelism generation/control and work distribution constructs. How omp.canonical_loop is going to be defined and used is still under discussion, so it may end up looking very different from this, but the idea of using a set of single/composite operations to express how a loop is supposed to run, independently of its transformations, may make some sense.

Maybe we could even split these composite constructs into their leaf ops instead. Here, the composite behavior is built in advance by chaining the loop returned by one op as the input of the next, but I'm not sure whether that's an improvement:

%cli = omp.canonical_loop %i... {
  BODY
}

// loop transformations (e.g. %1 = omp.tile %cli...) go here, before execution, resulting in a single loop nest stored in %loop

// !$omp do simd (composite construct)
%1 = omp.wsloop %loop <do-clauses>
%2 = omp.simdloop %1 <simd-clauses>
omp.execute %2

// !$omp parallel do (combined construct)
%1 = omp.wsloop %loop <do-clauses>
omp.parallel <parallel-clauses> {
  omp.execute %1
}

// !$omp target teams distribute parallel do (combined construct with composite construct inside)
%1 = omp.distribute %loop <distribute-clauses>
%2 = omp.parwsloop %1 <parallel,do-clauses> // Not sure if this could be expressed in a better way
omp.target <target-clauses> {
  omp.teams <teams-clauses> {
    omp.execute %2
  }
}

@mjklemm
Copy link
Contributor

mjklemm commented Feb 7, 2024

Thanks for giving it a look. With regards to the first point, the 5 composite constructs that I was able to identify are in the first 5 rows of the table. [...] Maybe the table can be split after the first 5 rows, but it seems misleading to me to say that e.g. teams distribute simd is combined because it represents TEAMS with a single DISTRIBUTE SIMD (composite) nested inside. That's why I put everything together in a single table.

Ok, that makes sense to me.

Concerning the increase in the number of these constructs, what are these additions related to? Are they loop transformations or are there going to be significant additions to parallelism generation/control and work distribution constructs?

Sort of. So, there will be a set of rules that enable many more combinations of combined constructs. While the debate has not quite settled yet and the OpenMP ARB is looking at the number of new combinations, I'd expect quite a few to be added, e.g., taskloop.

Plus, there will be new loop-transforming directives, such as loop splitting, loop reversal, etc., which you will be able to combine freely, for instance:

!$omp tile size(4,4,4,4) apply(intratile:nothing,nothing,nothing,unroll)
<loop-nest>

I don't know whether there are caveats to that, but my thinking was that after introducing omp.canonical_loop we could have something like the following:

I do not know either. If things are already implemented such that new (loop) constructs and combinations thereof can easily be added, it's great!

@jsjodin
Copy link
Contributor

jsjodin commented Feb 7, 2024

Using wrapper ops seems to be the better option imo, since it should be easier to extend and we avoid the combinatorial explosion. One question: are there cases where writing something as two separate omp directives vs. a single one would change the semantics, e.g. in one case it is a combined construct but in the other it isn't? I believe another issue with not using wrapper ops is that it may not be possible to create the combined ops directly if there are loop transformation ops present, which means the wrapper ops would have to be used anyway. To clarify: if the loop transformations are handled by the OpenMPIRBuilder, then there is no point in the compilation where combined ops could be created. I'm not convinced that is the right approach though; it would make more sense to me to do the loop transformations first and then do the lowering.

There is another problem which @DominikAdamski mentioned and that I've been trying to get some clarity on: how to communicate information between op lowerings, since there are dependencies between e.g. distribute and parallel/wsloop that will have to be resolved. In the case of loops there is a proposal to use CLIs, but that is only one kind of information that needs to be communicated; another is reduction information, and there are other cases as well (collapse?). I think it makes sense for the OpenMPIRBuilder to keep track of these things. @kiranchandramohan suggested using block arguments for CLIs, which probably fits better with this approach, and the solution would be more uniform compared to special handling for CLIs. Another option would be to have other kinds of lowering information represented in MLIR as values, like CLIs, but if CLIs get expanded to hold other information this might degenerate into having a single value linking all the omp ops, which would just represent the state in the OpenMPIRBuilder. This is why I'm leaning towards wrapper ops with a simple recursive traversal and the OpenMPIRBuilder keeping the information that needs to be passed between the op lowerings.

@skatrak
Copy link
Member Author

skatrak commented Feb 7, 2024

@skatrak Thanks for great work with summarizing the OpenMP constructs.

Let me express my thoughts:

1. Lowering of composite operations will be harder because we will need to combine several lowering steps into one operation (for example, `omp.distparwsloop` will require generating two runtime calls on the device: one for enabling parallel execution -> kmpc_parallel_51 and one for the worksharing loop). Wrapper operations are aligned with the current code generation scheme.

2. Some composite operations can denote the same thing. For example, `omp.wssimdloop` and `omp.wsloop` are exactly the same if `simd length = 1`. Having said that, the MLIR optimization opportunities can be fewer for composite operations.

3. Maybe we have to split the OpenMP dialect into two sub-dialects. The high-level dialect will contain composite operations; the lower one will reflect the LLVM IR code structure. An MLIR lowering pass between them could then simplify the OpenMPIRBuilder logic.

4. I don't know how reductions will interact with composite operations.

Thanks Dominik for sharing your thoughts on this, and excuse the delay in getting back to you. I'll try to share what I think about these.

  1. I think it should be possible to address this issue with a minor refactoring. In the OpenMP to LLVM IR translation stage, we currently have convertOmp<Op-Name> functions that we call for each of the defined MLIR operations. It would be possible to create some convertOmp<Composite-Name> that, instead of re-implementing all that, could call some outlined subset of the corresponding convertOmp<Op-Name> functions together with any other special codegen that may be needed. Maybe these outlined functions would take as arguments new MLIR interfaces representing each single construct that can be part of a composite one.
  2. In the case of SIMD, where it would be legal to "ignore" the construct and generate code for width=1, it should be fine to just call the non-SIMD lowering function for the cases where we don't currently support or want vectorization. I'm not sure about the potential for missing MLIR optimizations by creating composite operations.
  3. I agree that this is a possibility as well, and it was something mentioned before. The two options are to do this in MLIR with a higher-level (composite ops) and a lower-level (single ops) dialect, as you say, or to deal with the splitting when lowering to LLVM IR. It would be a matter of agreeing on a path forward, but both options should be possible.
  4. I'm not sure about this one either.

@skatrak
Copy link
Member Author

skatrak commented Feb 7, 2024

Thanks Jan for adding to the discussion.

Using wrapper ops seems to be the better option imo, since it should be easier to extend and we avoid the combinatorial explosion.

Here I think the main factor would be how many new composite constructs get added, because combined constructs are already represented as wrapper ops (and should remain so), and they make up most of the table above. If there's a chance that many new composite constructs are added (or even one or two new single constructs that can compose with all existing ones), then the option of having composite ops really makes no sense. As it stands right now, both alternatives still make sense.

One question: are there cases where writing something as two separate omp directives vs. a single one would change the semantics, e.g. in one case it is a combined construct but in the other it isn't?

Yes, so a combined construct can be split into two constructs where the second is nested directly inside the first. The first is, as far as I can tell, always a leaf construct (i.e. not combined or composite). So, for example, parallel do is a combined construct that means the same as parallel with a single do nested as its only child. Composite constructs are the ones where this is not the case, like distribute parallel do, for example, which has its own semantics (iterations of the associated collapsed loop nest are run in parallel by threads of multiple teams). In that case, it wouldn't even be syntactically correct to create a distribute construct with a nested parallel do construct.
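To make that distinction concrete, here is a rough C/C++ sketch (body() is just a placeholder for the loop body):

// Combined construct: 'parallel for' means the same as a 'parallel'
// construct with a single 'for' nested directly inside it.
#pragma omp parallel for
for (int i = 0; i < n; ++i)
  body(i);

// ...is equivalent to:
#pragma omp parallel
{
  #pragma omp for
  for (int i = 0; i < n; ++i)
    body(i);
}

// Composite construct: 'distribute parallel for' has its own semantics and
// has no such nested spelling, since 'distribute' must be directly
// associated with a canonical loop nest ('teams' is added here only because
// 'distribute' has to be nested inside it).
#pragma omp teams distribute parallel for
for (int i = 0; i < n; ++i)
  body(i);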

I believe another issue with not using wrapper ops is that it may not be possible to create the combined ops directly if there are loop transformation ops present, which means the wrapper ops would have to be used anyway. To clarify: if the loop transformations are handled by the OpenMPIRBuilder, then there is no point in the compilation where combined ops could be created. I'm not convinced that is the right approach though; it would make more sense to me to do the loop transformations first and then do the lowering.

I'm not as familiar with the discussions on representing loop transformations in the OpenMP dialect, but I think that, whatever it ends up looking like, it should be largely independent from the parallelism-generating and work-distributing operations. By that I mean that it shouldn't matter whether a loop has been transformed or not by the time we want to run it. We should try to split the what from the how (what loop to run vs how to execute the loop).

There is another problem which @DominikAdamski mentioned and that I've been trying to get some clarity on: how to communicate information between op lowerings, since there are dependencies between e.g. distribute and parallel/wsloop that will have to be resolved. In the case of loops there is a proposal to use CLIs, but that is only one kind of information that needs to be communicated; another is reduction information, and there are other cases as well (collapse?). I think it makes sense for the OpenMPIRBuilder to keep track of these things. @kiranchandramohan suggested using block arguments for CLIs, which probably fits better with this approach, and the solution would be more uniform compared to special handling for CLIs. Another option would be to have other kinds of lowering information represented in MLIR as values, like CLIs, but if CLIs get expanded to hold other information this might degenerate into having a single value linking all the omp ops, which would just represent the state in the OpenMPIRBuilder. This is why I'm leaning towards wrapper ops with a simple recursive traversal and the OpenMPIRBuilder keeping the information that needs to be passed between the op lowerings.

While I don't currently get the full picture, the dependencies you mention between distribute and parallel/wsloop seem to stem from the fact that distribute parallel do is a composite construct. So maybe the solution is not to share lowering state between these but rather to recognize this is another construct separate from parallel do and to handle it independently, possibly sharing a good amount of code with that other combined construct.

@DominikAdamski
Copy link
Contributor

DominikAdamski commented Feb 8, 2024

Let me add some points to the discussion:

1. Exact representation with the wrapper approach.
The wrapper approach allows expressing the link between the OpenMP code and the MLIR code.
Code:

omp.distribute {
  omp.wsloop (%i ... ) {
  }
} 

denotes:

#pragma omp distribute
for (...)

The user is not allowed to specify another pragma after the distribute construct. That's why we always know what the intention of the OpenMP user was. IMO we don't need to know whether a given construct is composite or combined at the MLIR level; what matters more is the user's intention. When we introduce the omp.simd operation to denote that the loop body should be executed with SIMD instructions, we can clearly differentiate between:

#pragma omp parallel
#pragma omp simd
 for (...)

and:

#pragma omp parallel
#pragma omp for simd
for (...)

2. The upstream definition of omp.distribute implies the wrapper approach
The upstream definition of the distribute construct describes omp.distribute as a wrapper operation similar to omp.parallel. I think that the proposed omp.simd operation follows the same pattern.

3. We have pretty good support for omp.wsloop inside the OpenMPIRBuilder
The OpenMPIRBuilder can handle the collapse clause, different schedules and reductions for omp.wsloop. There is no need to refactor this code for the wrapper approach.

4. omp.canonical_loop
I hope that the wrapper approach will be more future-proof (fewer operations) and that it will play well with omp.canonical_loop.


      if (hasOther)
        break;
    } else if (!op.hasTrait<OpTrait::IsTerminator>()) {
Copy link
Member


Does the omp.wsloop support other terminators than omp.yield? If not, should we verify that it is indeed a yield?

Copy link
Member Author


By looking at the OpenMP dialect definition and existing verifiers, it appears that omp.terminator is currently a valid terminator for omp.wsloop (omp.yield does restrict which ops can be its parent, but there's no such restriction for omp.terminator), so rather than introducing restrictions that don't currently exist (not sure whether intentionally or not), I decided to just accept any terminator here.
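For illustration, here is a minimal C++ sketch of what such a check could look like. The function name and return convention are assumptions for illustration only; just the hasOther/IsTerminator logic mirrors the fragment shown above.

// Hypothetical sketch: look for a single omp.simd nested in the wsloop
// region, skipping whatever terminator closes it (omp.yield and
// omp.terminator both carry the IsTerminator trait).
static SimdOp findNestedSimd(Region &region) {
  SimdOp nested;
  bool hasOther = false;
  for (Operation &op : region.getOps()) {
    if (auto simd = dyn_cast<SimdOp>(op)) {
      if (nested) {
        // A second omp.simd disqualifies the wrapper.
        hasOther = true;
        break;
      }
      nested = simd;
    } else if (!op.hasTrait<OpTrait::IsTerminator>()) {
      // Any other non-terminator sibling also disqualifies it.
      hasOther = true;
      break;
    }
  }
  return hasOther ? SimdOp() : nested;
}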

@jsjodin
Copy link
Contributor

jsjodin commented Feb 8, 2024

While I don't currently get the full picture, the dependencies you mention between distribute and parallel/wsloop seem to stem from the fact that distribute parallel do is a composite construct. So maybe the solution is not to share lowering state between these but rather to recognize this is another construct separate from parallel do and to handle it independently, possibly sharing a good amount of code with that other combined construct.

Below is one example that hopefully illustrates the various issues.

#pragma omp target teams distribute parallel for
#pragma omp tile sizes(2)
for (int i = 0; i < N; ++i) {
  code(i);
}

is equivalent to:

#pragma omp target teams distribute parallel for
for (int ii = 0; ii < N; ii+=2) {
  for (int i = ii; i < ii+2; ++i) {
    code(i);
  }
}

If the lowering of the #pragma omp tile sizes(2) happens in the OMPIRBuilder, that means we would have to determine at lowering time that the resulting outer loop should be distribute parallel for, while the inner loop is sequential. If we are also bound by the current lowering mechanism, meaning a single pass over the MLIR module, then I think that implies there is no way to create combined constructs ahead of time, the ops have to be individual wrapper ops, and there has to be some information sharing between them during lowering. Also consider the case if collapse(2) was present, which would affect the codegen for the inner loop differently.

@skatrak
Copy link
Member Author

skatrak commented Feb 9, 2024

While I don't currently get the full picture, the dependencies you mention between distribute and parallel/wsloop seem to stem from the fact that distribute parallel do is a composite construct. So maybe the solution is not to share lowering state between these but rather to recognize this is another construct separate from parallel do and to handle it independently, possibly sharing a good amount of code with that other combined construct.

Below is one example that hopefully illustrates the various issues.

#pragma omp target teams distribute parallel for
#pragma omp tile sizes(2)
for (int i = 0; i < N; ++i) {
  code(i);
}

is equivalent to:

#pragma omp target teams distribute parallel for
for (int ii = 0; ii < N; ii+=2) {
  for (int i = ii; i < ii+2; ++i) {
    code(i);
  }
}

If the lowering of the #pragma omp tile sizes(2) happens in the OMPIRBuilder, that means we would have to determine at lowering time that the resulting outer loop should be distribute parallel for, while the inner loop is sequential. If we are also bound by the current lowering mechanism, meaning a single pass over the MLIR module, then I think that implies there is no way to create combined constructs ahead of time, the ops have to be individual wrapper ops, and there has to be some information sharing between them during lowering. Also consider the case if collapse(2) was present, which would affect the codegen for the inner loop differently.

Thanks for the clarification. However, I still can't quite understand how using composite operations to represent distribute-parallel-for in that example vs wrapper ops is any different with regards to this loop transformation problem. If we're able to represent the loop transformations in MLIR, even if we only actually produce these loops in the OMPIRBuilder due to it being the place where loop transformations are eventually applied, we should be able to then apply the corresponding parallelization/worksharing constructs to the resulting top-level loop. Using your example above, and making assumptions as to how loops and transformations might be represented in MLIR:

// Define loop body
%loop = omp.canonical_loop %i : i32 context(...) {
  // Use additional block args to access context values defined in the execution site
  llvm.call @code(%i) : (i32) -> ()
  omp.terminator
} (...) -> !omp.loop

// Represent transformations
%tloop = omp.tile %loop { sizes=[2] } : (!omp.loop) -> (!omp.loop)

// OPTION 1: Composite ops
omp.target {
  %n = ... : i32
  %c0 = arith.constant 0 : i32
  %c1 = arith.constant 1 : i32
  omp.teams {
    // Execute loop, specifying pre-transformations loop range
    omp.distparwsloop %tloop from(%c0) to(%n) step(%c1) context(...) : (!omp.loop, i32, i32, i32, ...) -> ()
    omp.terminator
  }
  omp.terminator
}

// OPTION 2: Wrapper ops
omp.target {
  %n = ... : i32
  %c0 = arith.constant 0 : i32
  %c1 = arith.constant 1 : i32
  omp.teams {
    omp.distribute {
      omp.parallel {
        // Execute loop, specifying pre-transformations loop range
        omp.wsloop %tloop from(%c0) to(%n) step(%c1) context(...) : (!omp.loop, i32, i32, i32, ...) -> ()
        omp.terminator
      }
      omp.terminator
    }
    omp.terminator
  }
  omp.terminator
}

In the example above, there are probably some challenges related to how to lower the loop body and transformations to it since it is defined outside of the place where it is eventually executed. Maybe it should be ignored at first and then processed when lowering the ops that run it (omp.distparwsloop or omp.wsloop in this case), or maybe it should be generated in a temporary outlined function which is then called or inlined at the point where it is run, passing loop bounds, step and context as arguments. But I think this is no more or less complicated regardless of how we represent the parallelism/worksharing constructs used to represent the loop's execution.

From the MLIR to LLVM IR translation perspective, the wrapper approach would make the following calls:

  • convertOmpTarget
  • convertOmpTeams
  • convertOmpDistribute
  • convertOmpParallel
  • convertOmpWsLoop

Whereas the composite approach would make the following calls:

  • convertOmpTarget
  • convertOmpTeams
  • convertOmpDistParWsLoop

It seems to me that convertOmpDistParWsLoop could be split up into calls to the equivalent of convertOmpDistribute + convertOmpParallel + convertOmpWsLoop if that's the best way to deal with that composite construct. If anything, it looks like it would give us some flexibility as to how to share information (for e.g. reductions) across these partial translations of the composite construct rather than having to add more state to the OMPIRBuilder.
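A rough C++ sketch of that splitting is below. The emit*Part helpers, the CompositeLoweringState type and the exact signatures are made up for illustration; the point is only that shared information could live in a local object passed between outlined helpers rather than in additional OpenMPIRBuilder state.

// Hypothetical sketch: per-leaf helpers outlined from the existing
// convertOmpDistribute/convertOmpParallel/convertOmpWsLoop functions.
struct CompositeLoweringState {
  // Shared data (e.g. reduction info, loop bounds) exchanged between the
  // helpers while lowering one composite operation.
};

static LogicalResult emitDistributePart(Operation &op,
                                        llvm::IRBuilderBase &builder,
                                        LLVM::ModuleTranslation &translation,
                                        CompositeLoweringState &state);
static LogicalResult emitParallelPart(Operation &op,
                                      llvm::IRBuilderBase &builder,
                                      LLVM::ModuleTranslation &translation,
                                      CompositeLoweringState &state);
static LogicalResult emitWsLoopPart(Operation &op,
                                    llvm::IRBuilderBase &builder,
                                    LLVM::ModuleTranslation &translation,
                                    CompositeLoweringState &state);

static LogicalResult
convertOmpDistParWsLoop(Operation &op, llvm::IRBuilderBase &builder,
                        LLVM::ModuleTranslation &translation) {
  CompositeLoweringState state;
  // Delegate to the outlined helpers, threading the shared state through.
  if (failed(emitDistributePart(op, builder, translation, state)) ||
      failed(emitParallelPart(op, builder, translation, state)) ||
      failed(emitWsLoopPart(op, builder, translation, state)))
    return failure();
  return success();
}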

@kiranchandramohan
Copy link
Contributor

While I don't currently get the full picture, the dependencies you mention between distribute and parallel/wsloop seem to stem from the fact that distribute parallel do is a composite construct. So maybe the solution is not to share lowering state between these but rather to recognize this is another construct separate from parallel do and to handle it independently, possibly sharing a good amount of code with that other combined construct.

Below is one example that hopefully illustrates the various issues.

#pragma omp target teams distribute parallel for
#pragma omp tile sizes(2)
for (int i = 0; i < N; ++i) {
  code(i);
}

is equivalent to:

#pragma omp target teams distribute parallel for
for (int ii = 0; ii < N; ii+=2) {
  for (int i = ii; i < ii+2; ++i) {
    code(i);
  }
}

If the lowering of the #pragma omp tile sizes(2) happens in the OMPIRBuilder, that means we would have to determine at lowering time that the resulting outer loop should be distribute parallel for, while the inner loop is sequential. If we are also bound by the current lowering mechanism, meaning a single pass over the MLIR module, then I think that implies there is no way to create combined constructs ahead of time, the ops have to be individual wrapper ops, and there has to be some information sharing between them during lowering. Also consider the case if collapse(2) was present, which would affect the codegen for the inner loop differently.

We already have some lowering (OpenMP+LLVM dialect -> LLVM IR) where the loop transformation is not directly applied to the immediate loop. The best example of this is collapse followed by worksharing (see links below). There will also be a similar one for simd.


Basically, there will be some canonical loop infos that will be processed by various loop transformations (collapse, tile, unroll, etc.). The CanonicalLoopInfo infrastructure in the OpenMPIRBuilder is already set up for this purpose, and I believe it already supports tile, unroll and collapse. So for these transformations, going ahead with the OpenMPIRBuilder approach might be the easiest for you. I would anticipate that the wrapper operation codegen (OpenMP + LLVM dialect -> LLVM IR) will generate canonical loop infos and deposit them on a stack if the op has a parent operation that will further modify them.

@jsjodin
Copy link
Contributor

jsjodin commented Feb 9, 2024

Thanks for the clarification. However, I still can't quite understand how using composite operations to represent distribute-parallel-for in that example vs wrapper ops is any different with regards to this loop transformation problem. If we're able to represent the loop transformations in MLIR, even if we only actually produce these loops in the OMPIRBuilder due to it being the place where loop transformations are eventually applied, we should be able to then apply the corresponding parallelization/worksharing constructs to the resulting top-level loop.

Yes, I think I misunderstood a bit how to interpret the composite ops. If they are simply collapsed versions of the nested ones, both are equivalent. It would also be possible to keep the wrapper ops and have lowering functions that would identify and lower multiple ops in one step.

It seems to me that convertOmpDistParWsLoop could be split up into calls to the equivalent of convertOmpDistribute + convertOmpParallel + convertOmpWsLoop if that's the best way to deal with that composite construct. If anything, it looks like it would give us some flexibility as to how to share information (for e.g. reductions) across these partial translations of the composite construct rather than having to add more state to the OMPIRBuilder.

If we can keep the information sharing within a single lowering function, that would be nice. Adding more state to the OMPIRBuilder is a concern, because it is fairly easy to create an inconsistent state if data is not added and cleared in the right places during lowering. If we decide to use wrapper ops, we could still have lowering functions that handle composite constructs to avoid adding state to the OMPIRBuilder. That way the representation is more flexible with no risk of a combinatorial explosion; the implementation of the lowering would still have this problem to some extent, although it can be minimized by having functions that share the implementation, as you mentioned.

@skatrak
Copy link
Member Author

skatrak commented Feb 14, 2024

Discussion about combined/composite constructs continued here: https://discourse.llvm.org/t/rfc-representing-combined-composite-constructs-in-the-openmp-dialect/76986

@skatrak
Copy link
Member Author

skatrak commented Apr 9, 2024

This patch is replaced by: #87365. The wrapper approach will allow the use of the same omp.simd operation for !$omp simd and !$omp do simd constructs, the difference being the presence of an omp.wsloop wrapper as well.

@skatrak skatrak closed this Apr 9, 2024