
[AArch64] Add Ampere1B scheduling/pipeline model #81338


Conversation

ptomsich (Contributor) commented Feb 9, 2024

The Ampere1B core is enabled with a new scheduling/pipeline model, as it provides significant updates over the Ampere1 core; it reduces latencies on many instructions, has some micro-ops reassigned between the XY and X units, and provides modelling for the instructions added since Ampere1 and Ampere1A.

llvmbot (Member) commented Feb 9, 2024

@llvm/pr-subscribers-backend-aarch64

Author: Philipp Tomsich (ptomsich)

Changes

The Ampere1B core is enabled with a new scheduling/pipeline model, as it provides significant updates over the Ampere1 core; it reduces latencies on many instructions, has some micro-ops reassigned between the XY and X units, and provides modelling for the instructions added since Ampere1 and Ampere1A.


Patch is 43.91 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/81338.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64.td (+2-1)
  • (added) llvm/lib/Target/AArch64/AArch64SchedAmpere1B.td (+1061)
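For context, once this lands the new model is selected whenever codegen targets this core; a minimal sketch of an invocation that would exercise it (test.ll is a placeholder input, not part of this patch):

  $ llc -mtriple=aarch64 -mcpu=ampere1b test.ll -o -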
diff --git a/llvm/lib/Target/AArch64/AArch64.td b/llvm/lib/Target/AArch64/AArch64.td
index 00833b419ec32c..867839fb8c19e5 100644
--- a/llvm/lib/Target/AArch64/AArch64.td
+++ b/llvm/lib/Target/AArch64/AArch64.td
@@ -837,6 +837,7 @@ include "AArch64SchedA64FX.td"
 include "AArch64SchedThunderX3T110.td"
 include "AArch64SchedTSV110.td"
 include "AArch64SchedAmpere1.td"
+include "AArch64SchedAmpere1B.td"
 include "AArch64SchedNeoverseN1.td"
 include "AArch64SchedNeoverseN2.td"
 include "AArch64SchedNeoverseV1.td"
@@ -1720,7 +1721,7 @@ def : ProcessorModel<"ampere1", Ampere1Model, ProcessorFeatures.Ampere1,
 def : ProcessorModel<"ampere1a", Ampere1Model, ProcessorFeatures.Ampere1A,
                      [TuneAmpere1A]>;
 
-def : ProcessorModel<"ampere1b", Ampere1Model, ProcessorFeatures.Ampere1B,
+def : ProcessorModel<"ampere1b", Ampere1BModel, ProcessorFeatures.Ampere1B,
                      [TuneAmpere1B]>;
 
 //===----------------------------------------------------------------------===//
diff --git a/llvm/lib/Target/AArch64/AArch64SchedAmpere1B.td b/llvm/lib/Target/AArch64/AArch64SchedAmpere1B.td
new file mode 100644
index 00000000000000..43da76207ff7de
--- /dev/null
+++ b/llvm/lib/Target/AArch64/AArch64SchedAmpere1B.td
@@ -0,0 +1,1061 @@
+//=- AArch64SchedAmpere1B.td - Ampere-1B scheduling def -----*- tablegen -*-=//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file defines the machine model for the Ampere Computing Ampere-1B to
+// support instruction scheduling and other instruction cost heuristics.
+//
+//===----------------------------------------------------------------------===//
+
+// The Ampere-1B core is an out-of-order micro-architecture.  The front
+// end has branch prediction, with a 10-cycle recovery time from a
+// mispredicted branch.  Instructions coming out of the front end are
+// decoded into internal micro-ops (uops).
+
+def Ampere1BModel : SchedMachineModel {
+  let IssueWidth            =   4;  // 4-way decode and dispatch
+  let MicroOpBufferSize     = 192;  // micro-op re-order buffer size
+  let LoadLatency           =   3;  // Optimistic load latency
+  let MispredictPenalty     =  10;  // Branch mispredict penalty
+  let LoopMicroOpBufferSize =  32;  // Instruction queue size
+  let CompleteModel = 0;
+
+  list<Predicate> UnsupportedFeatures = !listconcat(SVEUnsupported.F,
+                                                    SMEUnsupported.F,
+                                                    PAUnsupported.F);
+}
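+// Note: with CompleteModel = 0, instructions that lack an explicit mapping
+// below fall back to the generic WriteRes defaults rather than triggering a
+// tablegen error for an incomplete model.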
+
+let SchedModel = Ampere1BModel in {
+
+//===----------------------------------------------------------------------===//
+// Define each kind of processor resource and number available on Ampere-1B.
+// Ampere-1B has 12 pipelines that 8 independent schedulers (4 integer, 2 FP,
+// and 2 memory) issue into.  The integer and FP schedulers can each issue
+// one uop per cycle, while the memory schedulers can each issue one load
+// and one store address calculation per cycle.
+
+def Ampere1BUnitA  : ProcResource<2>;  // integer single-cycle, branch, and flags r/w
+def Ampere1BUnitB  : ProcResource<2>;  // integer single-cycle, and complex shifts
+def Ampere1BUnitBS : ProcResource<1>;  // integer multi-cycle
+def Ampere1BUnitL  : ProcResource<2>;  // load
+def Ampere1BUnitS  : ProcResource<2>;  // store address calculation
+def Ampere1BUnitX  : ProcResource<1>;  // FP and vector operations, and flag write
+def Ampere1BUnitY  : ProcResource<1>;  // FP and vector operations, and crypto
+def Ampere1BUnitZ  : ProcResource<1>;  // FP store data and FP-to-integer moves
+
+def Ampere1BUnitAB : ProcResGroup<[Ampere1BUnitA, Ampere1BUnitB]>;
+def Ampere1BUnitXY : ProcResGroup<[Ampere1BUnitX, Ampere1BUnitY]>;
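+
+// A write declared on the Ampere1BUnitXY group may issue to either the X or
+// the Y pipeline, whichever is free in a given cycle; a write declared on
+// Ampere1BUnitX alone must wait for the X pipeline specifically.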
+
+//===----------------------------------------------------------------------===//
+// Define customized scheduler read/write types specific to the Ampere-1B.
+
+def Ampere1BWrite_1cyc_1A : SchedWriteRes<[Ampere1BUnitA]> {
+  let Latency = 1;
+  let NumMicroOps = 1;
+}
+
+def Ampere1BWrite_1cyc_2A : SchedWriteRes<[Ampere1BUnitA, Ampere1BUnitA]> {
+  let Latency = 1;
+  let NumMicroOps = 2;
+}
+
+def Ampere1BWrite_1cyc_1B : SchedWriteRes<[Ampere1BUnitB]> {
+  let Latency = 1;
+  let NumMicroOps = 1;
+}
+
+def Ampere1BWrite_1cyc_1BS : SchedWriteRes<[Ampere1BUnitBS]> {
+  let Latency = 1;
+  let NumMicroOps = 1;
+}
+
+def Ampere1BWrite_1cyc_1BS_1B : SchedWriteRes<[Ampere1BUnitBS, Ampere1BUnitB]> {
+  let Latency = 1;
+  let NumMicroOps = 2;
+}
+
+def Ampere1BWrite_1cyc_1AB : SchedWriteRes<[Ampere1BUnitAB]> {
+  let Latency = 1;
+  let NumMicroOps = 1;
+}
+
+def Ampere1BWrite_1cyc_1AB_1A : SchedWriteRes<[Ampere1BUnitAB, Ampere1BUnitA]> {
+  let Latency = 1;
+  let NumMicroOps = 2;
+}
+
+def Ampere1BWrite_1cyc_1L : SchedWriteRes<[Ampere1BUnitL]> {
+  let Latency = 1;
+  let NumMicroOps = 1;
+}
+
+def Ampere1BWrite_1cyc_1S : SchedWriteRes<[Ampere1BUnitS]> {
+  let Latency = 1;
+  let NumMicroOps = 1;
+}
+
+def Ampere1BWrite_1cyc_2S : SchedWriteRes<[Ampere1BUnitS, Ampere1BUnitS]> {
+  let Latency = 1;
+  let NumMicroOps = 2;
+}
+
+def Ampere1BWrite_2cyc_1Y : SchedWriteRes<[Ampere1BUnitY]> {
+  let Latency = 2;
+  let NumMicroOps = 1;
+}
+
+def Ampere1BWrite_2cyc_2AB : SchedWriteRes<[Ampere1BUnitAB, Ampere1BUnitAB]> {
+  let Latency = 2;
+  let NumMicroOps = 2;
+}
+
+def Ampere1BWrite_2cyc_1B_1AB : SchedWriteRes<[Ampere1BUnitB, Ampere1BUnitAB]> {
+  let Latency = 2;
+  let NumMicroOps = 2;
+}
+
+def Ampere1BWrite_2cyc_1B_1S : SchedWriteRes<[Ampere1BUnitB, Ampere1BUnitS]> {
+  let Latency = 2;
+  let NumMicroOps = 2;
+}
+
+def Ampere1BWrite_2cyc_1B_1S_1AB : SchedWriteRes<[Ampere1BUnitB,
+                                                  Ampere1BUnitS,
+                                                  Ampere1BUnitAB]> {
+  let Latency = 2;
+  let NumMicroOps = 3;
+}
+
+def Ampere1BWrite_2cyc_1XY : SchedWriteRes<[Ampere1BUnitXY]> {
+  let Latency = 2;
+  let NumMicroOps = 1;
+}
+
+def Ampere1BWrite_2cyc_1S_1Z : SchedWriteRes<[Ampere1BUnitS, Ampere1BUnitZ]> {
+  let Latency = 2;
+  let NumMicroOps = 2;
+}
+
+def Ampere1BWrite_3cyc_1BS : SchedWriteRes<[Ampere1BUnitBS]> {
+  let Latency = 3;
+  let NumMicroOps = 1;
+}
+
+def Ampere1BWrite_3cyc_1L : SchedWriteRes<[Ampere1BUnitL]> {
+  let Latency = 3;
+  let NumMicroOps = 1;
+}
+
+def Ampere1BWrite_3cyc_1X : SchedWriteRes<[Ampere1BUnitX]> {
+  let Latency = 3;
+  let NumMicroOps = 1;
+}
+
+def Ampere1BWrite_3cyc_1XY : SchedWriteRes<[Ampere1BUnitXY]> {
+  let Latency = 3;
+  let NumMicroOps = 1;
+}
+
+def Ampere1BWrite_3cyc_1Z : SchedWriteRes<[Ampere1BUnitZ]> {
+  let Latency = 3;
+  let NumMicroOps = 1;
+}
+
+def Ampere1BWrite_3cyc_1S_1Z : SchedWriteRes<[Ampere1BUnitS,
+                                              Ampere1BUnitZ]> {
+  let Latency = 3;
+  let NumMicroOps = 2;
+}
+
+def Ampere1BWrite_3cyc_1S_2Z : SchedWriteRes<[Ampere1BUnitS,
+                                              Ampere1BUnitZ, Ampere1BUnitZ]> {
+  let Latency = 3;
+  let NumMicroOps = 3;
+}
+
+def Ampere1BWrite_3cyc_2S_2Z : SchedWriteRes<[Ampere1BUnitS, Ampere1BUnitS,
+                                              Ampere1BUnitZ, Ampere1BUnitZ]> {
+  let Latency = 3;
+  let NumMicroOps = 4;
+}
+
+def Ampere1BWrite_4cyc_1BS_1AB : SchedWriteRes<[Ampere1BUnitBS, Ampere1BUnitAB]> {
+  let Latency = 4;
+  let NumMicroOps = 2;
+}
+
+def Ampere1BWrite_4cyc_1L : SchedWriteRes<[Ampere1BUnitL]> {
+  let Latency = 4;
+  let NumMicroOps = 1;
+}
+
+def Ampere1BWrite_4cyc_1L_1B : SchedWriteRes<[Ampere1BUnitL, Ampere1BUnitB]> {
+  let Latency = 4;
+  let NumMicroOps = 2;
+}
+
+def Ampere1BWrite_4cyc_1X : SchedWriteRes<[Ampere1BUnitX]> {
+  let Latency = 4;
+  let NumMicroOps = 1;
+}
+
+def Ampere1BWrite_4cyc_1XY : SchedWriteRes<[Ampere1BUnitXY]> {
+  let Latency = 4;
+  let NumMicroOps = 1;
+}
+
+def Ampere1BWrite_4cyc_2XY : SchedWriteRes<[Ampere1BUnitXY, Ampere1BUnitXY]> {
+  let Latency = 4;
+  let NumMicroOps = 2;
+}
+
+def Ampere1BWrite_5cyc_1BS : SchedWriteRes<[Ampere1BUnitBS]> {
+  let Latency = 5;
+  let NumMicroOps = 1;
+}
+
+def Ampere1BWrite_4cyc_1XY_1S_1Z : SchedWriteRes<[Ampere1BUnitXY,
+                                                  Ampere1BUnitS,
+                                                  Ampere1BUnitZ]> {
+  let Latency = 4;
+  let NumMicroOps = 3;
+}
+
+def Ampere1BWrite_5cyc_4S_4Z : SchedWriteRes<[Ampere1BUnitS, Ampere1BUnitS,
+                                              Ampere1BUnitS, Ampere1BUnitS,
+                                              Ampere1BUnitZ, Ampere1BUnitZ,
+                                              Ampere1BUnitZ, Ampere1BUnitZ]> {
+  let Latency = 5;
+  let NumMicroOps = 8;
+}
+
+def Ampere1BWrite_5cyc_1L_1BS : SchedWriteRes<[Ampere1BUnitL,
+                                               Ampere1BUnitBS]> {
+  let Latency = 5;
+  let NumMicroOps = 2;
+}
+
+def Ampere1BWrite_5cyc_3L : SchedWriteRes<[Ampere1BUnitL,
+                                           Ampere1BUnitL,
+                                           Ampere1BUnitL]> {
+  let Latency = 5;
+  let NumMicroOps = 3;
+}
+
+def Ampere1BWrite_5cyc_4L : SchedWriteRes<[Ampere1BUnitL,
+                                           Ampere1BUnitL,
+                                           Ampere1BUnitL,
+                                           Ampere1BUnitL]> {
+  let Latency = 5;
+  let NumMicroOps = 4;
+}
+
+def Ampere1BWrite_5cyc_1X : SchedWriteRes<[Ampere1BUnitX]> {
+  let Latency = 5;
+  let NumMicroOps = 1;
+}
+
+def Ampere1BWrite_5cyc_2XY_2S_2Z : SchedWriteRes<[Ampere1BUnitXY, Ampere1BUnitXY,
+                                                  Ampere1BUnitS,  Ampere1BUnitS,
+                                                  Ampere1BUnitZ,  Ampere1BUnitZ]> {
+  let Latency = 5;
+  let NumMicroOps = 6;
+}
+
+def Ampere1BWrite_6cyc_1BS_1A : SchedWriteRes<[Ampere1BUnitBS, Ampere1BUnitA]> {
+  let Latency = 6;
+  let NumMicroOps = 2;
+}
+
+def Ampere1BWrite_6cyc_1BS_2A : SchedWriteRes<[Ampere1BUnitBS, Ampere1BUnitA,
+                                                               Ampere1BUnitA]> {
+  let Latency = 6;
+  let NumMicroOps = 3;
+}
+
+def Ampere1BWrite_6cyc_1X : SchedWriteRes<[Ampere1BUnitX]> {
+  let Latency = 6;
+  let NumMicroOps = 2;
+}
+
+def Ampere1BWrite_6cyc_2XY : SchedWriteRes<[Ampere1BUnitXY, Ampere1BUnitXY]> {
+  let Latency = 6;
+  let NumMicroOps = 2;
+}
+
+def Ampere1BWrite_6cyc_3XY : SchedWriteRes<[Ampere1BUnitXY, Ampere1BUnitXY,
+                                            Ampere1BUnitXY]> {
+  let Latency = 6;
+  let NumMicroOps = 3;
+}
+
+def Ampere1BWrite_6cyc_2XY_2S_2Z : SchedWriteRes<[Ampere1BUnitXY, Ampere1BUnitXY,
+                                                  Ampere1BUnitS,  Ampere1BUnitS,
+                                                  Ampere1BUnitZ,  Ampere1BUnitZ]> {
+  let Latency = 6;
+  let NumMicroOps = 6;
+}
+
+def Ampere1BWrite_6cyc_3XY_3S_3Z : SchedWriteRes<[Ampere1BUnitXY, Ampere1BUnitXY, Ampere1BUnitXY,
+                                                  Ampere1BUnitS,  Ampere1BUnitS,  Ampere1BUnitS,
+                                                  Ampere1BUnitZ,  Ampere1BUnitZ,  Ampere1BUnitZ]> {
+  let Latency = 6;
+  let NumMicroOps = 9;
+}
+
+def Ampere1BWrite_7cyc_1BS_1XY : SchedWriteRes<[Ampere1BUnitBS, Ampere1BUnitXY]> {
+  let Latency = 7;
+  let NumMicroOps = 2;
+}
+
+def Ampere1BWrite_7cyc_1XY_1Z : SchedWriteRes<[Ampere1BUnitXY, Ampere1BUnitZ]> {
+  let Latency = 7;
+  let NumMicroOps = 2;
+}
+
+def Ampere1BWrite_7cyc_4XY_4S_4Z : SchedWriteRes<[Ampere1BUnitXY, Ampere1BUnitXY,
+                                                  Ampere1BUnitXY, Ampere1BUnitXY,
+                                                  Ampere1BUnitS,  Ampere1BUnitS,
+                                                  Ampere1BUnitS,  Ampere1BUnitS,
+                                                  Ampere1BUnitZ,  Ampere1BUnitZ,
+                                                  Ampere1BUnitZ,  Ampere1BUnitZ]> {
+  let Latency = 7;
+  let NumMicroOps = 12;
+}
+
+def Ampere1BWrite_8cyc_1BS_1L : SchedWriteRes<[Ampere1BUnitBS, Ampere1BUnitL]> {
+  let Latency = 8;
+  let NumMicroOps = 2;
+}
+
+def Ampere1BWrite_8cyc_1BS_1XY : SchedWriteRes<[Ampere1BUnitBS, Ampere1BUnitXY]> {
+  let Latency = 8;
+  let NumMicroOps = 2;
+}
+
+def Ampere1BWrite_8cyc_2XY : SchedWriteRes<[Ampere1BUnitXY, Ampere1BUnitXY]> {
+  let Latency = 8;
+  let NumMicroOps = 2;
+}
+
+def Ampere1BWrite_8cyc_4XY : SchedWriteRes<[Ampere1BUnitXY, Ampere1BUnitXY,
+                                            Ampere1BUnitXY, Ampere1BUnitXY]> {
+  let Latency = 8;
+  let NumMicroOps = 4;
+}
+
+def Ampere1BWrite_9cyc_6XY_4S_4Z : SchedWriteRes<[Ampere1BUnitXY, Ampere1BUnitXY,
+                                                  Ampere1BUnitXY, Ampere1BUnitXY,
+                                                  Ampere1BUnitXY, Ampere1BUnitXY,
+                                                  Ampere1BUnitS,  Ampere1BUnitS,
+                                                  Ampere1BUnitS,  Ampere1BUnitS,
+                                                  Ampere1BUnitZ,  Ampere1BUnitZ,
+                                                  Ampere1BUnitZ,  Ampere1BUnitZ]> {
+  let Latency = 9;
+  let NumMicroOps = 14;
+}
+
+def Ampere1BWrite_9cyc_1A_1BS_1X : SchedWriteRes<[Ampere1BUnitA, Ampere1BUnitBS, Ampere1BUnitX]> {
+  let Latency = 9;
+  let NumMicroOps = 3;
+}
+
+def Ampere1BWrite_9cyc_1A_1BS_1XY : SchedWriteRes<[Ampere1BUnitA, Ampere1BUnitBS, Ampere1BUnitXY]> {
+  let Latency = 9;
+  let NumMicroOps = 3;
+}
+
+def Ampere1BWrite_9cyc_1X : SchedWriteRes<[Ampere1BUnitX]> {
+  let Latency = 9;
+  let NumMicroOps = 1;
+}
+
+def Ampere1BWrite_9cyc_3XY : SchedWriteRes<[Ampere1BUnitXY, Ampere1BUnitXY, Ampere1BUnitXY]> {
+  let Latency = 9;
+  let NumMicroOps = 3;
+}
+
+def Ampere1BWrite_11cyc_1BS_2XY : SchedWriteRes<[Ampere1BUnitBS, Ampere1BUnitXY, Ampere1BUnitXY]> {
+  let Latency = 11;
+  let NumMicroOps = 3;
+}
+
+def Ampere1BWrite_12cyc_1X : SchedWriteRes<[Ampere1BUnitX]> {
+  let Latency = 12;
+  let NumMicroOps = 1;
+}
+
+def Ampere1BWrite_13cyc_1BS_1X : SchedWriteRes<[Ampere1BUnitBS, Ampere1BUnitX]> {
+  let Latency = 13;
+  let NumMicroOps = 2;
+}
+
+def Ampere1BWrite_17cyc_1X : SchedWriteRes<[Ampere1BUnitX]> {
+  let Latency = 17;
+  let NumMicroOps = 1;
+}
+
+def Ampere1BWrite_19cyc_2BS_1X : SchedWriteRes<[Ampere1BUnitBS,
+                                                Ampere1BUnitBS,
+                                                Ampere1BUnitX]> {
+  let Latency = 19;
+  let NumMicroOps = 3;
+}
+
+def Ampere1BWrite_19cyc_1X : SchedWriteRes<[Ampere1BUnitX]> {
+  let Latency = 19;
+  let NumMicroOps = 1;
+}
+
+def Ampere1BWrite_21cyc_1X : SchedWriteRes<[Ampere1BUnitX]> {
+  let Latency = 21;
+  let NumMicroOps = 1;
+}
+
+def Ampere1BWrite_33cyc_1X : SchedWriteRes<[Ampere1BUnitX]> {
+  let Latency = 33;
+  let NumMicroOps = 1;
+}
+
+def Ampere1BWrite_39cyc_1X : SchedWriteRes<[Ampere1BUnitX]> {
+  let Latency = 39;
+  let NumMicroOps = 1;
+}
+
+def Ampere1BWrite_63cyc_1X : SchedWriteRes<[Ampere1BUnitX]> {
+  let Latency = 63;
+  let NumMicroOps = 1;
+}
+
+// For basic arithmetic, we have more flexibility for short shifts (LSL shift <= 4),
+// which are a single uop, and for extended registers, which have full flexibility
+// across Unit A or B for both uops.
+def Ampere1BWrite_Arith : SchedWriteVariant<[
+                                SchedVar<RegExtendedPred, [Ampere1BWrite_2cyc_2AB]>,
+                                SchedVar<IsCheapLSL,      [Ampere1BWrite_1cyc_1AB]>,
+                                SchedVar<NoSchedPred,     [Ampere1BWrite_2cyc_1B_1AB]>]>;
+
+def Ampere1BWrite_ArithFlagsetting : SchedWriteVariant<[
+                                SchedVar<RegExtendedPred, [Ampere1BWrite_2cyc_2AB]>,
+                                SchedVar<IsCheapLSL,      [Ampere1BWrite_1cyc_1AB]>,
+                                SchedVar<NoSchedPred,     [Ampere1BWrite_2cyc_1B_1AB]>]>;
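+
+// For example, `add x0, x1, x2, lsl #3` satisfies IsCheapLSL and stays a
+// single 1-cycle micro-op on either Unit A or B, whereas `add x0, x1, x2,
+// lsl #5` is cracked into a Unit B shift micro-op plus a second micro-op
+// on Unit A or B.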
+
+//===----------------------------------------------------------------------===//
+// Map the target-defined scheduler read/write resources and latencies for Ampere-1B.
+// This provides a coarse model, which is then specialised below.
+
+def : WriteRes<WriteImm,   [Ampere1BUnitAB]>;  // MOVN, MOVZ
+def : WriteRes<WriteI,     [Ampere1BUnitAB]>;  // ALU
+def : WriteRes<WriteISReg, [Ampere1BUnitB, Ampere1BUnitAB]> {
+  let Latency = 2;
+  let NumMicroOps = 2;
+}  // ALU of Shifted-Reg
+def : WriteRes<WriteIEReg, [Ampere1BUnitAB, Ampere1BUnitAB]> {
+  let Latency = 2;
+  let NumMicroOps = 2;
+}  // ALU of Extended-Reg
+def : WriteRes<WriteExtr,  [Ampere1BUnitB]>;  // EXTR shifts a reg pair
+def : WriteRes<WriteIS,    [Ampere1BUnitB]>;  // Shift/Scale
+def : WriteRes<WriteID32,  [Ampere1BUnitBS, Ampere1BUnitX]> {
+  let Latency = 13;
+}  // 32-bit Divide
+def : WriteRes<WriteID64,  [Ampere1BUnitBS, Ampere1BUnitX]> {
+  let Latency = 19;
+}  // 64-bit Divide
+def : WriteRes<WriteIM32,  [Ampere1BUnitBS]> {
+  let Latency = 3;
+}  // 32-bit Multiply
+def : WriteRes<WriteIM64,  [Ampere1BUnitBS, Ampere1BUnitAB]> {
+  let Latency = 3;
+}  // 64-bit Multiply
+def : WriteRes<WriteBr,    [Ampere1BUnitA]>;
+def : WriteRes<WriteBrReg, [Ampere1BUnitA, Ampere1BUnitA]>;
+def : WriteRes<WriteLD,    [Ampere1BUnitL]> {
+  let Latency = 3;
+}  // Load from base addr plus immediate offset
+def : WriteRes<WriteST,    [Ampere1BUnitS]> {
+  let Latency = 1;
+}  // Store to base addr plus immediate offset
+def : WriteRes<WriteSTP,   [Ampere1BUnitS, Ampere1BUnitS]> {
+  let Latency = 1;
+  let NumMicroOps = 1;
+}  // Store a register pair.
+def : WriteRes<WriteAdr,   [Ampere1BUnitAB]>;
+def : WriteRes<WriteLDIdx, [Ampere1BUnitAB, Ampere1BUnitS]> {
+  let Latency = 3;
+  let NumMicroOps = 1;
+}  // Load from a register index (maybe scaled).
+def : WriteRes<WriteSTIdx, [Ampere1BUnitS, Ampere1BUnitS]> {
+  let Latency = 1;
+  let NumMicroOps = 2;
+}  // Store to a register index (maybe scaled).
+def : WriteRes<WriteF,  [Ampere1BUnitXY]> {
+  let Latency = 2;
+}  // General floating-point ops.
+def : WriteRes<WriteFCmp,  [Ampere1BUnitX]> {
+  let Latency = 3;
+}  // Floating-point compare.
+def : WriteRes<WriteFCvt,  [Ampere1BUnitXY]> {
+  let Latency = 3;
+}  // Float conversion.
+def : WriteRes<WriteFCopy, [Ampere1BUnitXY]> {
+}  // Float-int register copy.
+def : WriteRes<WriteFImm,  [Ampere1BUnitXY]> {
+  let Latency = 2;
+}  // Floating-point immediate.
+def : WriteRes<WriteFMul,  [Ampere1BUnitXY]> {
+  let Latency = 4;
+}  // Floating-point multiply.
+def : WriteRes<WriteFDiv,  [Ampere1BUnitXY]> {
+  let Latency = 19;
+}  // Floating-point division.
+def : WriteRes<WriteVd,    [Ampere1BUnitXY]> {
+  let Latency = 3;
+}  // 64bit Vector D ops.
+def : WriteRes<WriteVq,    [Ampere1BUnitXY]> {
+  let Latency = 3;
+}  // 128bit Vector Q ops.
+def : WriteRes<WriteVLD,   [Ampere1BUnitL, Ampere1BUnitL]> {
+  let Latency = 4;
+}  // Vector loads.
+def : WriteRes<WriteVST,   [Ampere1BUnitS, Ampere1BUnitZ]> {
+  let Latency = 2;
+}  // Vector stores.
+
+def : WriteRes<WriteAtomic,  []> { let Unsupported = 1; }
+
+def : WriteRes<WriteSys,     []> { let Latency = 1; }
+def : WriteRes<WriteBarrier, []> { let Latency = 1; }
+def : WriteRes<WriteHint,    []> { let Latency = 1; }
+
+def : WriteRes<WriteLDHi,    []> {
+  let Latency = 3;
+}  // The second register of a load-pair: LDP,LDPSW,LDNP,LDXP,LDAXP
+
+// Forwarding logic.
+def : ReadAdvance<ReadI,       0>;
+def : ReadAdvance<ReadISReg,   0>;
+def : ReadAdvance<ReadIEReg,   0>;
+def : ReadAdvance<ReadIM,      0>;
+def : ReadAdvance<ReadIMA,     1, [WriteIM32, WriteIM64]>;
+def : ReadAdvance<ReadID,      0>;
+def : ReadAdvance<ReadExtrHi,  0>;
+def : ReadAdvance<ReadST,      0>;
+def : ReadAdvance<ReadAdrBase, 0>;
+def : ReadAdvance<ReadVLD,     0>;
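+
+// For example, the ReadAdvance of 1 on ReadIMA means that the accumulator
+// operand of a multiply-accumulate may be read one cycle late when produced
+// by a preceding WriteIM32/WriteIM64, so back-to-back MADD chains see an
+// effective latency of 2 cycles rather than 3.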
+
+//===----------------------------------------------------------------------===//
+// Specialising the scheduling model further for Ampere-1B.
+
+def : InstRW<[Ampere1BWrite_1cyc_1AB], (instrs COPY)>;
+
+// Branch instructions
+def : InstRW<[Ampere1BWrite_1cyc_1A], (instrs Bcc, BL, RET)>;
+def : InstRW<[Ampere1BWrite_1cyc_1A],
+        (instrs CBZW, CBZX, CBNZW, CBNZX, TBZW, TBZX, TBNZW, TBNZX)>;
+def : InstRW<[Ampere1BWrite_1cyc_2A], (instrs BLR)>;
+
+// Common Short Sequence Compression (CSSC)
+def : InstRW<[Ampere1BWrite_1cyc_1AB], (instrs ABS)>;
+def...
[truncated]

jthackray (Contributor) left a comment

Looks neat and tidy. Can't see anything obviously wrong. LGTM.

ptomsich (Contributor, Author) commented Feb 9, 2024

> Looks neat and tidy. Can't see anything obviously wrong. LGTM.

Thanks! Guess it shows that this isn't our first scheduling model ;-)

@ptomsich ptomsich merged commit 0144011 into llvm:main Feb 9, 2024
jthackray (Contributor) commented

Er, the build appears to have failed, but you have merged?

ptomsich (Contributor, Author) commented Feb 9, 2024

> Er, the build appears to have failed, but you have merged?

The build looked ok on my checkout. I am rebuilding again to confirm once more.

jthackray (Contributor) commented

Thanks, just thought I'd check. Going to bed now.

nico (Contributor) commented Feb 9, 2024

http://45.33.8.238/linux/130453/step_4.txt

Did you try compiling before committing?

ptomsich (Contributor, Author) commented Feb 9, 2024

> http://45.33.8.238/linux/130453/step_4.txt
>
> Did you try compiling before committing?

Ninja didn't pick up the change of the .td file. I have the fix (a missing ',') ready, but am rerunning a clean rebuild.

ptomsich added a commit that referenced this pull request Feb 9, 2024
ptomsich (Contributor, Author) commented Feb 9, 2024

Pushed a revert and will reopen this PR.

nico (Contributor) commented Feb 10, 2024

> > http://45.33.8.238/linux/130453/step_4.txt
> > Did you try compiling before committing?
>
> Ninja didn't pick up the change of the .td file. I have the fix (a missing ',') ready, but am rerunning a clean rebuild.

How can that happen? Does tblgen not write depfiles for included .td files?

ptomsich added a commit that referenced this pull request Feb 14, 2024
The Ampere1B core is enabled with a new scheduling/pipeline model, as it
provides significant updates over the Ampere1 core; it reduces latencies
on many instructions, has some micro-ops reassigned between the XY and X
units, and provides modelling for the instructions added since Ampere1
and Ampere1A.

As this is the first model implementing the CSSC instructions, we update
the UnsupportedFeatures on all other models (that have CompleteModel
set).
    
Testcases are added under llvm-mca: these showed the FullFP16 feature
missing, so we are adding it in as part of this commit.

This *adds tests and additional fixes* compared to the reverted #81338.
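For a sense of the shape of such an llvm-mca testcase, a minimal sketch (the RUN line follows the usual lit conventions; the instructions are illustrative CSSC examples, not copied from the commit):

  # RUN: llvm-mca -mtriple=aarch64 -mcpu=ampere1b -instruction-tables < %s | FileCheck %s
  abs  x1, x2
  smax x3, x4, x5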
llvmbot pushed a commit to llvmbot/llvm-project that referenced this pull request Feb 15, 2024
(cherry picked from commit dd1897c)
llvmbot pushed a commit to llvmbot/llvm-project that referenced this pull request Feb 27, 2024
(cherry picked from commit dd1897c)