[IR][LangRef] Add partial reduction add intrinsic #94499


Status: Merged (12 commits, Jul 4, 2024)
31 changes: 31 additions & 0 deletions llvm/docs/LangRef.rst
@@ -19209,6 +19209,37 @@ will be on any later loop iteration.
This intrinsic will only return 0 if the input count is also 0. A non-zero input
count will produce a non-zero result.

'``llvm.experimental.vector.partial.reduce.add.*``' Intrinsic
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""
This is an overloaded intrinsic.

::

declare <4 x i32> @llvm.experimental.vector.partial.reduce.add.v4i32.v4i32.v8i32(<4 x i32> %a, <8 x i32> %b)
declare <4 x i32> @llvm.experimental.vector.partial.reduce.add.v4i32.v4i32.v16i32(<4 x i32> %a, <16 x i32> %b)
declare <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv4i32.nxv4i32.nxv8i32(<vscale x 4 x i32> %a, <vscale x 8 x i32> %b)
declare <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv4i32.nxv4i32.nxv16i32(<vscale x 4 x i32> %a, <vscale x 16 x i32> %b)

Overview:
"""""""""

The '``llvm.experimental.vector.partial.reduce.add.*``' intrinsics reduce the
concatenation of the two vector operands down to the number of elements dictated
by the result type. The result type is a vector type that matches the type of the
first operand vector.
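
For example, a minimal sketch of one possible expansion of the fixed-width
``v4i32.v4i32.v8i32`` overload (the value names are illustrative, and the
intrinsic does not prescribe the order of the intermediate additions):

::

      ; One valid expansion of
      ;   %r = call <4 x i32> @llvm.experimental.vector.partial.reduce.add.v4i32.v4i32.v8i32(<4 x i32> %a, <8 x i32> %b)
      ; the second operand is split into accumulator-sized halves, which are
      ; then added lane-wise into the accumulator.
      %lo = shufflevector <8 x i32> %b, <8 x i32> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
      %hi = shufflevector <8 x i32> %b, <8 x i32> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
      %t = add <4 x i32> %a, %lo
      %r = add <4 x i32> %t, %hi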

Arguments:
""""""""""

Both arguments must be vectors of matching element types. The first argument type must
match the result type, while the second argument type must have a vector length that is a
positive integer multiple of the length of the first argument/result type. The arguments
must be either both fixed-width or both scalable vectors.
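
For illustration, declarations that do and do not satisfy these rules (the mixed
fixed/scalable overload below is hypothetical, shown only as a counter-example):

::

      ; Valid: 8 is a positive integer multiple of 4.
      declare <4 x i32> @llvm.experimental.vector.partial.reduce.add.v4i32.v4i32.v8i32(<4 x i32>, <8 x i32>)
      ; Invalid: mixes a fixed-width accumulator with a scalable second operand.
      declare <4 x i32> @llvm.experimental.vector.partial.reduce.add.v4i32.v4i32.nxv8i32(<4 x i32>, <vscale x 8 x i32>)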


'``llvm.experimental.vector.histogram.*``' Intrinsic
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

6 changes: 6 additions & 0 deletions llvm/include/llvm/IR/Intrinsics.td
@@ -2635,6 +2635,12 @@ def int_vector_deinterleave2 : DefaultAttrsIntrinsic<[LLVMHalfElementsVectorType
[llvm_anyvector_ty],
[IntrNoMem]>;

//===-------------- Intrinsics to perform partial reduction ---------------===//

def int_experimental_vector_partial_reduce_add : DefaultAttrsIntrinsic<[LLVMMatchType<0>],
[llvm_anyvector_ty, llvm_anyvector_ty],
Collaborator:

I think adding a new matcher class to constrain the second parameter to the restrictions you defined in the langref would be helpful (same element type, width an integer multiple).

Collaborator:

Given this is an experimental intrinsic, is it worth implementing that plumbing?

Also, the matcher classes typically exist to allow for fewer explicit types when creating a call, which in this instance is not possible because both vector lengths are unknown (or, to put it another way, there's no 1-1 link between them).

Personally I think the verifier route is better, plus it allows for a more user-friendly error message.

[IntrNoMem]>;

//===----------------- Pointer Authentication Intrinsics ------------------===//
//

32 changes: 32 additions & 0 deletions llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
@@ -104,6 +104,7 @@
#include "llvm/TargetParser/Triple.h"
#include "llvm/Transforms/Utils/Local.h"
#include <cstddef>
#include <deque>
#include <iterator>
#include <limits>
#include <optional>
@@ -7914,6 +7915,37 @@ void SelectionDAGBuilder::visitIntrinsicCall(const CallInst &I,
setValue(&I, Trunc);
return;
}
case Intrinsic::experimental_vector_partial_reduce_add: {
Collaborator:

I think we can pass this through as an INTRINSIC_WO_CHAIN node, at least for targets that support it.

Collaborator:

We need to be careful because I don't think common code exists to type legalise arbitrary INTRINSIC_WO_CHAIN calls (given their nature). Presumably we'll just follow the precedent set for get.active.lane.mask and cttz.elts when we add AArch64 specific lowering.

I can't help but think at some point we'll just want to remove the "same element type" restriction of VECREDUCE_ADD and have explicit signed and unsigned versions, like we have for ABDS/ABDU, but I guess we can see how things work out (again, much as we are for the intrinsics mentioned before).

SDValue OpNode = getValue(I.getOperand(1));
EVT ReducedTy = EVT::getEVT(I.getType());
EVT FullTy = OpNode.getValueType();

unsigned Stride = ReducedTy.getVectorMinNumElements();
unsigned ScaleFactor = FullTy.getVectorMinNumElements() / Stride;

// Collect all of the subvectors
std::deque<SDValue> Subvectors;
Subvectors.push_back(getValue(I.getOperand(0)));
for (unsigned i = 0; i < ScaleFactor; i++) {
auto SourceIndex = DAG.getVectorIdxConstant(i * Stride, sdl);
Subvectors.push_back(DAG.getNode(ISD::EXTRACT_SUBVECTOR, sdl, ReducedTy,
{OpNode, SourceIndex}));
}

// Flatten the subvector tree
while (Subvectors.size() > 1) {
Subvectors.push_back(DAG.getNode(ISD::ADD, sdl, ReducedTy,
{Subvectors[0], Subvectors[1]}));
Subvectors.pop_front();
Subvectors.pop_front();
}

assert(Subvectors.size() == 1 &&
"There should only be one subvector after tree flattening");

setValue(&I, Subvectors[0]);
return;
}
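
For the ``nxv4i32.nxv4i32.nxv16i32`` overload (``Stride`` = 4, ``ScaleFactor`` = 4),
a sketch of the IR-level equivalent of this expansion; the ``%s``/``%t`` names are
illustrative, and the add order follows the deque (pop the front two, push the sum
to the back):

::

      %s0 = call <vscale x 4 x i32> @llvm.vector.extract.nxv4i32.nxv16i32(<vscale x 16 x i32> %b, i64 0)
      %s1 = call <vscale x 4 x i32> @llvm.vector.extract.nxv4i32.nxv16i32(<vscale x 16 x i32> %b, i64 4)
      %s2 = call <vscale x 4 x i32> @llvm.vector.extract.nxv4i32.nxv16i32(<vscale x 16 x i32> %b, i64 8)
      %s3 = call <vscale x 4 x i32> @llvm.vector.extract.nxv4i32.nxv16i32(<vscale x 16 x i32> %b, i64 12)
      ; Deque starts as [%acc, %s0, %s1, %s2, %s3].
      %t0 = add <vscale x 4 x i32> %acc, %s0   ; -> [%s1, %s2, %s3, %t0]
      %t1 = add <vscale x 4 x i32> %s1, %s2    ; -> [%s3, %t0, %t1]
      %t2 = add <vscale x 4 x i32> %s3, %t0    ; -> [%t1, %t2]
      %r = add <vscale x 4 x i32> %t1, %t2     ; -> [%r]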
case Intrinsic::experimental_cttz_elts: {
auto DL = getCurSDLoc();
SDValue Op = getValue(I.getOperand(0));
14 changes: 14 additions & 0 deletions llvm/lib/IR/Verifier.cpp
@@ -6131,6 +6131,20 @@ void Verifier::visitIntrinsicCall(Intrinsic::ID ID, CallBase &Call) {
}
break;
}
case Intrinsic::experimental_vector_partial_reduce_add: {
Collaborator:

I guess my matcher class suggestion would remove the need for this code.

Collaborator:

See above for my 2c.

VectorType *AccTy = cast<VectorType>(Call.getArgOperand(0)->getType());
VectorType *VecTy = cast<VectorType>(Call.getArgOperand(1)->getType());

unsigned VecWidth = VecTy->getElementCount().getKnownMinValue();
unsigned AccWidth = AccTy->getElementCount().getKnownMinValue();

Check((VecWidth % AccWidth) == 0,
"Invalid vector widths for partial "
"reduction. The width of the input vector "
"must be a positive integer multiple of "
"the width of the accumulator vector.");
break;
}
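
As an illustration, a call through a hypothetical ``v4i32.v4i32.v6i32`` overload
would trip this check, since 6 is not a multiple of 4:

::

      ; Fails the width check above: 6 % 4 != 0.
      %bad = call <4 x i32> @llvm.experimental.vector.partial.reduce.add.v4i32.v4i32.v6i32(<4 x i32> %acc, <6 x i32> %v)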
case Intrinsic::experimental_noalias_scope_decl: {
NoAliasScopeDecls.push_back(cast<IntrinsicInst>(&Call));
break;
83 changes: 83 additions & 0 deletions llvm/test/CodeGen/AArch64/partial-reduction-add.ll
@@ -0,0 +1,83 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
; RUN: llc -force-vector-interleave=1 -o - %s | FileCheck %s

target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64-none-unknown-elf"

define <4 x i32> @partial_reduce_add_fixed(<4 x i32> %accumulator, <4 x i32> %0) #0 {
; CHECK-LABEL: partial_reduce_add_fixed:
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: add v0.4s, v0.4s, v1.4s
; CHECK-NEXT: ret
entry:
%partial.reduce = call <4 x i32> @llvm.experimental.vector.partial.reduce.add.v4i32.v4i32.v4i32(<4 x i32> %accumulator, <4 x i32> %0)
ret <4 x i32> %partial.reduce
}

define <4 x i32> @partial_reduce_add_fixed_half(<4 x i32> %accumulator, <8 x i32> %0) #0 {
; CHECK-LABEL: partial_reduce_add_fixed_half:
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: add v0.4s, v0.4s, v1.4s
; CHECK-NEXT: add v0.4s, v2.4s, v0.4s
; CHECK-NEXT: ret
entry:
%partial.reduce = call <4 x i32> @llvm.experimental.vector.partial.reduce.add.v4i32.v4i32.v8i32(<4 x i32> %accumulator, <8 x i32> %0)
ret <4 x i32> %partial.reduce
}

define <vscale x 4 x i32> @partial_reduce_add(<vscale x 4 x i32> %accumulator, <vscale x 4 x i32> %0) #0 {
; CHECK-LABEL: partial_reduce_add:
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: add z0.s, z0.s, z1.s
; CHECK-NEXT: ret
entry:
%partial.reduce = call <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv4i32.nxv4i32.nxv4i32(<vscale x 4 x i32> %accumulator, <vscale x 4 x i32> %0)
ret <vscale x 4 x i32> %partial.reduce
}

define <vscale x 4 x i32> @partial_reduce_add_half(<vscale x 4 x i32> %accumulator, <vscale x 8 x i32> %0) #0 {
; CHECK-LABEL: partial_reduce_add_half:
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: add z0.s, z0.s, z1.s
; CHECK-NEXT: add z0.s, z2.s, z0.s
; CHECK-NEXT: ret
entry:
%partial.reduce = call <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv4i32.nxv4i32.nxv8i32(<vscale x 4 x i32> %accumulator, <vscale x 8 x i32> %0)
ret <vscale x 4 x i32> %partial.reduce
}

define <vscale x 4 x i32> @partial_reduce_add_quart(<vscale x 4 x i32> %accumulator, <vscale x 16 x i32> %0) #0 {
Collaborator:

This is reducing into the first 4 elements of the accumulator; it doesn't work correctly with vscale.

; CHECK-LABEL: partial_reduce_add_quart:
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: add z0.s, z0.s, z1.s
; CHECK-NEXT: add z2.s, z2.s, z3.s
; CHECK-NEXT: add z0.s, z4.s, z0.s
; CHECK-NEXT: add z0.s, z2.s, z0.s
; CHECK-NEXT: ret
entry:
%partial.reduce = call <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv4i32.nxv4i32.nxv16i32(<vscale x 4 x i32> %accumulator, <vscale x 16 x i32> %0)
ret <vscale x 4 x i32> %partial.reduce
}

define <vscale x 8 x i32> @partial_reduce_add_half_8(<vscale x 8 x i32> %accumulator, <vscale x 16 x i32> %0) #0 {
; CHECK-LABEL: partial_reduce_add_half_8:
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: add z0.s, z0.s, z2.s
; CHECK-NEXT: add z1.s, z1.s, z3.s
; CHECK-NEXT: add z0.s, z4.s, z0.s
; CHECK-NEXT: add z1.s, z5.s, z1.s
; CHECK-NEXT: ret
entry:
%partial.reduce = call <vscale x 8 x i32> @llvm.experimental.vector.partial.reduce.add.nxv8i32.nxv8i32.nxv16i32(<vscale x 8 x i32> %accumulator, <vscale x 16 x i32> %0)
ret <vscale x 8 x i32> %partial.reduce
}

declare <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv4i32.nxv4i32.nxv4i32(<vscale x 4 x i32>, <vscale x 4 x i32>)
declare <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv4i32.nxv4i32.nxv8i32(<vscale x 4 x i32>, <vscale x 8 x i32>)
declare <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv4i32.nxv4i32.nxv16i32(<vscale x 4 x i32>, <vscale x 16 x i32>)
declare <vscale x 8 x i32> @llvm.experimental.vector.partial.reduce.add.nxv8i32.nxv8i32.nxv16i32(<vscale x 8 x i32>, <vscale x 16 x i32>)

declare i32 @llvm.vector.reduce.add.nxv4i32(<vscale x 4 x i32>)
declare i32 @llvm.vector.reduce.add.nxv8i32(<vscale x 8 x i32>)

attributes #0 = { "target-features"="+sve2" }
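
For context, a minimal sketch (not part of this patch) of how a vectorized
reduction loop might use the intrinsic, keeping a ``<4 x i32>`` partial
accumulator inside the loop and collapsing it to a scalar once at the exit; the
function name and the assumption that ``%n`` is a non-zero multiple of 8 are
illustrative:

::

      define i32 @sum(ptr %p, i64 %n) {
      entry:
        br label %loop

      loop:
        %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
        %acc = phi <4 x i32> [ zeroinitializer, %entry ], [ %acc.next, %loop ]
        %addr = getelementptr inbounds i32, ptr %p, i64 %i
        %wide = load <8 x i32>, ptr %addr, align 4
        ; Fold eight new elements into the four-lane accumulator.
        %acc.next = call <4 x i32> @llvm.experimental.vector.partial.reduce.add.v4i32.v4i32.v8i32(<4 x i32> %acc, <8 x i32> %wide)
        %i.next = add nuw i64 %i, 8
        %done = icmp uge i64 %i.next, %n
        br i1 %done, label %exit, label %loop

      exit:
        ; One full reduction at the end instead of one per iteration.
        %sum = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %acc.next)
        ret i32 %sum
      }

      declare <4 x i32> @llvm.experimental.vector.partial.reduce.add.v4i32.v4i32.v8i32(<4 x i32>, <8 x i32>)
      declare i32 @llvm.vector.reduce.add.v4i32(<4 x i32>)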