[instcombine, sha] Add Instruction Fold for Masked Merge #7145

Open
llvmbot opened this issue Apr 4, 2010 · 25 comments

@llvmbot
Member

llvmbot commented Apr 4, 2010

Bugzilla Link 6773
Version trunk
OS Linux
Blocks llvm/llvm-bugzilla-archive#49930
Reporter LLVM Bugzilla Contributor
CC @asl,@d0k,@lattner,@topperc,@efriedma-quic,@LebedevRI,@RKSimon,@sunfishcode,@nlewycky,@rotateright
Fixed by commit(s) 333106

Extended Description

When experimenting with SHA, I was pleased to notice that LLVM will fold the "Maj" function (x & y) ^ (x & z) ^ (y & z) down to ((z ^ y) & x) ^ (y & z).

The "Ch" function, though, doesn't fold. (x & y) | (~x & z) should become ((y ^ z) & x) ^ z, as mentioned on http://graphics.stanford.edu/~seander/bithacks.html#MaskedMerge (as should (x & y) ^ (~x & z), the version used in the SHA standard).

(If you're wondering at the similarity, Maj(x,y,z) is equivalent to Ch(x, y|z, y&z). Using that implementation with the optimized Ch again gives optimal code from LLVM as it knows to fold (y|z)^(y&z) down to y^z.)
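For reference, a quick exhaustive check over 8-bit values (an illustrative sketch, not part of the original report) confirms these identities:

#include <assert.h>
#include <stdint.h>

/* Exhaustive 8-bit check of the identities:
 *   Ch(x,y,z)  = (x & y) ^ (~x & z)           == ((y ^ z) & x) ^ z
 *   Maj(x,y,z) = (x & y) ^ (x & z) ^ (y & z)  == ((z ^ y) & x) ^ (y & z)
 *   Maj(x,y,z) = Ch(x, y | z, y & z)
 */
int main(void) {
  for (unsigned x = 0; x < 256; ++x)
    for (unsigned y = 0; y < 256; ++y)
      for (unsigned z = 0; z < 256; ++z) {
        uint8_t ch  = (uint8_t)((x & y) ^ (~x & z));
        uint8_t maj = (uint8_t)((x & y) ^ (x & z) ^ (y & z));
        assert(ch  == (uint8_t)(((y ^ z) & x) ^ z));
        assert(maj == (uint8_t)(((z ^ y) & x) ^ (y & z)));
        assert(maj == (uint8_t)((((y | z) ^ (y & z)) & x) ^ (y & z))); /* Ch2(x, y|z, y&z) */
      }
  return 0;
}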

LLVM IR from Bitter Melon:

define i32 @Ch(i32 %x, i32 %y, i32 %z) nounwind readnone {
entry:
  %0 = and i32 %y, %x                             ; <i32> [#uses=1]
  %not = xor i32 %x, -1                           ; <i32> [#uses=1]
  %1 = and i32 %z, %not                           ; <i32> [#uses=1]
  %2 = xor i32 %1, %0                             ; <i32> [#uses=1]
  ret i32 %2
}

define i32 @Maj(i32 %x, i32 %y, i32 %z) nounwind readnone {
entry:
  %0 = xor i32 %z, %y                             ; <i32> [#uses=1]
  %1 = and i32 %0, %x                             ; <i32> [#uses=1]
  %2 = and i32 %z, %y                             ; <i32> [#uses=1]
  %3 = xor i32 %1, %2                             ; <i32> [#uses=1]
  ret i32 %3
}

define i32 @Ch2(i32 %x, i32 %y, i32 %z) nounwind readnone {
entry:
  %0 = xor i32 %z, %y                             ; <i32> [#uses=1]
  %1 = and i32 %0, %x                             ; <i32> [#uses=1]
  %2 = xor i32 %1, %z                             ; <i32> [#uses=1]
  ret i32 %2
}

define i32 @Maj2(i32 %x, i32 %y, i32 %z) nounwind readnone {
entry:
  %0 = and i32 %z, %y                             ; <i32> [#uses=1]
  %1 = xor i32 %z, %y                             ; <i32> [#uses=1]
  %2 = and i32 %1, %x                             ; <i32> [#uses=1]
  %3 = xor i32 %2, %0                             ; <i32> [#uses=1]
  ret i32 %3
}

From the following C source:

int Ch(int x, int y, int z) {
   return (x&y) ^ (~x&z);
}
int Maj(int x, int y, int z) {
   return (x&y) ^ (x&z) ^ (y&z);
}

int Ch2(int x, int y, int z) {
   return z ^ ((y ^ z) & x);
}
int Maj2(int x, int y, int z) {
   return Ch2(x, y|z, y&z);
}
@llvmbot
Member Author

llvmbot commented Apr 4, 2010

assigned to @LebedevRI

@d0k
Member

d0k commented Jul 12, 2010

Thanks for the nice testcase, optimization implemented in r108136.

@d0k
Member

d0k commented Jul 12, 2010

And reverted in r108141 … I hope to fix this later today.

@lattner
Collaborator

lattner commented Jul 12, 2010

Thanks for working on this Benjamin!

@lattner
Collaborator

lattner commented Jul 14, 2010

Benjamin: I haven't tracked down the root issue here, but I strongly suspect that this instcombine is just exposing a bug elsewhere in the compiler. An interesting aspect of this is that (on the clang build) the xform triggers exclusively, or almost exclusively, for cases where C/D are constants. I'm seeing tons of stuff like this:

XFORM: %6 = and i8 %5, -2 ; [#uses=1]
XFORM: %4 = and i8 %2, 1 ; [#uses=1]
XFORM: %8 = or i8 %4, %6 ; [#uses=1]

This is because m_Not is matching on constants, which is dubious. This should certainly be turned off for constants, because the and/or sequence is better for most targets. However, it looks like we're missing the inverse transformation:

$ cat t.c
int test1(int A, int B) {
return (A&-8388481)|(B&8388480);
}

int test2(int A, int B) {
return ((A^B)&8388480)^A;
}
$ llvm-gcc t.c -S -o - -O3 -m64 -fomit-frame-pointer -mkernel
..
_test1:
andl $-8388481, %edi
andl $8388480, %esi
leal (%rsi,%rdi), %eax
ret
...
test2:
xorl %edi, %esi
andl $8388480, %esi
movl %esi, %eax
xorl %edi, %eax
ret

We should turn the latter into the former in the case of a constant. I will continue investigating the latent bug; many of these are triggering on SRoA-created i128's, i256's, etc., and we really want to keep these as and/or and not turn them into xor. However, I want to find and fix the latent bug.
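To make the constant-mask direction concrete, here is an illustrative sketch (not from the original comment) using the same mask as test1/test2 above; 8388480 is 0x7FFF80 and -8388481 is its bitwise complement. All three spellings compute the same value; with a constant mask the or-of-and's form is the one most targets handle best, so the xor form should be rewritten back into it.

int or_of_ands(int A, int B)  { return (A & -8388481) | (B & 8388480); }  /* test1 form */
int xor_of_ands(int A, int B) { return (A & -8388481) ^ (B & 8388480); }  /* same result: the two and's share no bits */
int xor_form(int A, int B)    { return ((A ^ B) & 8388480) ^ A; }         /* test2 form */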

@efriedma-quic
Collaborator

This is because m_Not is matching on constants, which is dubious. This should
certainly be turned off for constants, because the and/or sequence is better
for most targets.

Actually, for armv5, we generate better code for the xor form, but I think I'd consider that a deficiency in the backend.

@d0k
Member

d0k commented Jul 14, 2010

patch v2
The newest version of my patch doesn't transform constants but catches some cases with flipped operands. I don't know if it's worth adding code to catch all possible cases.

@LebedevRI
Member

https://godbolt.org/g/KafbjK
https://rise4fun.com/Alive/AOe9

This looks pretty straight forward, so i could take a look.
Any idea if the issues that resulted in the reverts were resolved?

@rotateright
Contributor

https://godbolt.org/g/KafbjK
https://rise4fun.com/Alive/AOe9

This looks pretty straight forward, so i could take a look.
Any idea if the issues that resulted in the reverts were resolved?

I don't know the history, but I wonder if it would be better to treat this as 2 problems rather than 2 specific folds as in https://reviews.llvm.org/rL108141 .

First, reduce the final xor (and also 'add') to 'or' when the operands have no common bits set:

Name: xor_to_or
%xy = and i32 %x, %y
%notx = xor i32 %x, -1
%nxz = and i32 %notx, %z
%r = xor i32 %xy, %nxz
=>
%r = or i32 %xy, %nxz

Name: add_to_or
%xy = and i32 %x, %y
%notx = xor i32 %x, -1
%nxz = and i32 %notx, %z
%r = add i32 %xy, %nxz
=>
%r = or i32 %xy, %nxz

https://rise4fun.com/Alive/u6f

Inverted operands to 'and' ops might be worth adding to llvm::haveNoCommonBitsSet().
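The equivalence behind both transforms is easy to illustrate in C (a sketch of mine, not from the comment): the two 'and' results use complementary masks, so they never share a set bit, and '|', '^' and '+' all combine them identically.

#include <assert.h>

unsigned merge_or (unsigned x, unsigned y, unsigned z) { return (x & y) | (~x & z); }
unsigned merge_xor(unsigned x, unsigned y, unsigned z) { return (x & y) ^ (~x & z); }
unsigned merge_add(unsigned x, unsigned y, unsigned z) { return (x & y) + (~x & z); }

int main(void) {
  unsigned x = 0x0F0F0F0Fu, y = 0xDEADBEEFu, z = 0xCAFEBABEu;
  assert(((x & y) & (~x & z)) == 0);               /* no common bits set */
  assert(merge_or(x, y, z) == merge_xor(x, y, z)); /* xor_to_or */
  assert(merge_or(x, y, z) == merge_add(x, y, z)); /* add_to_or */
  return 0;
}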

@rotateright
Contributor

Another consideration: does the backend recognize the sequential xor sequence if a target has 'andn' or a bit-select instruction?

Eg, x86 with BMI:
andl %edx, %edi
andnl %esi, %edx, %eax
orl %edi, %eax

The 'not' isn't a separate instruction here, so it should always be better than the alternative (better throughput than sequential xor+and+xor because the 'and' and 'andn' can execute in parallel).

AArch64 has 'bsl':
Bitwise Select. This instruction sets each bit in the destination SIMD&FP register to the corresponding bit from the first source SIMD&FP register when the original destination bit was 1, otherwise from the second source SIMD&FP register.

So I think you have to make sure that the IR optimization doesn't cause a regression for this:

define <4 x i32> @v(<4 x i32> %x, <4 x i32> %y, <4 x i32> %mask) {
%mx = and <4 x i32> %x, %mask
%notmask = xor <4 x i32> %mask, <i32 -1, i32 -1, i32 -1, i32 -1>
%my = and <4 x i32> %y, %notmask
%r = or <4 x i32> %mx, %my
ret <4 x i32> %r
}
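In scalar terms the pattern is a per-bit select; the helper below is a hypothetical reference (not taken from the thread) spelling out what 'bsl' does per lane and what the and/andn/or sequence computes with BMI.

/* Where a mask bit is 1, take the bit from x; otherwise take it from y. */
unsigned bitselect(unsigned mask, unsigned x, unsigned y) {
  return (x & mask) | (y & ~mask);
}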

@LebedevRI
Member

Another consideration: does the backend recognize the sequential xor sequence if a target has 'andn' or a bit-select instruction?

Eg, x86 with BMI:
andl %edx, %edi
andnl %esi, %edx, %eax
orl %edi, %eax

The 'not' isn't a separate instruction here, so it should always be better than the alternative (better throughput than sequential xor+and+xor because the 'and' and 'andn' can execute in parallel).

AArch64 has 'bsl':
Bitwise Select. This instruction sets each bit in the destination SIMD&FP register to the corresponding bit from the first source SIMD&FP register when the original destination bit was 1, otherwise from the second source SIMD&FP register.

So I think you have to make sure that the IR optimization doesn't cause a regression for this:

define <4 x i32> @v(<4 x i32> %x, <4 x i32> %y, <4 x i32> %mask) {
%mx = and <4 x i32> %x, %mask
%notmask = xor <4 x i32> %mask, <i32 -1, i32 -1, i32 -1, i32 -1>
%my = and <4 x i32> %y, %notmask
%r = or <4 x i32> %mx, %my
ret <4 x i32> %r
}

So in other words, the backend (what specifically?) should be able to unfold:
https://rise4fun.com/Alive/HqI
https://godbolt.org/g/skwsvY
?

I think that

@LebedevRI
Member

Eh, meant to post that into llvm/llvm-bugzilla-archive#37104

@rotateright
Contributor

This looks pretty straight forward, so i could take a look.

So this actually turns out to be more complicated. I see at least 5 steps:

  1. DAGCombiner: allow targets with 'andn' functionality to optimize the xor sequence.
  2. DAGCombiner: allow targets with 'bsl' functionality to optimize the xor sequence.
  3. InstCombine: with no common bits set, convert 'xor' to 'or' (comment 8).
  4. InstCombine: with no common bits set, convert 'add' to 'or' (comment 8).
  5. InstCombine: convert masked merge (starting from 'or') to 'xor' sequence (comment 7).

#1 and #2 are prerequisites for #5.
#3 and #4 look like generally good IR canonicalizations to me with no potential backend concerns, but others may see it differently.

@efriedma-quic
Collaborator

We already do add->or and xor->or, IIRC.

@LebedevRI
Member

We already do add->or and xor->or, IIRC.

Based on the testcase output, no.

@efriedma-quic
Collaborator

Based on the testcase output, no.

I know we do it in some cases, but maybe not your testcase, specifically.

@rotateright
Contributor

Based on the testcase output, no.

I know we do it in some cases, but maybe not your testcase, specifically.

Correct - we have haveNoCommonBitsSet() called from visitAdd, but not visitXor. But it doesn't check for the case of a not'ed mask. SimplifyDemandedBits will also transform to 'or', but it can't handle the cases here AFAICT because we don't know the mask bits are constants, just that they're opposite.

@LebedevRI
Member

Committed tests for now https://reviews.llvm.org/rL330003

@LebedevRI
Member

This looks pretty straight forward, so i could take a look.

So this actually turns out to be more complicated. I see at least 5 steps:

  1. DAGCombiner: allow targets with 'andn' functionality to optimize the xor sequence.
  2. DAGCombiner: allow targets with 'bsl' functionality to optimize the xor sequence.
     I guess i will look into those ^ next..
  3. InstCombine: with no common bits set, convert 'xor' to 'or' (comment 8).
     https://reviews.llvm.org/rL330103
  4. InstCombine: with no common bits set, convert 'add' to 'or' (comment 8).
     https://reviews.llvm.org/rL330101
  5. InstCombine: convert masked merge (starting from 'or') to 'xor' sequence (comment 7).

#1 and #2 are prerequisites for #5.
#3 and #4 look like generally good IR canonicalizations to me with no potential backend concerns, but others may see it differently.

  5. Canonicalize the masked merge pattern itself: https://reviews.llvm.org/D45664 https://reviews.llvm.org/D45655

@LebedevRI
Member

Backend is all done.
instcombine part is finally up https://reviews.llvm.org/D46814.

Some further polishing will be needed - when the mask is constant,
canonicalize more of and-of-or's and add/xor-of-and's patterns to or-of-and's.

@LebedevRI
Member

Committed in r333106!

@LebedevRI
Member

And, what do you know, reverted in r333631 :)
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20180528/556578.html

@RKSimon
Collaborator

RKSimon commented Jun 2, 2018

Roman,

This is (partially) reduced from https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=8619,

The oss bug was 'fixed' when you reverted your masked merge patch (D46814/rL333106) at rL333631, but if you intend to re-apply with an updated patch, then please ensure you add a fix for this:

@g = extern_weak global i32

; Function Attrs: nounwind
define void @function() #0 {
entry:
  %A14 = alloca i32
  %L4 = load i32, i32* %A14
  %A9 = alloca i32
  %C26 = icmp ne i32 0, %L4
  %or4 = or i32 or (i32 zext (i1 icmp eq (i32* @g, i32* null) to i32), i32 1), 65536
  %B22 = srem i32 %or4, %or4
  %B26 = or i32 %B22, %B22
  %C5 = icmp sle i32 %L4, 0
  %C9 = icmp eq i32 %B26, 0
  br label %BB

BB:                                               ; preds = %BB, %entry
  %C8 = icmp sge i1 %C9, %C5
  %C6 = icmp uge i1 %C9, %C26
  br i1 undef, label %BB, label %BB2

BB2:                                              ; preds = %BB
  store i1 %C8, i1* undef
  %L13 = load i1, i1* undef
  %B44 = mul i1 %L13, %C6
  store i1 %B44, i1* undef
  store i32* null, i32** undef
  ret void
}

attributes #0 = { nounwind }

@LebedevRI
Member

LebedevRI commented Jun 2, 2018

Roman,

This is (partially) reduced from https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=8619,
The oss bug was 'fixed' when you reverted your masked merge patch (D46814/rL333106) at rL333631,

Thank you for bringing this up.
That assertion failure is quite unexpected.

but if you intend to re-apply with an updated patch, then please ensure you add a fix for this:

The revert reason is completely alien to me,
i don't have a clue how it could possibly be resolved.
Other than sprinkling y with freeze(), once that is here.

So no, i'll be surprised if it'll land again.


@rotateright
Contributor

rotateright commented Jun 3, 2018

The revert reason is completely alien to me,
i don't have a clue how it could possibly be resolved.
Other than sprinkling y with freeze(), once that is here.

So no, i'll be surprised if it'll land again.

This is unfortunate, but I think you're right. We can't do this without freeze.

For reference, the failing case was based on something like this:

unsigned f(unsigned mask, unsigned x, unsigned y) {
  return (mask & x) | (~mask & y);
}

void g(unsigned x) {
  unsigned y;
  unsigned z = f(-1, x, y);
  // use z
}

And apparently that's legal C code...
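To spell out the failure mode (an illustrative sketch, not code from the thread): after the masked-merge fold, f() effectively becomes the xor form below. With mask == -1 the original always returns x, but the folded form reads the uninitialized y, and an undef value need not be consistent between its uses, so the result is no longer guaranteed to equal x. freeze(y) would pin y to a single arbitrary value and make the fold safe.

unsigned f_folded(unsigned mask, unsigned x, unsigned y) {
  return ((x ^ y) & mask) ^ y;   /* equals (mask & x) | (~mask & y) only if y is a stable value */
}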

But as you know, there's still activity/hope for freeze. :)
https://reviews.llvm.org/D29013
https://reviews.llvm.org/D29014

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 3, 2021