
Missed optimization: per-byte comparisons are not always combined #117853


Open

purplesyringa opened this issue Nov 27, 2024 · 8 comments · May be fixed by #133817

Comments

@purplesyringa

Reproducer: https://godbolt.org/z/6caGGoo1e

int is_all_ones(unsigned char *p) {
    unsigned char a = p[0];
    unsigned char b = p[1];
    return a == 255 && b == 255;
}

I expected this to compile to a single read + comparison on platforms that support unaligned accesses. Instead, this happened:

is_all_ones:
        movzx   ecx, byte ptr [rdi + 1]
        and     cl, byte ptr [rdi]
        xor     eax, eax
        cmp     cl, -1
        sete    al
        ret
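
In other words, I'd expect codegen equivalent to this hand-combined version (a sketch, function name mine; the memcpy keeps the unaligned two-byte load well-defined, and comparing against 0xFFFF is endianness-agnostic since both bytes are equal):

#include <stdint.h>
#include <string.h>

int is_all_ones_combined(unsigned char *p) {
    uint16_t v;
    memcpy(&v, p, sizeof v); /* lowered to one (possibly unaligned) 16-bit load */
    return v == 0xFFFF;      /* one comparison instead of two */
}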
purplesyringa changed the title from "Missed optimization: per-byte comparisons are not always combined properly" to "Missed optimization: per-byte comparisons are not always combined" on Nov 27, 2024
@RKSimon
Collaborator

RKSimon commented Nov 27, 2024

define range(i32 0, 2) i32 @is_all_ones(ptr nocapture noundef readonly %p) {
entry:
  %0 = load i8, ptr %p, align 1
  %arrayidx1 = getelementptr inbounds i8, ptr %p, i64 1
  %1 = load i8, ptr %arrayidx1, align 1
  %cmp = icmp eq i8 %0, -1
  %cmp3 = icmp eq i8 %1, -1
  %2 = select i1 %cmp, i1 %cmp3, i1 false
  %conv4 = zext i1 %2 to i32
  ret i32 %conv4
}

@PhilippRados
Contributor

@purplesyringa so you mean emitting only a single word-sized read when both elements are adjacent like this?

_is_all_ones:
        movzx  ecx, word ptr [rdi]
        xor    eax, eax
        cmp    cx, -1
        sete   al
        ret

@purplesyringa
Author

Yes (obviously, with arbitrary constants, not just -1). Ideally,

int is_all_ones(unsigned char *p, unsigned char *q) {
    return (p[0] == q[0]) & (p[1] == q[1]);
}

would compile to a single comparison, too.
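
For the two-pointer case, the hand-combined form I have in mind would look roughly like this (a sketch, names mine):

#include <stdint.h>
#include <string.h>

int is_equal2(unsigned char *p, unsigned char *q) {
    uint16_t a, b;
    memcpy(&a, p, sizeof a); /* one 16-bit load covering p[0..1] */
    memcpy(&b, q, sizeof b); /* one 16-bit load covering q[0..1] */
    return a == b;           /* single word comparison */
}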

@PhilippRados
Contributor

@RKSimon Do you have a suggestion for where this should be implemented (or whether it should be at all)? I saw that there was once a dedicated LoadCombine pass, but it was removed.

According to this discussion, merging adjacent loads was still being performed (as of 2019), but I couldn't pinpoint where exactly (InstCombine, MemCpyOptimizer, SelectionDAG?).

@nikic
Contributor

nikic commented Jan 26, 2025

In the middle-end, a possible place to do it would be AggressiveInstCombine, which already has a transform for merging loads (but not targeting comparisons).
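
For reference, the existing load-merging fold there targets the classic or-of-shifted-byte-loads idiom, roughly like this (a sketch, not taken from the pass's tests; name mine):

unsigned read_le16(unsigned char *p) {
    /* AggressiveInstCombine can fold the two byte loads below into one
       16-bit load on targets with fast unaligned access; the icmp-based
       pattern from this issue is not matched. */
    return (unsigned)p[0] | ((unsigned)p[1] << 8);
}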

@nikic
Contributor

nikic commented Jan 26, 2025

We could also view this in terms of forming a memcmp (which later gets re-expanded) via MergeICmps. I think that pass may be more convenient in terms of existing infrastructure for matching such patterns (including the non-trivial cases spread across many blocks).
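
The canonical shape MergeICmps handles is a chain of field compares spread across blocks by short-circuiting, e.g. (a sketch, names mine):

struct S { int a, b, c; };

int same(const struct S *x, const struct S *y) {
    /* && splits this into one block per compare; MergeICmps recognizes
       the chain and emits a single memcmp over the 12 contiguous bytes
       (no padding between the int fields here). */
    return x->a == y->a && x->b == y->b && x->c == y->c;
}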

@RKSimon
Collaborator

RKSimon commented Jan 27, 2025

FTR, if this had been 4 or more (power-of-2) comparisons, SLP would have caught it, but it avoids creating reductions of 2-element vectors.

int is_all_ones(unsigned char *p, unsigned char *q) {
    return (p[0] == q[0]) & (p[1] == q[1]) & (p[2] == q[2]) & (p[3] == q[3]);
}

@PhilippRados
Contributor

PhilippRados commented Jan 30, 2025

@nikic I looked at the existing code in MergeICmps, which uses BCECmpChains to track comparisons between different pointers at the same offsets across basic blocks. That pass currently seems overly restrictive: it can only be triggered by phi nodes (not by a select like in the example above), the blocks may only contain a single compare (or be splittable), and the phi value has to come directly from another icmp.
Because of this, and to get a feel for the optimization, I implemented a rough version that runs after the existing merge process. This is the current implementation for single basic blocks; let me know if you have any thoughts:

  1. Iterate through the instructions in reverse to find the root select (similar to the existing merge, which is triggered by a phi)
  2. Recursively check its operands to verify that they have the correct form and share the same base pointer (in this case the argument %p), keeping track of each offset from that base pointer
  3. Check that the offsets are adjacent and, if so, merge the compares and emit an appropriate memcmp call (see the C sketch below)
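
In C terms, the rewrite from steps 1-3 amounts to something like this (a sketch, name mine; note the constant can live in read-only data instead of a stack slot):

#include <string.h>

int is_all_ones_memcmp(unsigned char *p) {
    static const unsigned char k[2] = {0xFF, 0xFF}; /* becomes a private constant global */
    return memcmp(p, k, sizeof k) == 0;
}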

The emitted IR for the example from above looks like this:

define dso_local range(i32 0, 2) i32 @is_all_ones(ptr nocapture noundef readonly %p) local_unnamed_addr {
entry:
  %2 = alloca [2 x i8], align 1
  store [2 x i8] c"\FF\FF", ptr %2, align 1
  %memcmp = call i32 @memcmp(ptr %p, ptr %2, i64 2)
  %3 = icmp eq i32 %memcmp, 0
  %land.ext = zext i1 %3 to i32
  ret i32 %land.ext
}

The emitted assembly looks like this:

	movw	$-1, -2(%rbp)
	xorl	%eax, %eax
	cmpw	$-1, (%rdi)
	sete	%al
	retq

The first movw isn't necessary, but I guess it isn't optimized away since the memory lives outside of SSA. Is there a way to emit a pointer to the constant array directly in the IR without allocating? Or can I annotate the alloca to tell the compiler that it is only used for the memcmp and can otherwise be deleted?

Next I'll look into combining this merge type with the existing code (I have to refine the implementation a lot more to handle edge cases first). This might take a while since I'm busy next week, but I'll submit a draft once I have a basic working pass.
