
Missed optimization: per-byte comparisons are not always combined #117853


Open

purplesyringa opened this issue Nov 27, 2024 · 8 comments · May be fixed by #133817

Comments

@purplesyringa

Reproducer: https://godbolt.org/z/6caGGoo1e

int is_all_ones(unsigned char *p) {
    unsigned char a = p[0];
    unsigned char b = p[1];
    return a == 255 && b == 255;
}

I expected this to compile to a single read + comparison on platforms that support unaligned accesses. Instead, this happened:

is_all_ones:
        movzx   ecx, byte ptr [rdi + 1]
        and     cl, byte ptr [rdi]
        xor     eax, eax
        cmp     cl, -1
        sete    al
        ret
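
In other words, I'd expect codegen equivalent to this hand-combined version (a sketch, function name mine; the memcpy keeps the unaligned two-byte load well-defined, and comparing against 0xFFFF is endianness-agnostic since both bytes are equal):

#include <stdint.h>
#include <string.h>

int is_all_ones_combined(unsigned char *p) {
    uint16_t v;
    memcpy(&v, p, sizeof v); /* lowered to one (possibly unaligned) 16-bit load */
    return v == 0xFFFF;      /* one comparison instead of two */
}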
purplesyringa changed the title from "Missed optimization: per-byte comparisons are not always combined properly" to "Missed optimization: per-byte comparisons are not always combined" on Nov 27, 2024
@RKSimon
Collaborator

RKSimon commented Nov 27, 2024

define range(i32 0, 2) i32 @is_all_ones(ptr nocapture noundef readonly %p) {
entry:
  %0 = load i8, ptr %p, align 1
  %arrayidx1 = getelementptr inbounds i8, ptr %p, i64 1
  %1 = load i8, ptr %arrayidx1, align 1
  %cmp = icmp eq i8 %0, -1
  %cmp3 = icmp eq i8 %1, -1
  %2 = select i1 %cmp, i1 %cmp3, i1 false
  %conv4 = zext i1 %2 to i32
  ret i32 %conv4
}

@PhilippRados
Contributor

@purplesyringa so you mean emitting only a single word-sized read when both elements are adjacent like this?

_is_all_ones:
        movzx  ecx, word ptr [rdi]
        xor    eax, eax
        cmp    cx, -1
        sete   al
        ret

@purplesyringa
Author

Yes (obviously, with arbitrary constants, not just -1). Ideally,

int is_all_ones(unsigned char *p, unsigned char *q) {
    return (p[0] == q[0]) & (p[1] == q[1]);
}

would compile to a single comparison, too.
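
For the two-pointer case, the hand-combined form I have in mind would look roughly like this (a sketch, names mine):

#include <stdint.h>
#include <string.h>

int is_equal2(unsigned char *p, unsigned char *q) {
    uint16_t a, b;
    memcpy(&a, p, sizeof a); /* one 16-bit load covering p[0..1] */
    memcpy(&b, q, sizeof b); /* one 16-bit load covering q[0..1] */
    return a == b;           /* single word comparison */
}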

@PhilippRados
Contributor

@RKSimon Do you have a suggestion for where this should be implemented (or whether it should be at all)? I saw that there was once a dedicated LoadCombine pass, but it was removed.

According to this discussion, merging adjacent loads was still being performed (as of 2019), but I couldn't pinpoint where exactly (InstCombine, MemCpyOptimizer, SelectionDAG?).

@nikic
Contributor

nikic commented Jan 26, 2025

In the middle-end, a possible place to do it would be AggressiveInstCombine, which already has a transform for merging loads (but not targeting comparisons).
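
For reference, the existing load-merging fold there targets the classic or-of-shifted-byte-loads idiom, roughly like this (a sketch, not taken from the pass's tests; name mine):

unsigned read_le16(unsigned char *p) {
    /* AggressiveInstCombine can fold the two byte loads below into one
       16-bit load on targets with fast unaligned access; the icmp-based
       pattern from this issue is not matched. */
    return (unsigned)p[0] | ((unsigned)p[1] << 8);
}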

@nikic
Contributor

nikic commented Jan 26, 2025

We could also view this in terms of forming a memcmp (which later gets re-expanded) via MergeICmps. I think that pass may be more convenient in terms of existing infrastructure for matching such patterns (including the non-trivial cases spread across many blocks).
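
The canonical shape MergeICmps handles is a chain of field compares spread across blocks by short-circuiting, e.g. (a sketch, names mine):

struct S { int a, b, c; };

int same(const struct S *x, const struct S *y) {
    /* && splits this into one block per compare; MergeICmps recognizes
       the chain and emits a single memcmp over the 12 contiguous bytes
       (no padding between the int fields here). */
    return x->a == y->a && x->b == y->b && x->c == y->c;
}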

@RKSimon
Collaborator

RKSimon commented Jan 27, 2025

FTR, if this had been 4 or more (power-of-2) comparisons, SLP would have caught it, but it avoids creating reductions of 2-element vectors.

int is_all_ones(unsigned char *p, unsigned char *q) {
    return (p[0] == q[0]) & (p[1] == q[1]) & (p[2] == q[2]) & (p[3] == q[3]);
}

@PhilippRados
Contributor

PhilippRados commented Jan 30, 2025

@nikic I looked at the existing code in MergeICmps, which uses BCECmpChains to track comparisons between different pointers at the same offsets across basic blocks. That pass currently seems overly restrictive: it can only be triggered by phi nodes (not by a select like in the example above), the blocks may only contain a single compare (or be splittable), and the phi value has to come directly from another icmp.
Because of this, and to get a feel for the optimization, I implemented a rough version that runs after the existing merge process. This is the current implementation for single basic blocks; let me know if you have any thoughts:

  1. Iterate through the instructions in reverse to find the root select (similar to the existing merge, which is triggered by a phi)
  2. Recursively check its operands to verify that they have the correct form and share the same base pointer (in this case the argument %p), keeping track of each offset from that base pointer
  3. Check that the offsets are adjacent and, if so, merge the compares and emit an appropriate memcmp call (see the C sketch below)
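
In C terms, the rewrite from steps 1-3 amounts to something like this (a sketch, name mine; note the constant can live in read-only data instead of a stack slot):

#include <string.h>

int is_all_ones_memcmp(unsigned char *p) {
    static const unsigned char k[2] = {0xFF, 0xFF}; /* becomes a private constant global */
    return memcmp(p, k, sizeof k) == 0;
}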

The emitted IR for the example from above looks like this:

define dso_local range(i32 0, 2) i32 @is_all_ones(ptr nocapture noundef readonly %p) local_unnamed_addr {
entry:
  %2 = alloca [2 x i8], align 1
  store [2 x i8] c"\FF\FF", ptr %2, align 1
  %memcmp = call i32 @memcmp(ptr %p, ptr %2, i64 2)
  %3 = icmp eq i32 %memcmp, 0
  %land.ext = zext i1 %3 to i32
  ret i32 %land.ext
}

The emitted assembly looks like this:

	movw	$-1, -2(%rbp)
	xorl	%eax, %eax
	cmpw	$-1, (%rdi)
	sete	%al
	retq

The first movw isn't necessary, but I guess it isn't optimized away since the memory lives outside of SSA. Is there a way to emit a pointer to the constant array directly in the IR without allocating? Or can I annotate the alloca to tell the compiler that it is only used for the memcmp and can otherwise be deleted?

Next I'll look into combining this merge type with the existing code (I have to refine the implementation a lot more to handle edge cases first). This might take a while since I'm busy next week, but I'll submit a draft once I have a basic working pass.
