
[LV] Maximum VF does not consider scaled reductions #141768

Open

Description

@preames

Reproducer: https://godbolt.org/z/4xf7c8GMM

It looks like the vectorizer has not yet been updated to consider scaled reductions (a.k.a. multiply-accumulate with extended operands) in the VF selection logic. In this case, if my tracing through the debug output is correct, we treat the widest type in the loop as i32 and pick the maximum VF to cost based on that type. The result is a loop that runs at 1/4 of the width it should. It's still more profitable than not using the zvqdotq (scaled reduction) lowering, but it isn't ideal either.
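
For intuition, the width arithmetic looks roughly like the sketch below (illustrative only, not the actual LoopVectorize code; a 128-bit register is assumed purely for the example):

// Illustrative sketch of widest-type-driven max VF selection, not the real
// LoopVectorize logic. With a scaled reduction, each i32 accumulator lane
// folds four i8 products, so bounding the VF by the 32-bit accumulator type
// leaves 3/4 of the datapath idle.
unsigned max_vf(unsigned reg_bits, unsigned widest_type_bits) {
  return reg_bits / widest_type_bits;
}
// max_vf(128, 32) == 4   /* today: the widest in-loop type is the i32 accumulator */
// max_vf(128, 8)  == 16  /* what the i8 inputs to the dot product would allow */

The reproducer: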

int doti32_i8_sext(char *a, char *b, int N) {
  int sum = 0;
  for (int i = 0; i < N; i++) {
    int a32 = a[i];
    int b32 = b[i];
    sum += a32 * b32;
  }
  return sum;
}

clang --target=riscv64 -march=rv64gcv_zvqdotq0p0 -menable-experimental-extensions dot.c -S -o - -O3

# Relevant Loop Only
.LBB0_5:
        vsetvli a5, zero, e8, mf2, ta, ma
        vle8.v  v9, (a3)
        vle8.v  v10, (a4)
        add     a4, a4, t0
        vsetvli a5, zero, e32, mf2, ta, ma
        vqdotu.vv       v8, v10, v9
        add     a3, a3, t0
        bne     a4, a7, .LBB0_5

A better result would be:

        vsetvli a3, zero, e32, m2, ta, ma
        ....
.LBB1_5:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        vl2r.v  v10, (a3)
        vl2r.v  v12, (a4)
        add     a4, a4, a5
        vqdotu.vv       v8, v12, v10
        add     a3, a3, a5
        bne     a4, a7, .LBB1_5
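
To make the width argument concrete, here is a scalar model of what I understand one vqdot-style accumulation step to do per 32-bit lane (a sketch of the semantics, not a reference implementation; the signed variant is shown, vqdotu is analogous with zero-extension; see the zvqdotq spec for the authoritative definition):

#include <stdint.h>

// Sketch: each 32-bit accumulator lane folds FOUR 8-bit x 8-bit products,
// so the loads consume i8 elements, and it is the i8 input width, not the
// i32 accumulator width, that should bound the vectorization factor.
static int32_t qdot_lane(int32_t acc, const int8_t a[4], const int8_t b[4]) {
  for (int k = 0; k < 4; k++)
    acc += (int32_t)a[k] * (int32_t)b[k];
  return acc;
}

Since the accumulator needs only one 32-bit lane per four input bytes, the e32/m2 loop above can consume four times as many bytes per iteration as the e32/mf2 one, which matches the 4x gap described above.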
