
LLVM pointer range loop / autovectorization regression #35662

Closed

Description

@bluss (Member)

The benchmarks in crate matrixmultiply version 0.1.8 degrade with MIR enabled. (commit bluss/matrixmultiply@3d83647)

Tested using rustc 1.12.0-nightly (1deb02ea6 2016-08-12).

Typical output:

// -C target-cpu=native
test mat_mul_f32::m127             ... bench:   2,703,773 ns/iter (+/- 636,432)
// -Z orbit=off -C target-cpu=native
test mat_mul_f32::m127             ... bench:     648,817 ns/iter (+/- 22,379)

Sure, the matrix multiplication kernel uses some major muckery that it expects the compiler to optimize down and autovectorize, but since it is technically a regression, it gets a report.

Activity

pmarcelll (Contributor) commented on Aug 14, 2016

This might be caused by the LLVM upgrade.

rustc 1.12.0-nightly (7333c4a 2016-07-31) (before the LLVM upgrade):

// -C target-cpu=native
test mat_mul_f32::m127             ... bench:     162,334 ns/iter (+/- 16,429)
// -Z orbit=on -C target-cpu=native
test mat_mul_f32::m127             ... bench:     162,855 ns/iter (+/- 7,302)

rustc 1.12.0-nightly (28ce3e8 2016-08-01) (after the LLVM upgrade, old-trans is still the default):

// -C target-cpu=native
test mat_mul_f32::m127             ... bench:     169,562 ns/iter (+/- 11,118)
// -Z orbit=on -C target-cpu=native
test mat_mul_f32::m127             ... bench:     570,056 ns/iter (+/- 34,4)

Measured on an Intel Haswell processor.

bluss (Member, Author) commented on Aug 14, 2016

Thank you, great info.

added the T-compiler label (Relevant to the compiler team, which will review and decide on the PR/issue) on Aug 14, 2016
changed the title from "MIR autovectorization regression" to "LLVM autovectorization regression" on Aug 14, 2016

eddyb (Member) commented on Aug 14, 2016

Thanks to @dikaiosune I produced this minimization (for a different case): https://godbolt.org/g/3Nuofl. It seems to reproduce when comparing clang 3.8 against LLVM 3.9's opt -O3.

eddyb changed the title from "LLVM autovectorization regression" to "LLVM/MIR autovectorization regression" on Aug 14, 2016

eddyb (Member) commented on Aug 14, 2016

@pmarcelll I misread some of those reports, so it's a combination of LLVM 3.9 and MIR trans being used?

On a Haswell, -C target-cpu=native should imply sse4.2 AFAIK, so I'm not sure the linked issue is behind the original problem observed here, as opposed to only @dikaiosune's case.

added the A-LLVM label (Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues) and the A-MIR label (Area: Mid-level IR (MIR) - https://blog.rust-lang.org/2016/04/19/MIR.html) on Aug 14, 2016

bluss (Member, Author) commented on Aug 15, 2016

@eddyb Haswell has avx + avx2 too, doesn't it? So it should imply those as well.

pmarcelll (Contributor) commented on Aug 15, 2016

If my benchmarking is correct and i386 means no SIMD at all, then it's not just an autovectorization regression.

rustc 1.12.0-nightly (7333c4a 2016-07-31):

// -C target-cpu=i386
test mat_mul_f32::m127             ... bench:     209,399 ns/iter (+/- 13,055)
// -Z orbit=on -C target-cpu=i386
test mat_mul_f32::m127             ... bench:     205,028 ns/iter (+/- 11,386)

rustc 1.12.0-nightly (28ce3e8 2016-08-01):

// -C target-cpu=i386
test mat_mul_f32::m127             ... bench:     205,863 ns/iter (+/- 14,618)
// -Z orbit=on -C target-cpu=i386
test mat_mul_f32::m127             ... bench:     512,995 ns/iter (+/- 14,309)

The last one is especially interesting because the i386 version is faster than the haswell version (512,995 ns/iter vs. 570,056 ns/iter).

EDIT: same on the latest nightly.

[30 remaining items not shown]

arielb1 (Contributor) commented on Aug 21, 2016

Problem code:

#![feature(test)]
extern crate test;
use test::Bencher;

pub type T = f32;

const MR: usize = 4;
const NR: usize = 4;

macro_rules! loop4 {
    ($i:ident, $e:expr) => {{
        let $i = 0; $e;
        let $i = 1; $e;
        let $i = 2; $e;
        let $i = 3; $e;
    }}
}

/// 4x4 matrix multiplication kernel
///
/// This does the matrix multiplication:
///
/// C ← α A B
///
/// + k: length of data in a, b
/// + a, b are packed
/// + c has general strides
/// + rsc: row stride of c
/// + csc: col stride of c
#[inline(never)]
pub unsafe fn kernel(k: usize, alpha: T, a: *const T, b: *const T,
                     c: *mut T, rsc: isize, csc: isize)
{
    let mut ab = [[0.; NR]; MR];
    let mut a = a;
    let mut b = b;

    // Compute matrix multiplication into ab[i][j]
    for _ in 0..k {
        let v0: [_; MR] = [at(a, 0), at(a, 1), at(a, 2), at(a, 3)];
        let v1: [_; NR] = [at(b, 0), at(b, 1), at(b, 2), at(b, 3)];
        loop4!(i, loop4!(j, ab[i][j] += v0[i] * v1[j]));

        a = a.offset(MR as isize);
        b = b.offset(NR as isize);
    }

    macro_rules! c {
        ($i:expr, $j:expr) => (*c.offset(rsc * $i as isize + csc * $j as isize));
    }

    // set C = α A B
    for i in 0..MR {
        for j in 0..NR {
            c![i, j] = alpha * ab[i][j];
        }
    }
}

#[inline(always)]
unsafe fn at(ptr: *const T, i: usize) -> T {
    *ptr.offset(i as isize)
}

#[test]
fn test_gemm_kernel() {
    let k = 4;
    let mut a = [1.; 16];
    let mut b = [0.; 16];
    for (i, x) in a.iter_mut().enumerate() {
        *x = i as f32;
    }

    for i in 0..4 {
        b[i + i * 4] = 1.;
    }
    let mut c = [0.; 16];
    unsafe {
        kernel(k, 1., &a[0], &b[0], &mut c[0], 1, 4);
        // col major C
    }
    assert_eq!(&a, &c);
}

#[bench]
fn bench_gemm(bench: &mut Bencher) {
    const K: usize = 32;
    let mut a = [1.; MR * K];
    let mut b = [0.; NR * K];
    for (i, x) in a.iter_mut().enumerate() {
        *x = i as f32;
    }

    for i in 0..NR {
        b[i + i * K] = 1.;
    }
    let mut c = [0.; NR * MR];
    bench.iter(|| {
        unsafe {
            kernel(K, 1., &a[0], &b[0], &mut c[0], 1, 4);
        }
        c
    });
}
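
For orientation, the loop4! invocation in the hot loop above is fully unrolled at macro-expansion time. Below is a hand-expanded sketch of what LLVM actually sees (my illustration, not part of the original report):

// Hand expansion of loop4!(i, loop4!(j, ab[i][j] += v0[i] * v1[j])):
// sixteen straight-line multiply-accumulates that the autovectorizer
// is expected to fuse into four 4-lane SIMD updates.
fn unrolled_update(ab: &mut [[f32; 4]; 4], v0: [f32; 4], v1: [f32; 4]) {
    ab[0][0] += v0[0] * v1[0];
    ab[0][1] += v0[0] * v1[1];
    ab[0][2] += v0[0] * v1[2];
    ab[0][3] += v0[0] * v1[3];
    ab[1][0] += v0[1] * v1[0];
    ab[1][1] += v0[1] * v1[1];
    ab[1][2] += v0[1] * v1[2];
    ab[1][3] += v0[1] * v1[3];
    ab[2][0] += v0[2] * v1[0];
    ab[2][1] += v0[2] * v1[1];
    ab[2][2] += v0[2] * v1[2];
    ab[2][3] += v0[2] * v1[3];
    ab[3][0] += v0[3] * v1[0];
    ab[3][1] += v0[3] * v1[1];
    ab[3][2] += v0[3] * v1[2];
    ab[3][3] += v0[3] * v1[3];
}
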
eddyb (Member) commented on Aug 21, 2016

@arielb1 The root problem can be seen in the IR generated by #35662 (comment), which is left with a constant-length memset that isn't removed due to pass-ordering problems. Solving that should help the more complex matrix multiplication code.
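
As a rough illustration of the pattern, a minimal sketch of the assumed shape (the actual IR is in the linked comment):

// Hypothetical reduced shape, not the original testcase: the
// `[0.0f32; 4]` initializer is lowered by MIR trans to a constant-length
// llvm.memset of the stack slot. For `acc` to be promoted to registers
// and the loops vectorized, GVN needs to forward the memset's zeros into
// the first loads of `acc`, after which the memset is dead; with the
// LLVM 3.9 pass ordering that folding happened too late, so the memset
// and the stack traffic around it survived into machine code.
#[inline(never)]
pub fn row_sums(a: &[f32; 16]) -> [f32; 4] {
    let mut acc = [0.0f32; 4];
    for i in 0..4 {
        for j in 0..4 {
            acc[i] += a[i * 4 + j];
        }
    }
    acc
}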

brson (Contributor) commented on Aug 25, 2016

Is this fixed after #35740?

Edit: Seems not.

added the I-slow label (Issue: Problems and improvements with respect to performance of generated code) on Aug 25, 2016

eddyb (Member) commented on Aug 26, 2016

I've experimented with this change to LLVM:

diff --git a/lib/Transforms/IPO/PassManagerBuilder.cpp b/lib/Transforms/IPO/PassManagerBuilder.cpp
index df6a48e..da420f3 100644
--- a/lib/Transforms/IPO/PassManagerBuilder.cpp
+++ b/lib/Transforms/IPO/PassManagerBuilder.cpp
@@ -317,6 +317,9 @@ void PassManagerBuilder::addFunctionSimplificationPasses(
   // Run instcombine after redundancy elimination to exploit opportunities
   // opened up by them.
   addInstructionCombiningPass(MPM);
+  if (OptLevel > 1) {
+    MPM.add(createGVNPass(DisableGVNLoadPRE));  // Remove redundancies
+  }
   addExtensionsToPM(EP_Peephole, MPM);
   MPM.add(createJumpThreadingPass());         // Thread jumps
   MPM.add(createCorrelatedValuePropagationPass());

It seems to result in the constant-length memset being removed in the simpler cases.

However, I redid the reduction and ended up with something similar.
That testcase runs in 2ns with old trans (beta) and 9ns with MIR trans plus the modified LLVM.
The only real difference is the nesting of the ab array, which optimizes really poorly.
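
For concreteness, the nesting in question (the flat variant is my contrast for illustration; the issue only discusses the nested form):

fn accumulator_shapes() {
    // The accumulator shape used by `kernel` above, which optimized
    // poorly under MIR trans:
    let mut nested = [[0.0f32; 4]; 4]; // [[T; NR]; MR]
    // The same 16 floats in a single-level array, shown for contrast
    // (hypothetical; index as flat[i * 4 + j]):
    let mut flat = [0.0f32; 16];
    nested[0][0] += 1.0;
    flat[0] += 1.0;
}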

eddyb (Member) commented on Aug 29, 2016

I found the remaining problem: initializing the arrays right now uses < (UGT) while our iterators and C++ use != (NE) for the stop condition of the pointer.
Fixing that and running GVN twice restores the performance of @bluss' benchmark.
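
To make the two stop conditions concrete, a sketch with hypothetical helper loops (not compiler output):

// The `<` form that MIR trans emitted for array initialization. An
// unsigned icmp exit test forces LLVM to reason about strides that
// could step past `end`, which gets in the way of exact trip-count
// analysis:
unsafe fn fill_lt(mut p: *mut f32, end: *mut f32) {
    while p < end {
        *p = 0.0;
        p = p.offset(1);
    }
}

// The `!=` form used by slice iterators and idiomatic C++: with a unit
// stride and `end = p + N`, an icmp ne exit yields an exact trip count,
// the shape LLVM's loop passes handle best:
unsafe fn fill_ne(mut p: *mut f32, end: *mut f32) {
    while p != end {
        *p = 0.0;
        p = p.offset(1);
    }
}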

nikomatsakis (Contributor) commented on Aug 31, 2016

I found the remaining problem: initializing the arrays right now uses < (UGT) while our iterators and C++ use != (NE) for the stop condition of the pointer.

This is so beautifully fragile.

eddyb (Member) commented on Aug 31, 2016

@nikomatsakis See #36124 (comment) for a quick explanation of why LLVM's reluctance is correct in general (even though it has enough information to optimize nested < loops working on nested local arrays).

bluss (Member, Author) commented on Oct 19, 2016

More or less reopened this issue as #37276. It's not affecting matrixmultiply because I think the uninitialized + assignments workaround is sound (until they take uninitialized away from us).

This issue is left closed since it did end up finding & fixing a problem.
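
For reference, a sketch of the workaround described above, under the assumption that it replaces the zeroing array literal with mem::uninitialized plus explicit stores (mem::uninitialized was later deprecated in favor of MaybeUninit):

use std::mem;

// Hedged sketch of the "uninitialized + assignments" pattern: instead of
// `let mut ab = [[0.; NR]; MR];`, whose lowering produced the problematic
// memset, the array starts uninitialized and each element is assigned
// explicitly, so the zeroing becomes plain stores the optimizer handles well.
unsafe fn zeroed_accumulator() -> [[f32; 4]; 4] {
    let mut ab: [[f32; 4]; 4] = mem::uninitialized();
    for i in 0..4 {
        for j in 0..4 {
            ab[i][j] = 0.;
        }
    }
    ab
}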


Metadata

Labels

A-LLVM (Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues)
A-MIR (Area: Mid-level IR (MIR) - https://blog.rust-lang.org/2016/04/19/MIR.html)
I-slow (Issue: Problems and improvements with respect to performance of generated code)
P-high (High priority)
T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue)
regression-from-stable-to-beta (Performance or correctness regression from stable to beta)

Participants

@eddyb, @brson, @nikomatsakis, @arielb1, @pmarcelll