Skip to content

Consider compiling SIMD-from-SIMD initialisation directly to a shuffle #18148

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
huonw opened this issue Oct 19, 2014 · 4 comments
Closed

Consider compiling SIMD-from-SIMD initialisation directly to a shuffle #18148

huonw opened this issue Oct 19, 2014 · 4 comments
Labels
A-codegen Area: Code generation T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@huonw
Copy link
Member

huonw commented Oct 19, 2014

#![crate_type = "lib"]
#![feature(tuple_indexing)]

use std::simd::f32x4;

pub fn foo(x: f32x4) -> f32x4 {
    f32x4(x.0, x.2, x.3, x.1)
}

becomes, with no optimisations:

define <4 x float> @_ZN3foo20h2254f602671f886ceaaE(<4 x float>) unnamed_addr #0 {
entry-block:
  %sret_slot = alloca <4 x float>
  %x = alloca <4 x float>
  store <4 x float> %0, <4 x float>* %x
  %1 = getelementptr inbounds <4 x float>* %sret_slot, i32 0, i32 0
  %2 = getelementptr inbounds <4 x float>* %x, i32 0, i32 0
  %3 = load float* %2
  store float %3, float* %1
  %4 = getelementptr inbounds <4 x float>* %sret_slot, i32 0, i32 1
  %5 = getelementptr inbounds <4 x float>* %x, i32 0, i32 2
  %6 = load float* %5
  store float %6, float* %4
  %7 = getelementptr inbounds <4 x float>* %sret_slot, i32 0, i32 2
  %8 = getelementptr inbounds <4 x float>* %x, i32 0, i32 3
  %9 = load float* %8
  store float %9, float* %7
  %10 = getelementptr inbounds <4 x float>* %sret_slot, i32 0, i32 3
  %11 = getelementptr inbounds <4 x float>* %x, i32 0, i32 1
  %12 = load float* %11
  store float %12, float* %10
  %13 = load <4 x float>* %sret_slot
  ret <4 x float> %13
}

with optimisations it becomes

define <4 x float> @_ZN3foo20h2254f602671f886ceaaE(<4 x float>) unnamed_addr #0 {
entry-block:
  %sret_slot.12.vec.insert = shufflevector <4 x float> %0, <4 x float> undef, <4 x i32> <i32 0, i32 2, i32 3, i32 1>
  ret <4 x float> %sret_slot.12.vec.insert
}

We could detect when a SIMD vector is being created directly from elements of another (pair of*) SIMD vector(s) and convert it directly into the appropriate shuffle instruction. This will save allocas and LLVM doing work, and probably guarantees it more than LLVM currently does. This should even work for vectors of different lengths, as long as the elements are the same.

(This is just a bug since it's an implementation detail.)

*shufflevector actually takes two operands, so f32x2(x.0, y.0, x.1, y.1) can also directly become a shuffle.

@eddyb
Copy link
Member

eddyb commented Oct 19, 2014

If we would stop using allocas in that case, it would be much easier for LLVM to combine extractelement and insertelement into shufflevector.
I have somewhat of a plan for that, and @luqmana has shown interest in helping, but I can't promise anything concrete (I've wanted to do it for more than half an year now, with #12424 and #13004 as my first attempts).

@steveklabnik
Copy link
Member

Two things:

  1. simd has been moved out, and @huonw, you've been doing a bunch of work on stuff, so I don't know how much of a bug this still is
  2. the unoptimized version seems to remove one getelementptr now:
define <4 x float> @_ZN3foo20h88e8b14ab8916c70faaE(<4 x float>) unnamed_addr #0 {
entry-block:
  %dropflag_hint_7 = alloca i8
  %x = alloca <4 x float>
  store i8 61, i8* %dropflag_hint_7
  store <4 x float> %0, <4 x float>* %x, align 16
  %1 = getelementptr inbounds <4 x float>, <4 x float>* %x, i32 0, i32 0
  %2 = load float, float* %1, align 4
  %3 = insertelement <4 x float> undef, float %2, i64 0
  %4 = getelementptr inbounds <4 x float>, <4 x float>* %x, i32 0, i32 2
  %5 = load float, float* %4, align 4
  %6 = insertelement <4 x float> %3, float %5, i64 1
  %7 = getelementptr inbounds <4 x float>, <4 x float>* %x, i32 0, i32 3
  %8 = load float, float* %7, align 4
  %9 = insertelement <4 x float> %6, float %8, i64 2
  %10 = getelementptr inbounds <4 x float>, <4 x float>* %x, i32 0, i32 1
  %11 = load float, float* %10, align 4
  %12 = insertelement <4 x float> %9, float %11, i64 3
  ret <4 x float> %12
}

updated code:

#![crate_type = "lib"]
#![feature(core_simd)]

use std::simd::f32x4;

pub fn foo(x: f32x4) -> f32x4 {
        f32x4(x.0, x.2, x.3, x.1)
}

@huonw
Copy link
Member Author

huonw commented Oct 27, 2015

simd has been moved out, and @huonw, you've been doing a bunch of work on stuff, so I don't know how much of a bug this still is

This is a code-generation thing, so external SIMD would still benefit.

the unoptimized version seems to remove one getelementptr now:

There's still 4 that could theoretically be removed.

(Thanks for prompting for updates!)

@brson brson added I-wishlist T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Mar 23, 2017
@brson
Copy link
Contributor

brson commented Mar 23, 2017

Closing wishlist.

@brson brson closed this as completed Mar 23, 2017
lnicola pushed a commit to lnicola/rust that referenced this issue Sep 25, 2024
fix: When checking for forbidden expr kind matches, account for rawness

An expression starting with `r#const` etc. should be accepted even in edition <=2021.

Fixes rust-lang#18148.

This was not fixed when testing with edition 2024, I wonder whether that means our check for edition is incorrect...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-codegen Area: Code generation T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

4 participants