proposal: simd: new package for intrinsics #67520

Open
@Clement-Jean

Background

After doing a little bit of research on previous issues mentioning adding SIMD to Go, I found the following proposals:

#58610: This proposal relies on the ispmd keyword which does not follow the compatibility guarantees for Go 1.

#53171: This proposal is closer to what I am proposing, however, there are multiple packages per CPU architecture, and I'm providing a more complete explanation on how to achieve adding SIMD to the Go standard library.

#64634: This proposal is more specific to the crypto package but in essence this is close to what I'm proposing here.

Goals

The main goal of this proposal is to provide an alternative approach to designing a simd package for the Go standard library. As of right now, there is no consensus on what the API should look like and this proposal intends to drive the discussion further.

This proposal mainly includes two things:

  • Adding a new kind of build tag that lets the user specify which SIMD ISA to use at compile time.
  • Using compiler intrinsics to generate inline SIMD instructions in the code.

Build Tag

For the first point, I was thinking about optional build tags like the following:

//go:simd sse2
//go:simd neon
//go:simd avx512

etc.

As mentioned, these build tags are optional. If not specified, we would resolve to the appropriate SIMD ISA available on the current OS and ARCH. However, I still think that these build tags are needed for deeper optimization and platform-specific operations. If we know that some instruction is more performant or only available on a certain architecture, we should be able to enforce using it manually.

Finally, having the optional build tag would let the developer choose, at compile time, which SIMD ISA to target and thus cross compile. We could write something similar to:

$ go build -simd neon ...

With this, we could take advantage of platform-specific features and know at compile time the size of the vector registers (e.g., 128, 256, or 512 bits). This would help us make better decisions for optimizations on the compiler side.
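To make the register-size point concrete, here is a minimal sketch of the information a build-time ISA choice would pin down. The helper names vectorBits and lanes are hypothetical and not part of the proposal; the widths are the usual ones for these ISAs (avx2 is added only to illustrate the 256-bit case).

```go
package main

import "fmt"

// vectorBits maps a SIMD ISA name to the width of its vector registers in bits.
// Hypothetical helper for illustration; not part of the proposed API.
func vectorBits(isa string) (int, bool) {
	switch isa {
	case "sse2", "neon":
		return 128, true
	case "avx2":
		return 256, true
	case "avx512":
		return 512, true
	}
	return 0, false
}

// lanes reports how many elements of elemBits bits fit in one vector register.
func lanes(isa string, elemBits int) (int, bool) {
	bits, ok := vectorBits(isa)
	if !ok || elemBits <= 0 {
		return 0, false
	}
	return bits / elemBits, true
}

func main() {
	for _, isa := range []string{"sse2", "neon", "avx2", "avx512"} {
		n, _ := lanes(isa, 8)
		fmt.Printf("%s: %d uint8 lanes per register\n", isa, n)
	}
}
```

With the ISA fixed at compile time, a value like this becomes a constant the compiler can rely on instead of a runtime lookup.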

Compiler Intrinsics

The next crucial step would be to create a portable SIMD package that would rely on the compiler to generate SIMD instructions through compiler intrinsics. I demonstrated that this is feasible with a POC. As of right now, it looks like the following (you can see more examples here, including UTF8 string validation):

package main

import (
    "fmt"
    "simd"
)

func main() {
    a := simd.Uint8x16{...}
    b := simd.Uint8x16{...}
    c := simd.AddU8x16(a, b)

    fmt.Printf("%v\n", c)
}

And here, the AddU8x16 gets lowered down to a vector add instruction after SSA lowering.
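For readers who want the semantics spelled out, the following is a scalar reference model of what such an add computes, assuming the usual wrapping (non-saturating) byte add. It is plain Go written for illustration, not the intrinsic from the POC, and the array-based Uint8x16 here is an assumption.

```go
package main

import "fmt"

// Uint8x16 models a 128-bit vector of 16 unsigned bytes as a plain array.
// This is an illustrative stand-in, not the type from the POC.
type Uint8x16 [16]uint8

// AddU8x16 is a scalar reference for a lane-wise add:
// each lane is added independently and wraps modulo 256.
func AddU8x16(a, b Uint8x16) Uint8x16 {
	var c Uint8x16
	for i := range a {
		c[i] = a[i] + b[i]
	}
	return c
}

func main() {
	a := Uint8x16{0: 250, 1: 1}
	b := Uint8x16{0: 10, 1: 2}
	fmt.Println(AddU8x16(a, b))
}
```

An intrinsic version would replace the loop with a single vector instruction while keeping the same lane-wise result.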

Notes

  • We can provide functions like AddU8x16, AddU8x32, etc. without changing the generics implementation. Other implementations like Highway, Rust std::simd, and Zig @Vector rely on generics for the API. In Go, we do not have non-type parameters in generics; thus we cannot have something like Simd[uint8, 16].

  • Also, we do not have a compile-time Sizeof that works on type parameters, which could help us have:

    type Simd[T SupportedSimdTypes] = [VectorRegisterSize / SizeofInBits(T)]T
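As a small sketch of where the limit sits: for concrete element types, unsafe.Sizeof is a constant expression and can size an array, so per-type definitions can be derived from a register width. What is missing is the generic form above, because Sizeof of a type parameter is not a constant. The constant VectorRegisterBytes is an assumption (a 128-bit register).

```go
package main

import (
	"fmt"
	"unsafe"
)

// VectorRegisterBytes is an assumed register width of 128 bits (16 bytes).
const VectorRegisterBytes = 16

// For concrete types, unsafe.Sizeof yields a compile-time constant,
// so it is allowed in an array length.
type Uint8x16 [VectorRegisterBytes / unsafe.Sizeof(uint8(0))]uint8
type Uint32x4 [VectorRegisterBytes / unsafe.Sizeof(uint32(0))]uint32
type Float64x2 [VectorRegisterBytes / unsafe.Sizeof(float64(0))]float64

// The generic equivalent does not compile, because the size of a
// type parameter is not a constant:
//
//	type Simd[T any] [VectorRegisterBytes / unsafe.Sizeof(*new(T))]T

func main() {
	fmt.Println(len(Uint8x16{}), len(Uint32x4{}), len(Float64x2{}))
}
```

This is why the proposal spells out one named type and one function per shape instead of a single generic Simd type.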

Philosophy

It is important to understand that this proposal is not describing an abstraction of SIMD features. Such an abstraction could create noticeable performance differences between ISAs, which is why we are trying to avoid it. This means that if an operation is not available on an ISA, we simply don't provide it. Each intrinsic should map to exactly one underlying instruction, not a sequence of instructions.

Challenges to Overcome

  • Under the hood, the current POC works with pointers to arrays. The main reason is that fixed-size arrays are not currently SSAable. But because these pointers to arrays are stored in general-purpose registers, the performance is not great (allocations are required, and non-contiguous memory has to be loaded) and it requires us to do the LD/ST dance for each function in the simd package.

I believe we would need special named types like the following:

type Int8x16 [16]int8
type Uint8x16 [16]uint8
//...

These types should not be indexable and only instantiable through init functions like Splat8x16, Load8x16, etc. The compiler would then promote these special types to vector registers. This would remove all the LD/ST dance and memory allocation that I have in my POC and thus make everything a lot faster.

In the end the previous code snippet could look like this:

package main

import (
    "fmt"
    "simd"
)

func main() {
    a := simd.SplatU8x16(1)
    b := simd.LoadU8x16([16]uint8{...})
    c := simd.AddU8x16(a, b)

    fmt.Printf("%v\n", c)
}
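One way to approximate "not indexable, only instantiable through init functions" in today's Go is a struct with an unexported field. The sketch below is a scalar model under that assumption; StoreU8x16 is a hypothetical counterpart to LoadU8x16 that the proposal does not name, and no vector-register promotion happens here.

```go
package main

import "fmt"

// Uint8x16 hides its lanes in an unexported field. Code outside the defining
// package could not index it or build it with a composite literal.
type Uint8x16 struct{ lanes [16]uint8 }

// SplatU8x16 broadcasts v to every lane.
func SplatU8x16(v uint8) Uint8x16 {
	var out Uint8x16
	for i := range out.lanes {
		out.lanes[i] = v
	}
	return out
}

// LoadU8x16 builds a vector from an array of 16 bytes.
func LoadU8x16(a [16]uint8) Uint8x16 { return Uint8x16{lanes: a} }

// StoreU8x16 copies the lanes back out (hypothetical name).
func StoreU8x16(v Uint8x16) [16]uint8 { return v.lanes }

// AddU8x16 adds lane by lane, wrapping modulo 256.
func AddU8x16(a, b Uint8x16) Uint8x16 {
	var out Uint8x16
	for i := range out.lanes {
		out.lanes[i] = a.lanes[i] + b.lanes[i]
	}
	return out
}

func main() {
	a := SplatU8x16(1)
	b := LoadU8x16([16]uint8{0: 5, 1: 7})
	fmt.Println(StoreU8x16(AddU8x16(a, b)))
}
```

The compiler support asked for in the proposal would go further than this model by keeping such values in vector registers.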

  • A lot of instructions on arm (and I suppose on other ISAs) are missing from the Go assembler. The current POC is using constants (WORD $0x...) to encode these.

I believe we should avoid having to use constants when defining intrinsics. We could implement the missing instructions along the way with the implementation of intrinsics.

  • Naming these intrinsics is not always easy. For example, NEON has instructions called VMIN and UMINV. The former returns a vector of the lane-wise minimum elements, and the latter reduces a given vector to its minimum element. As we don't have function overloads, we will need to find a way to name them appropriately.

I believe we should try to make the horizontal operations more verbose (e.g. ReduceMin8x16) and promote the vertical ones (e.g. Min8x16). For the example of VMIN and UMINV, the latter does not even seem to exist in SSE2 whereas the first one does.

There are other operations that have "conflicting" names. For example shift right and shift left have both logical and arithmetic shifts. For these cases, I believe we could just call them LogicalShiftRight, ArithmeticShiftRight, ... I agree that this is verbose but it makes it clear what is happening.
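The naming distinction can be sketched with scalar reference functions. The exact names below (with U8/S8 infixes) are hypothetical variations on the ones suggested above, and the loops stand in for what would be single instructions.

```go
package main

import "fmt"

type Uint8x16 [16]uint8
type Int8x16 [16]int8

// MinU8x16 is vertical: lane i of the result is the minimum of a[i] and b[i].
func MinU8x16(a, b Uint8x16) Uint8x16 {
	var out Uint8x16
	for i := range a {
		out[i] = a[i]
		if b[i] < a[i] {
			out[i] = b[i]
		}
	}
	return out
}

// ReduceMinU8x16 is horizontal: it collapses one vector to its smallest lane.
func ReduceMinU8x16(v Uint8x16) uint8 {
	m := v[0]
	for _, x := range v[1:] {
		if x < m {
			m = x
		}
	}
	return m
}

// LogicalShiftRightU8x16 shifts in zeros from the left.
func LogicalShiftRightU8x16(v Uint8x16, n uint) Uint8x16 {
	var out Uint8x16
	for i, x := range v {
		out[i] = x >> n
	}
	return out
}

// ArithmeticShiftRightS8x16 shifts in copies of the sign bit.
func ArithmeticShiftRightS8x16(v Int8x16, n uint) Int8x16 {
	var out Int8x16
	for i, x := range v {
		out[i] = x >> n
	}
	return out
}

func main() {
	fmt.Println(ReduceMinU8x16(Uint8x16{0: 9, 1: 3}))
	fmt.Println(LogicalShiftRightU8x16(Uint8x16{0: 0x80}, 1)[0])
	fmt.Println(ArithmeticShiftRightS8x16(Int8x16{0: -128}, 1)[0])
}
```

In Go the two shifts differ only by the signedness of the element type, which is one reason the explicit names help when both appear side by side.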

  • The current POC did not implement the concept of Masks. This is an important concept but also a tricky one to implement without proper compiler support. After discussion with Jan Wassenberg (author of Highway), I realized that some platforms do not treat masks in the same way. Here is a summary:

    • on NEON, SSE4, and AVX2: 1 bit per bit of the vector.
    • on SVE: 1 bit per byte of vector (variable vector size).
    • on AVX-512, and RVV: 1 bit per lane.
    • AVX-512 has a separate register file for masks.

We could have other named types, like the ones made for vectors. These types would have different shapes and be loaded differently depending on the platform.
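To illustrate two of the shapes listed above, here is a scalar sketch of a comparison that produces either a vector-shaped mask (every bit of a matching lane set, as on NEON/SSE4/AVX2) or a bitmask with one bit per lane (as on AVX-512/RVV). All type and function names are hypothetical.

```go
package main

import "fmt"

type Uint8x16 [16]uint8

// VectorMask8x16 has the same shape as the vector: 0xFF for true, 0x00 for false.
type VectorMask8x16 [16]uint8

// BitMask16 holds one bit per lane; bit i describes lane i.
type BitMask16 uint16

// EqualVectorMask compares lane by lane and returns a vector-shaped mask.
func EqualVectorMask(a, b Uint8x16) VectorMask8x16 {
	var m VectorMask8x16
	for i := range a {
		if a[i] == b[i] {
			m[i] = 0xFF
		}
	}
	return m
}

// EqualBitMask compares lane by lane and returns one bit per lane.
func EqualBitMask(a, b Uint8x16) BitMask16 {
	var m BitMask16
	for i := range a {
		if a[i] == b[i] {
			m |= 1 << uint(i)
		}
	}
	return m
}

// ToBits converts a vector-shaped mask to the one-bit-per-lane form.
func (m VectorMask8x16) ToBits() BitMask16 {
	var out BitMask16
	for i, x := range m {
		if x != 0 {
			out |= 1 << uint(i)
		}
	}
	return out
}

func main() {
	a := Uint8x16{0: 1, 1: 2}
	b := Uint8x16{0: 1, 1: 3}
	fmt.Printf("%#04x\n", uint16(EqualBitMask(a, b)))
	fmt.Println(EqualVectorMask(a, b).ToBits() == EqualBitMask(a, b))
}
```

A platform-specific mask type would pick whichever representation is native, so conversions like ToBits would only be needed at the boundaries.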

  • Some operations need parameters that are known at compile time and within a certain range. For example, SSHR (shift right) on NEON takes an immediate n that needs to be restricted to between 1 and 8 (see vshr_n_s8). I ran into some problems where, during the build of the compiler, n would resolve to 0 (the default value of an int passed as a parameter) and it would crash the program.

I believe we could have some way to check at compile time that these values are within a certain range, like a static_assert in C++ or checks on the AST.
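Until something like that exists, two workarounds are possible in today's Go, sketched here under assumptions. For a named constant, converting a negative constant to uint is a compile error, which gives a crude static assertion. For an ordinary parameter, only a runtime check is possible. The type Int8x8 and the function ShiftRightS8x8 are hypothetical names modeled on vshr_n_s8.

```go
package main

import "fmt"

// shift is a named constant, so its range can be asserted at compile time.
const shift = 3

// Converting a negative constant to uint does not compile, so these two
// declarations reject any shift outside [1, 8] at build time.
const (
	_ = uint(shift - 1) // compile error if shift < 1
	_ = uint(8 - shift) // compile error if shift > 8
)

// Int8x8 models a 64-bit vector of eight signed bytes.
type Int8x8 [8]int8

// ShiftRightS8x8 is a scalar reference for an arithmetic shift right by n.
// Because n is an ordinary parameter, the range can only be checked at runtime.
func ShiftRightS8x8(v Int8x8, n int) (Int8x8, error) {
	if n < 1 || n > 8 {
		return Int8x8{}, fmt.Errorf("shift %d out of range [1, 8]", n)
	}
	var out Int8x8
	for i, x := range v {
		out[i] = x >> uint(n)
	}
	return out, nil
}

func main() {
	r, err := ShiftRightS8x8(Int8x8{0: -128, 1: 64}, shift)
	fmt.Println(r, err)
	_, err = ShiftRightS8x8(Int8x8{}, 0)
	fmt.Println(err)
}
```

The constant trick only covers named constants, which is why a check in the compiler itself would still be needed for intrinsic arguments.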

Why is it Important?

Without SIMD, we are missing out on a lot of potential optimizations. Here is a non-exhaustive list of concrete things that could improve performance in daily life scenarios:

Furthermore, it would make these currently existing packages more portable and maintainable:

There are obviously way more applications of SIMD but I am just trying to say that this is useful in practical scenarios.
