Proposal: inline assembly
Author: Iskander Sharipov
With input from Ilya Tocar.
Last updated: 9 August, 2018
Abstract
This proposal describes how an inline assembly feature can be integrated into
the Go language in a backwards-compatible way and without any syntax extensions.
Users who do not write or maintain assembly, or who are not interested in raw clock
performance, would not see any difference.
Background
Right now, the only way to get high performance for a CPU-bound operation is to
write an assembly implementation that uses the latest available instructions (with an appropriate
run-time CPU flag switch and a fallback to something more conservative).
Sometimes the performance advantage of the assembly version is astonishing;
for functions like bytes.IndexByte the speedup ranges from roughly 8x to roughly 29x:
name            old time/op    new time/op    delta
IndexByte/32-8  32.2ns ± 0%    4.1ns ± 0%     -87.14%  (p=0.000 n=9+10)
IndexByte/4K-8  2.43µs ± 0%    0.08µs ± 2%    -96.55%  (p=0.000 n=10+10)

name            old speed      new speed       delta
IndexByte/32-8  993MB/s ± 0%   7724MB/s ± 0%   +677.74%   (p=0.000 n=9+9)
IndexByte/4K-8  1.68GB/s ± 0%  48.80GB/s ± 2%  +2801.13%  (p=0.000 n=10+10)
The old column is the portable pure Go version and the new column is assembly code with AVX2.
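For orientation, a portable pure Go version could look roughly like the sketch below. The name indexByte and the byte-at-a-time loop are assumptions made for illustration; this is not the code that was benchmarked above.

```go
package main

import "fmt"

// indexByte is a hypothetical stand-in for a portable implementation:
// it scans b one byte at a time and returns the index of the first
// occurrence of c, or -1 if c is not present.
func indexByte(b []byte, c byte) int {
	for i, x := range b {
		if x == c {
			return i
		}
	}
	return -1
}

func main() {
	fmt.Println(indexByte([]byte("inline assembly"), 'a'))
	fmt.Println(indexByte([]byte("inline assembly"), 'z'))
}
```

A vectorized version would compare many bytes per instruction instead of one per loop iteration, which is consistent with the gap growing on the larger 4K input.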
Other cases are addressed with an increasing number of intrinsified functions.
The downside is that they pollute the compiler and speed up only a finite
set of functions, so this is not a general enough solution.
Here, intrinsics means functions like math.Sqrt.
The advantage of Go intrinsics is that they can be inlined, unlike
manually written assembly functions. This leads to a question: what if
there were a way to describe an ordinary Go function (hence, inlineable) that
uses machine instructions explicitly? This can address all the problems described above:
- It's scalable. Users may define their own intrinsics if they really need to.
- No need to clutter the compiler internals with intrinsic definitions; they
can be defined as normal functions inside Go sources.
This reduces the burden on the Go compiler maintainers.
- Writing these functions is less error-prone than writing hundreds of lines of
assembly code. They are also easier to maintain and test.
- It fulfills the inlineable assembler feature requests, like issue17373 and issue4978.
This proposal describes how to introduce that facility into the language without
breaking changes and as unobtrusively as possible.
Proposal
This document proposes a single new Go function, unsafe.Asm, defined as:
func Asm(opcode string, dst interface{}, args ...interface{})
This function is the low-level mechanism for Go programmers to inject
machine-dependent code right into the function body at the unsafe.Asm call site.
For example, this line of code results in a single MOVQ $10, AX instruction:
unsafe.Asm("MOVQ", "AX", 10)
It can be used to build a higher-level, intrinsic-like API.
The best part is that such an API can be implemented as a third-party library.
Like other arch-dependent code, unsafe.Asm should be protected by a build
tag or an appropriate filename suffix, like _amd64.
The unsafe package is preferable, because:
- Inline assembly, just like normal assembly, is unsafe.
- unsafe.Pointer can be useful when dealing with memory operands.
- It explicitly states that it may not be as backwards-compatible as
other Go packages.
unsafe.Asm arguments
opcode refers to the instruction name for the host machine.
All opcodes are in Go assembler syntax and require size suffixes.
It's also possible to pass opcode suffixes along with the instruction name.
These suffixes should be separated by a period, just like in ordinary Go asm.
dst accepts any assignable Go value, with the exception of compound expressions
like index expressions and function calls that return a pointer. One can use
temporary variables and/or address taking to overcome this limitation.
args are more permissive than dst and also accept integer and floating-point
constants for immediates, as well as more complex Go expressions that yield
a value that is permitted for unsafe.Asm arguments.
The permitted values include all numeric types except complex numbers.
A value must fit in a hardware register, so its size cannot exceed the size of int.
On 32-bit platforms, 64-bit types can't be used.
For all other values, pointers should be used.
Pointer types (including unsafe.Pointer) force memory operand interpretation.
Non-pointer types follow default Go value semantics.
var x int64
unsafe.Asm("MOVQ", x, 10) // MOVQ x(SP), AX; MOVQ $10, AX
unsafe.Asm("MOVQ", &x, 10) // LEAQ x(SP), AX; MOVQ $10, (AX)
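The operand rules above can be sketched as a small run-time classifier. This is only an illustration: classifyOperand, operandKind and wordSize are hypothetical names, and a real implementation would perform these checks on types at compile time rather than with reflect. Treating strings as explicit register names is also an assumption, based on the "AX" and "Y3" examples in this proposal.

```go
package main

import (
	"fmt"
	"reflect"
	"unsafe"
)

// operandKind is a hypothetical label for how an operand would be treated.
type operandKind string

const (
	operandNone     operandKind = "none"     // nil: no explicit operand
	operandRegister operandKind = "register" // value fits in a register
	operandMemory   operandKind = "memory"   // pointer: memory operand
	operandInvalid  operandKind = "invalid"  // not permitted
)

// wordSize is the size in bytes of a pointer-sized integer on this platform.
const wordSize = unsafe.Sizeof(uintptr(0))

// classifyOperand applies the rules from the text: pointers force a memory
// operand, non-complex numeric values must fit in a register, and everything
// else is rejected.
func classifyOperand(v interface{}) operandKind {
	if v == nil {
		return operandNone
	}
	t := reflect.TypeOf(v)
	switch t.Kind() {
	case reflect.Ptr, reflect.UnsafePointer:
		return operandMemory
	case reflect.String:
		// Assumption: a string names a register explicitly; it is not validated here.
		return operandRegister
	case reflect.Int, reflect.Int8, reflect.Int16, reflect.Int32, reflect.Int64,
		reflect.Uint, reflect.Uint8, reflect.Uint16, reflect.Uint32, reflect.Uint64,
		reflect.Uintptr, reflect.Float32, reflect.Float64:
		if t.Size() <= wordSize {
			return operandRegister
		}
		return operandInvalid // e.g. a 64-bit value on a 32-bit platform
	default:
		return operandInvalid // complex numbers, structs, slices, ...
	}
}

func main() {
	var x int32
	fmt.Println(classifyOperand(int32(10)))
	fmt.Println(classifyOperand(&x))
	fmt.Println(classifyOperand(unsafe.Pointer(&x)))
	fmt.Println(classifyOperand(complex(1, 2)))
}
```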
Note that the dst/src order follows Go conventions, not the assembly language convention:
the destination goes first, then the sources. This also makes the destination
parameter more distinguishable inside the unsafe.Asm signature.
As a special case, instructions that have no explicit arguments use a nil
destination:
unsafe.Asm("SFENCE", nil)
Comparison-like instructions that are usually used to update flags and do not have
an explicit destination also use a nil destination argument:
// Compare `x` with 1; updates flags.
unsafe.Asm("CMPQ", nil, 1, x)
See Efficient control flow for more details.
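To make the argument order concrete, the sketch below renders a call in Go assembler operand order: sources in the order given, destination last. asmText and formatOperand are hypothetical helpers, and only explicit register names and integer immediates are handled. How a nil-destination comparison such as CMPQ maps to assembler operand order is not spelled out here, so it is deliberately left out.

```go
package main

import (
	"fmt"
	"strings"
)

// formatOperand renders a single operand. Integer constants become
// immediates ($N); strings are taken as register names and used as-is.
func formatOperand(v interface{}) string {
	switch v := v.(type) {
	case int:
		return fmt.Sprintf("$%d", v)
	case string:
		return v
	default:
		return fmt.Sprintf("%v", v)
	}
}

// asmText mirrors the argument order of the proposed unsafe.Asm
// (destination first) and prints Go assembler order (destination last).
// A nil destination is omitted.
func asmText(opcode string, dst interface{}, args ...interface{}) string {
	operands := make([]string, 0, len(args)+1)
	for _, a := range args {
		operands = append(operands, formatOperand(a))
	}
	if dst != nil {
		operands = append(operands, formatOperand(dst))
	}
	if len(operands) == 0 {
		return opcode
	}
	return opcode + " " + strings.Join(operands, ", ")
}

func main() {
	fmt.Println(asmText("MOVQ", "AX", 10))         // MOVQ $10, AX
	fmt.Println(asmText("ROUNDSD", "X0", 3, "X0")) // ROUNDSD $3, X0, X0
	fmt.Println(asmText("SFENCE", nil))            // SFENCE
}
```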
Guarantees
It is important to clearly describe the guarantees that a programmer may rely on.
- The order of unsafe.Asm calls is deterministic;
these calls can't be scheduled somewhere else.
This means that a sequence of unsafe.Asm calls is executed in the order they
appear in the source code.
- CPU flags are preserved between unsafe.Asm calls, and unsafe.Asm itself
is marked as a flag-clobbering operation.
- Explicitly allocated registers are not clobbered by the Go compiler.
Efficient control flow
There is no JMP support because inlined assembler does not see Go labels.
In order to make writing efficient programs possible,
SSA backends can recognize this operation sequence and produce optimal code:
var found bool // 1. Some bool variable.
unsafe.Asm("VPTEST", nil, "Y3", "Y3") // 2. Some flag-generating operation.
unsafe.Asm("SETNE", found) // 3. Flags assignment to bool variable.
if found { // 4. Branching using that bool variable.
// Body to be executed (hint: can use goto to Go label here).
}
SETNE can be eliminated, as well as the read of the found variable.
The generated machine code becomes close to what is produced from hand-written assembly.
Error reporting
There are different kinds of programming errors that may occur when using unsafe.Asm.
The Go compiler frontend, gc, can catch invalid opcodes and obviously
wrong operand types. For example, the JAVA opcode does not exist and will
result in a compile-time error triggered from gc. Operands
are checked using generic rules that are shared among all instructions.
Most other errors are generated by assembler backends.
For AMD64, that backend is cmd/internal/obj/x86.
This is a direct consequence of the opaqueness of the asm ops during compilation.
That property reduces the amount of code needed to implement inline assembly,
but it does delay error reporting, leading to somewhat more cryptic error messages.
In turn, this may be a good opportunity to improve assembler error reporting.
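The split between early and late checks can be illustrated with a toy model. Everything in it is hypothetical: the opcode table, the function names, and the use of an operand count as a stand-in for the much richer operand validation that a real assembler backend performs.

```go
package main

import (
	"errors"
	"fmt"
)

// knownOpcodes is a tiny, made-up opcode table mapping an opcode to the
// number of operands it takes (destination included).
var knownOpcodes = map[string]int{
	"MOVQ":    2,
	"ROUNDSD": 3,
	"SFENCE":  0,
}

var (
	errUnknownOpcode = errors.New("unknown opcode")
	errBadOperands   = errors.New("invalid operand combination")
)

// checkFrontend models the early check: only the opcode name is validated.
func checkFrontend(opcode string) error {
	if _, ok := knownOpcodes[opcode]; !ok {
		return fmt.Errorf("%w: %s", errUnknownOpcode, opcode)
	}
	return nil
}

// checkBackend models the late check: the operands are validated against
// the instruction. Here only the operand count is compared.
func checkBackend(opcode string, operandCount int) error {
	if want := knownOpcodes[opcode]; want != operandCount {
		return fmt.Errorf("%w: %s takes %d operands, got %d",
			errBadOperands, opcode, want, operandCount)
	}
	return nil
}

// check runs both stages in pipeline order and stops at the first error.
func check(opcode string, operandCount int) error {
	if err := checkFrontend(opcode); err != nil {
		return err
	}
	return checkBackend(opcode, operandCount)
}

func main() {
	fmt.Println(check("JAVA", 0))   // rejected early: the opcode does not exist
	fmt.Println(check("MOVQ", 3))   // rejected late: wrong operands
	fmt.Println(check("SFENCE", 0)) // accepted
}
```

In this model, the opcode error surfaces before any operand information is consulted, which mirrors why operand errors are reported later and with less source context.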
Example
Given the intrinsified math.Trunc function, we can try to define an AMD64 version
without direct compiler support.
package example
import (
"math"
"unsafe"
)
func trunc1(x float64) float64 {
return math.Trunc(x)
}
func trunc2(x float64) float64 {
unsafe.Asm("ROUNDSD", x, 3, x)
return x
}
trunc1 and trunc2 generate the same code sequence:
MOVSD x(SP), X0
ROUNDSD $3, X0, X0
MOVSD X0, ret+(SP)
The only difference is that trunc1 does a runtime.support_sse41 check,
which can be done inside trunc2 as well.
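A feature check inside trunc2 might be structured as in the sketch below. Since unsafe.Asm does not exist, the fast path is a placeholder that calls math.Trunc. hasSSE41 is a hypothetical package-level flag standing in for the run-time check, and truncPortable is an assumed fallback, not code from the Go runtime.

```go
package main

import (
	"fmt"
	"math"
)

// hasSSE41 is a hypothetical stand-in for a run-time CPU feature flag.
// A real program would set it from CPU detection at startup.
var hasSSE41 = false

// truncPortable truncates toward zero without depending on ROUNDSD.
func truncPortable(x float64) float64 {
	if x < 0 {
		return math.Ceil(x)
	}
	return math.Floor(x)
}

// truncFast is a placeholder for the inline assembly path. Under this
// proposal its body would be: unsafe.Asm("ROUNDSD", x, 3, x); return x
func truncFast(x float64) float64 {
	return math.Trunc(x)
}

// trunc2 selects the fast path only when the CPU feature is available.
func trunc2(x float64) float64 {
	if hasSSE41 {
		return truncFast(x)
	}
	return truncPortable(x)
}

func main() {
	fmt.Println(trunc2(2.7), trunc2(-2.7))
	hasSSE41 = true
	fmt.Println(trunc2(2.7), trunc2(-2.7))
}
```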
Compatibility
The API changes are fully backwards compatible.
Implementation
Most of the work would be done by the author of this proposal.
The initial implementation will include AMD64 support for unsafe.Asm code generation.
Other backends can adopt the ideas of that implementation to add support for the missing architectures.
Go parts that need modifications:
- unsafe: new function, Asm
- cmd/compile/internal/gc: unsafe.Asm typechecking and SSA generation
- cmd/compile/internal/ssa: changes to regalloc plus new asm-related ops
- cmd/compile/internal/amd64: code generation for unsafe.Asm-generated ops
- cmd/asm/internal: parser is used to parse asm operand strings
Additional notes
The initial implementation prototype gives 85-100% of hand-written assembly code performance.
There is some room for improvement, especially for memory operations, which
can bump the lower bound closer to 90-95%. The remaining performance difference is mostly
due to advanced branching tricks used in some assembly code and more efficient
code layout and register usage.
Open questions
How to express write-only destination operands to avoid extra zeroing?
Proposed solution: ?
What about gccgo and other Go implementations?
Proposed solution: we can probably start by not permitting unsafe.Asm
in compilers that do not support it.
How to express multi-output instructions?
Proposed solution A: interpret an []interface{} argument as a multi-value destination.
var quo, rem uint8
// Note that IDIV expects first operand to be in AX.
unsafe.Asm("MOVB", "AX", uint8(x))
unsafe.Asm("IDIV", []interface{}{quo, rem}, uint8(y))
// AL is moved to quo.
// AH is moved to rem.
Note that []interface{} causes no allocations and is consumed at compile time.
This is consistent with how unsafe.Sizeof works.
Proposed solution B: add an unsafe.Asm2 function that has 2 destination arguments.
func Asm2(opcode string, dst1, dst2 interface{}, args ...interface{})
Activity
cznic commented on Aug 9, 2018
If considered to be accepted, I think the signature should be
ghost commented on Aug 9, 2018
@cznic Why should there be a return error value? In what cases would an error be deferred from compile time to run time?
cznic commented on Aug 9, 2018
Scratch the return value in my post, IDK what I was thinking. What I really wanted to say is that the arguments and all variations of arguments (Asm2, Asm3, ...) should be replaced by just a string. There are more things that are needed in assembler code than just instructions. For example directives, declarations and even comments are sometimes a must have.
quasilyte commented on Aug 9, 2018
@cznic for a single string argument, I have these questions:
dst argument? There can be 0, 1 or more of them. Without this info, it's impossible to model data flow properly in SSA regalloc.
unsafe.Asm("LEAQ", "AX", &a[0]).
Note that most of the time, one can use Go variables without having to specify registers.
The only notable exception is vector registers like X/Y/Z on AMD64. The programmer has to use them directly. For scalars and pointers, there is no need to spell registers by name; regalloc will do that for you.
quasilyte commented on Aug 9, 2018
This is out of scope of this proposal.
At least this was my initial goal: make it possible to use SIMD inside Go loops without having to write whole function in asm.
Another important case is getting rid of special treatment of intrinsified functions inside the compiler.
Just use Go comments.
A single unsafe.Asm encodes a single instruction.
as commented on Aug 9, 2018
How will this proposal ensure that the assembly is correct at compile time rather than run time? Across architectures?
I think containment is extremely useful when dealing with platform-specific code. How does the feature benefit the maintainer of the codebase? It is easy to tell where an assembly function is called, whereas in this scenario it would be difficult to see where it is being used.
I'm confused about the end goal. We would use this inside of loops, so we don't have to use them inside pure assembly functions? I would rather have a function that implements the loop inside of it rather than invoke the instructions within the loop. Are there any other advantages of doing it this way other than convenience for the writer?
quasilyte commented on Aug 9, 2018
What do you mean by "assembly is correct"?
If you mean correct as in the assembly code just "assembles correctly", then it's the asm backend's responsibility. unsafe.Asm produces an SSA value that is turned into a matching obj.Prog object after optimization passes. These are handled by the asm backend as usual.
Could you clarify, please?
unsafe.Asm is as portable as normal asm (read: not portable at all). If one wants several implementations inside one loop, it's still possible to wrap SIMD instruction calls into a function (that function will be inlined, so no performance penalties there).
It's possible to write a portable 3rd-party library that gives such primitives as cross-platform SIMD operations. The advantage is that they can be inlineable, which makes them more composable than pure asm alternatives (where the user always pays for the function call).
Making it possible to get rid of "intrinsics" in the compiler and to implement them without so much special casing.
as commented on Aug 9, 2018
For context, this is where it was unclear:
If I have an assembly function that contains an invalid or unsupported instruction, and I run go build, I will get an error and no binary will be produced. If the same scenario occurs in this proposal, what will happen when the user runs go build?
billotosyr commented on Aug 9, 2018
Inline asm is a bad idea in my opinion. In C/C++ it leads to run-on sections like..
#elif defined(i386)
asm ...
#elif defined(x86_64) || defined(amd64)
asm ...
#elif defined(powerpc) || defined(ppc)
asm ...
#elif defined(s390x)
asm ...
#elif defined(sparc)
asm ...
#elif defined(ia64)
asm ...
You indicated that you can protect the code with a build tag, but that only means users of other architectures won't have access to the code at all. In truth, most of the time the inline asm will only be written for amd64, which will make for huge porting problems to other architectures.
The way things are now, asm is really only used (other than in the Go runtime itself) for accelerating code that has already been written in Go. Because it's written in Go it's portable. Inline asm will destroy the admirable portability of the Go language.
It also destroys readability.
quasilyte commented on Aug 9, 2018
All errors happen in the same way: unsafe.Asm("FOO", nil) results in invalid instruction during go build. Same for invalid arguments.
Suppose this is the compilation pipeline:
unsafe.Asm is replaced with an OpAsm SSA value during the FE -> BE transition (gc/ssa.go); this catches invalid opcodes.
After BE finishes optimizations and lowering, the BE -> assembler transformation produces obj.Prog lists, which are then verified by the asm backends. This catches all other errors, like invalid argument combinations.
as commented on Aug 9, 2018
Does anything prevent a user from separating the opcode from the call string by using a constant, such as:
const myInstruction = "MOVQ"
TocarIP commented on Aug 9, 2018
@billotosyr you already can write asm-only function without any go fallback, but I don't think this happens now.
quasilyte commented on Aug 9, 2018
In the prototype I've rolled, no. Any constant string will do.
I believe this property does not make things worse.
The intention is to provide a very minimalistic API that makes it possible to write a less error-prone intrinsic-like library as a 3rd-party package. For MOVQ, we can have these signatures:
The other way is to provide named constants in the github.com/foobar/x86 package:
Other benefits that came to my mind:
unsafe.Asm has quite a straightforward signature and can be verified for semantics with tools like staticcheck.
gc compiler support, that is.