proposal: unsafe: inline assembly with unsafe.Asm function

# Proposal: inline assembly

Author: Iskander Sharipov

_With input from Ilya Tocar._

Last updated: 9 August, 2018

## Abstract

This proposal describes how an inline assembly feature can be integrated into
the Go language in a backwards-compatible way and without any syntax extensions.

Users who do not write or maintain assembly, or who are not interested in raw clock
performance, would not see any difference.

## Background

Right now the only way to get high performance for a CPU-bound operation is to
write an assembly implementation using the latest instructions available (with appropriate
run-time CPU flag checks and fallbacks to something more conservative).

Sometimes the performance advantages of the assembly version are astonishing.
For functions like `bytes.IndexByte`, the speedup is roughly 8x to 29x:

```
name            old time/op    new time/op     delta
IndexByte/32-8    32.2ns ± 0%      4.1ns ± 0%    -87.14%  (p=0.000 n=9+10)
IndexByte/4K-8    2.43µs ± 0%     0.08µs ± 2%    -96.55%  (p=0.000 n=10+10)

name            old speed      new speed       delta
IndexByte/32-8   993MB/s ± 0%   7724MB/s ± 0%   +677.74%  (p=0.000 n=9+9)
IndexByte/4K-8  1.68GB/s ± 0%  48.80GB/s ± 2%  +2801.13%  (p=0.000 n=10+10)
```

Here `old` is the portable pure Go version and `new` is assembly code with `AVX2`.
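
For context, the portable side of such a comparison is essentially a loop that inspects one byte per iteration, whereas an `AVX2` version can compare 32 bytes at a time. A minimal sketch of the portable shape follows; the name `indexByteGeneric` is illustrative and the body is not copied from the standard library.

```go
package main

import (
	"bytes"
	"fmt"
)

// indexByteGeneric is a portable byte-at-a-time search.
// It is written for illustration only (hypothetical name).
func indexByteGeneric(s []byte, c byte) int {
	for i, b := range s {
		if b == c {
			return i
		}
	}
	return -1
}

func main() {
	data := []byte("inline assembly")
	// Both calls report the index of the first 'a' in data.
	fmt.Println(indexByteGeneric(data, 'a'), bytes.IndexByte(data, 'a'))
}
```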

Other cases are addressed with an increasing number of intrinsified functions.
The downside is that they pollute the compiler and speed up only a finite
set of intrinsified functions. This is not a general enough solution.

> When referring to intrinsics, functions like `math.Sqrt` are implied.

The advantage of Go intrinsics is that they can be inlined, unlike
manually written assembly functions. This leads to a question: what if
there was a way to write an ordinary (hence, inlineable) Go function that
uses machine instructions explicitly? This can address all the problems described above:

* It's scalable. Users may define their own intrinsics if they really need to.
* No need to clutter the compiler internals with intrinsic definitions; they
   can be defined as normal functions inside Go sources.
   This reduces the burden on the Go compiler maintainers.
* Writing these functions is less error-prone than writing hundreds of lines of
   assembly code. They are also easier to maintain and test.
* It fulfills the requests for inlineable assembly, like [issue17373](https://github.com/golang/go/issues/17373) and [issue4978](https://github.com/golang/go/issues/4978).

This proposal describes how to introduce that facility into the language without
breaking changes and as unobtrusively as possible.

## Proposal

This document proposes a single new Go function, `unsafe.Asm`, defined as:

```go
func Asm(opcode string, dst interface{}, args ...interface{})
```

This function is the low-level mechanism for Go programmers to inject
machine-dependent code right into the function body at the `unsafe.Asm` call site.

For example, this line of code results in a single `MOVQ $10, AX` instruction:

```go
unsafe.Asm("MOVQ", "AX", 10)
```

It can be used to build a higher-level, intrinsic-like API.
The best part is that such an API can be implemented as a third-party library.

Like other arch-dependent code, `unsafe.Asm` should be protected by a build
tag or an appropriate filename suffix, like `_amd64`.

The `unsafe` package is preferable, because:

1. Inline assembly, just like normal assembly, is unsafe.
2. `unsafe.Pointer` can be useful when dealing with memory operands.
3. The package explicitly states that it may not be as backwards-compatible as
   other Go packages.

### unsafe.Asm arguments

`opcode` refers to the instruction name for the host machine.
All opcodes are in Go assembler syntax and require size suffixes.
It's also possible to pass opcode suffixes along with the instruction name.
These suffixes should be separated by a period, just like in ordinary Go asm.

`dst` accepts any assignable Go value, with the exception of compound expressions
such as index expressions and calls to functions that return a pointer.
Temporary variables and/or address-taking can be used to work around this limitation.

`args` are more permissive than `dst`: they also accept integer and floating-point
constants for immediates, as well as more complex Go expressions that yield
a value permitted for `unsafe.Asm` arguments.

The permitted values include all numeric types except complex numbers.
A value must fit in a hardware register, so its size must not exceed `unsafe.Sizeof(int(0))`.
On 32-bit platforms, 64-bit types therefore can't be used.
For all other values, pointers should be used.

Pointer types (including `unsafe.Pointer`) force memory operand interpretation.
Non-pointer types follow default Go value semantics.

```go
var x int64
unsafe.Asm("MOVQ", x, 10)  // MOVQ x(SP), AX; MOVQ $10, AX
unsafe.Asm("MOVQ", &x, 10) // LEAQ x(SP), AX; MOVQ $10, (AX)
```
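
The operand rules above can be modeled at run time with reflection. The sketch below is only an illustration: `operandKind` and `classify` are hypothetical names, the proposal performs the equivalent checks at compile time, and treating strings as register names and `bool` as a permitted value is inferred from the examples in this document rather than stated as a rule.

```go
package main

import (
	"fmt"
	"reflect"
	"unsafe"
)

// operandKind is a hypothetical classification used only in this sketch.
type operandKind string

const (
	none     operandKind = "none"     // nil destination
	register operandKind = "register" // explicit register name such as "AX"
	memory   operandKind = "memory"   // pointer operands
	value    operandKind = "value"    // register-sized Go values
	invalid  operandKind = "invalid"  // anything else
)

// wordSize is the size of a general-purpose register on the host platform.
const wordSize = unsafe.Sizeof(uintptr(0))

// classify models the argument rules with reflection.
func classify(arg interface{}) operandKind {
	if arg == nil {
		return none
	}
	v := reflect.ValueOf(arg)
	switch v.Kind() {
	case reflect.String:
		return register
	case reflect.Ptr, reflect.UnsafePointer:
		return memory // pointers force memory operand interpretation
	case reflect.Bool,
		reflect.Int, reflect.Int8, reflect.Int16, reflect.Int32, reflect.Int64,
		reflect.Uint, reflect.Uint8, reflect.Uint16, reflect.Uint32, reflect.Uint64,
		reflect.Uintptr, reflect.Float32, reflect.Float64:
		if v.Type().Size() > wordSize {
			return invalid // e.g. a 64-bit type on a 32-bit platform
		}
		return value
	default:
		return invalid // complex numbers, structs, slices, and so on
	}
}

func main() {
	var x int32
	fmt.Println(classify("AX"), classify(&x), classify(x), classify(complex(1, 2)))
}
```

Reflection cannot tell a constant from a variable, so the immediate-versus-value distinction is outside what this model can express.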

Note that the dst/src order follows Go conventions, not the assembly language convention:
the destination goes first, then the sources. This also helps to make the destination
parameter more distinguishable inside the `unsafe.Asm` signature.

As a special case, instructions that have no explicit arguments use a `nil` destination:

```go
unsafe.Asm("SFENCE", nil)
```

Comparison-like instructions that are usually used to update flags and do not have
an explicit destination also use a `nil` destination argument:

```go
// Compare `x` with 1; updates flags.
unsafe.Asm("CMPQ", nil, 1, x)
```
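
To make the ordering concrete, here is a small sketch that renders an `unsafe.Asm`-style argument list as a Go assembler line. It assumes that sources are emitted in the order given and the destination goes last, which matches the `ROUNDSD` listing later in this document but is otherwise an assumption. The `render` and `operand` helpers are hypothetical and handle only register-name strings and `int` immediates.

```go
package main

import (
	"fmt"
	"strings"
)

// operand formats one argument: strings are taken as register names,
// ints become immediates. Other types are printed as a placeholder.
func operand(a interface{}) string {
	switch v := a.(type) {
	case string:
		return v
	case int:
		return fmt.Sprintf("$%d", v)
	default:
		return fmt.Sprintf("<%T>", v)
	}
}

// render prints sources first and the destination last (Go asm order).
// A nil destination is omitted.
func render(opcode string, dst interface{}, args ...interface{}) string {
	operands := make([]string, 0, len(args)+1)
	for _, a := range args {
		operands = append(operands, operand(a))
	}
	if dst != nil {
		operands = append(operands, operand(dst))
	}
	if len(operands) == 0 {
		return opcode
	}
	return opcode + " " + strings.Join(operands, ", ")
}

func main() {
	fmt.Println(render("MOVQ", "AX", 10))
	fmt.Println(render("SFENCE", nil))
	fmt.Println(render("VPTEST", nil, "Y3", "Y3"))
}
```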

See [Efficient control flow](#efficient-control-flow) for more details.

### Guarantees

It is important to clearly describe the guarantees that the programmer may rely on.

* The order of `unsafe.Asm` calls is deterministic;
  these calls can't be scheduled elsewhere.
  This means that a sequence of `unsafe.Asm` calls is executed in the order in
  which they appear in the source code.
* CPU flags are preserved between `unsafe.Asm` calls, and `unsafe.Asm` itself
  is marked as a flag-clobbering operation.
* Explicitly allocated registers are not clobbered by the Go compiler.

### Efficient control flow

There is no `JMP` support because inline assembly does not see Go labels.

To make writing efficient programs possible,
SSA backends can recognize the following operation sequence and produce optimal code:

```go
var found bool                        // 1. Some bool variable.
unsafe.Asm("VPTEST", nil, "Y3", "Y3") // 2. Some flag-generating operation.
unsafe.Asm("SETNE", found)            // 3. Flags assignment to bool variable.
if found {                            // 4. Branching using that bool variable.
	// Body to be executed (hint: can use goto to Go label here).
}
```

`SETNE` can be eliminated, as can the read of the `found` variable.
The generated machine code then becomes close to what hand-written assembly would produce.
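
The rewrite described above can be modeled as a small peephole pass over a toy instruction list. This is only a sketch: the `op` type, the `BRANCH_IF` pseudo-op, and `fuse` are hypothetical and do not correspond to actual SSA rules in the compiler.

```go
package main

import "fmt"

// op is a toy instruction; it is not the compiler's SSA representation.
type op struct {
	name string // e.g. "VPTEST", "SETNE", "BRANCH_IF"
	arg  string // variable written by SETcc or read by BRANCH_IF
}

// setToJump maps a SETcc opcode to the conditional jump that tests the same flags.
var setToJump = map[string]string{"SETNE": "JNE", "SETEQ": "JEQ"}

// fuse rewrites "SETcc v; BRANCH_IF v" into a single conditional jump.
// It assumes v has no other uses, which a real pass would have to prove.
func fuse(ops []op) []op {
	var out []op
	for i := 0; i < len(ops); i++ {
		if i+1 < len(ops) {
			jcc, ok := setToJump[ops[i].name]
			next := ops[i+1]
			if ok && next.name == "BRANCH_IF" && next.arg == ops[i].arg {
				out = append(out, op{name: jcc})
				i++ // the branch is folded into the jump
				continue
			}
		}
		out = append(out, ops[i])
	}
	return out
}

func main() {
	in := []op{{"VPTEST", ""}, {"SETNE", "found"}, {"BRANCH_IF", "found"}}
	for _, o := range fuse(in) {
		fmt.Println(o.name)
	}
}
```

After the pass, the flag-generating operation is followed directly by the conditional jump, so neither the `SETNE` nor the load of `found` remains.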

### Error reporting

Different kinds of programming errors may occur when using `unsafe.Asm`.

The Go compiler frontend, `gc`, can catch invalid opcodes and obviously
wrong operand types. For example, the `JAVA` opcode does not exist and will
result in a compile-time error reported by `gc`. Operands
are checked using generic rules that are shared among all instructions.

Most other errors are generated by assembler backends.
For `AMD64`, that backend is `cmd/internal/obj/x86`.

This is a direct consequence of the opaqueness of the asm ops during compilation.
That property reduces the amount of code needed to implement inline assembly,
but it delays error reporting, leading to somewhat more cryptic error messages.
In turn, this may be a good opportunity to improve assembler error reporting.
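
A frontend-style opcode check could look roughly like the sketch below. The `knownOpcodes` table and `checkOpcode` are hypothetical stand-ins for the assembler's real opcode table and the compiler's actual validation. Period-separated suffixes are split off but deliberately left unchecked, mirroring the idea that finer-grained errors are deferred to the backend.

```go
package main

import (
	"fmt"
	"strings"
)

// knownOpcodes is a tiny stand-in for the assembler's real opcode table.
var knownOpcodes = map[string]bool{
	"MOVQ": true, "CMPQ": true, "ROUNDSD": true, "VADDPD": true,
}

// checkOpcode splits "NAME.SUFFIX..." on periods and rejects unknown names.
// Suffixes are returned without validation.
func checkOpcode(opcode string) (name string, suffixes []string, err error) {
	parts := strings.Split(opcode, ".")
	name, suffixes = parts[0], parts[1:]
	if !knownOpcodes[name] {
		return "", nil, fmt.Errorf("unknown opcode %q", name)
	}
	return name, suffixes, nil
}

func main() {
	fmt.Println(checkOpcode("VADDPD.Z"))
	fmt.Println(checkOpcode("JAVA"))
}
```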

### Example

Given the intrinsified `math.Trunc` function, we can try to define an `AMD64` version
without direct compiler support.

```go
package example

import (
	"math"
	"unsafe"
)

func trunc1(x float64) float64 {
	return math.Trunc(x)
}

func trunc2(x float64) float64 {
	unsafe.Asm("ROUNDSD", x, 3, x)
	return x
}
```

`trunc1` and `trunc2` generate the same code sequence:

```
MOVSD	x(SP), X0
ROUNDSD	$3, X0, X0
MOVSD	X0, ret+(SP)
```

The only difference is that `trunc1` performs a `runtime.support_sse41` check,
which can be done inside `trunc2` as well.
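
A feature check with a portable fallback might be arranged as in the sketch below. Everything here is an assumption for illustration: `hasSSE41` is a hypothetical flag that a real program would set from CPUID, and `truncSSE41` delegates to `math.Trunc` so that the sketch compiles today, since `unsafe.Asm` exists only in this proposal.

```go
package main

import (
	"fmt"
	"math"
)

// hasSSE41 is a hypothetical feature flag, normally initialized from CPUID.
var hasSSE41 = false

// truncSSE41 stands in for the inline assembly version. Under this proposal
// its body would be:
//
//	unsafe.Asm("ROUNDSD", x, 3, x)
//	return x
//
// It delegates to math.Trunc here so that the sketch compiles.
func truncSSE41(x float64) float64 {
	return math.Trunc(x)
}

// truncGeneric is a portable fallback built on math.Modf, which returns
// the integer part of x as its first result.
func truncGeneric(x float64) float64 {
	whole, _ := math.Modf(x)
	return whole
}

// trunc2 picks the fast path only when the CPU feature is available.
func trunc2(x float64) float64 {
	if hasSSE41 {
		return truncSSE41(x)
	}
	return truncGeneric(x)
}

func main() {
	fmt.Println(trunc2(2.7), trunc2(-2.7))
}
```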

## Compatibility

The API changes are fully backwards compatible.

## Implementation

Most of the work would be done by the author of this proposal.

The initial implementation will include `AMD64` support for `unsafe.Asm` code generation.<br>
Other backends can adopt the ideas of that implementation to add support for the missing architectures.

Go parts that need modifications:

* `unsafe`: new function, `Asm`
* `cmd/compile/internal/gc`: `unsafe.Asm` typechecking and SSA generation
* `cmd/compile/internal/ssa`: changes to `regalloc` plus new asm-related ops
* `cmd/compile/internal/amd64`: code generation for `unsafe.Asm`-generated ops
* `cmd/asm/internal`: parser is used to parse asm operand strings

## Additional notes

The initial implementation prototype reaches **85-100%** of the performance of hand-written assembly code.
There is some room for improvement, especially for memory operations, which
can bring the lower bound closer to **90-95%**. The remaining performance difference is mostly
due to advanced branching tricks used in some assembly code and more efficient
code layout and register usage.

## Open questions

### How to express write-only destination operands to avoid extra zeroing?

**Proposed solution**: ?

### What about gccgo and other Go implementations?

**Proposed solution**: we can probably start by not permitting `unsafe.Asm` inside compilers that do not support it.

### How to express multi-output instructions?

**Proposed solution A**: interpret `[]interface{}` argument as a multi-value destination.

```go
var quo, rem uint8
// Note that IDIVB expects the dividend to be in AX.
unsafe.Asm("MOVB", "AX", uint8(x))
unsafe.Asm("IDIVB", []interface{}{quo, rem}, uint8(y))
// AL is moved to quo.
// AH is moved to rem.
```

Note that `[]interface{}` causes no allocations and is consumed at compile time.<br>
This is consistent with how `unsafe.Sizeof` works.

**Proposed solution B**: add `unsafe.Asm2` function that has 2 destination arguments.

```go
func Asm2(opcode string, dst1, dst2 interface{}, args ...interface{})
```
