
Question: why is assembly language so heavily used #1968


Closed
marxin opened this issue Jan 17, 2019 · 11 comments

Comments

@marxin
Contributor

marxin commented Jan 17, 2019

It's a question that I noticed while reducing #1964.

So if I see correctly, the problematic code can be rewritten in C as follows:

typedef long BLASLONG;

void
__attribute__((noipa))
saxpy_kernel_16( BLASLONG n, float * restrict x, float * restrict y, float alpha)
{
  for (BLASLONG i = 0; i < n; i++)
    y[i] += alpha * x[i];
}

void saxpy_k(BLASLONG n, float da, float * restrict x, BLASLONG inc_x, float * restrict y, BLASLONG inc_y)
{
  if (inc_x == 1 && inc_y == 1)
    saxpy_kernel_16 (n, x, y, da);
  else
  {
    for (BLASLONG i = 0; i < n; i++)
      y[i * inc_y] += da * x[i * inc_x];
  }
}

which generates vectorized code for both functions (testing GCC 9):
gcc blast2.c -c -S -O2 -march=haswell -fdump-tree-all-details -ftree-vectorize

	.file	"blast2.c"
	.text
	.p2align 4
	.globl	saxpy_kernel_16
	.type	saxpy_kernel_16, @function
saxpy_kernel_16:
.LFB0:
	.cfi_startproc
	testq	%rdi, %rdi
	jle	.L11
	leaq	-1(%rdi), %rax
	cmpq	$6, %rax
	jbe	.L7
	movq	%rdi, %rcx
	vbroadcastss	%xmm0, %ymm2
	xorl	%eax, %eax
	shrq	$3, %rcx
	salq	$5, %rcx
	.p2align 4,,10
	.p2align 3
.L4:
	vmovups	(%rsi,%rax), %ymm1
	vfmadd213ps	(%rdx,%rax), %ymm2, %ymm1
	vmovups	%ymm1, (%rdx,%rax)
	addq	$32, %rax
	cmpq	%rcx, %rax
	jne	.L4
	movq	%rdi, %rax
	andq	$-8, %rax
	testb	$7, %dil
	je	.L13
	vzeroupper
	.p2align 4,,10
	.p2align 3
.L6:
	vmovss	(%rsi,%rax,4), %xmm1
	vfmadd213ss	(%rdx,%rax,4), %xmm0, %xmm1
	vmovss	%xmm1, (%rdx,%rax,4)
	incq	%rax
	cmpq	%rax, %rdi
	jg	.L6
.L11:
	ret
	.p2align 4,,10
	.p2align 3
.L13:
	vzeroupper
	ret
.L7:
	xorl	%eax, %eax
	jmp	.L6
	.cfi_endproc
.LFE0:
	.size	saxpy_kernel_16, .-saxpy_kernel_16
	.p2align 4
	.globl	saxpy_k
	.type	saxpy_k, @function
saxpy_k:
.LFB1:
	.cfi_startproc
	movq	%rdx, %rax
	movq	%rcx, %rdx
	cmpq	$1, %rax
	jne	.L21
	cmpq	$1, %r8
	je	.L15
.L21:
	testq	%rdi, %rdi
	jle	.L29
	leaq	0(,%rax,4), %rcx
	salq	$2, %r8
	xorl	%eax, %eax
	.p2align 4,,10
	.p2align 3
.L20:
	vmovss	(%rsi), %xmm1
	vfmadd213ss	(%rdx), %xmm0, %xmm1
	incq	%rax
	addq	%r8, %rsi
	vmovss	%xmm1, (%rdx)
	addq	%rcx, %rdx
	cmpq	%rax, %rdi
	jne	.L20
	ret
	.p2align 4,,10
	.p2align 3
.L15:
	subq	$8, %rsp
	.cfi_def_cfa_offset 16
	call	saxpy_kernel_16
	addq	$8, %rsp
	.cfi_def_cfa_offset 8
	ret
	.p2align 4,,10
	.p2align 3
.L29:
	ret
	.cfi_endproc
.LFE1:
	.size	saxpy_k, .-saxpy_k
	.ident	"GCC: (GNU) 9.0.0 20190116 (experimental)"
	.section	.note.GNU-stack,"",@progbits

So my question is: why have you been using arch-specific assembly code so much?
Thanks

@marxin
Contributor Author

marxin commented Jan 17, 2019

@martin-frbg
Collaborator

Most of the code in question was written at a time when gcc 4.x or even 3.x was current, and it is also expected to compile with a wide range of compilers and versions. (Not everybody - especially users of compute clusters at research institutions - can be expected to update their compiler, so some are still stuck with whatever their version of e.g. RHEL contained.) The recent code contributions for Intel Skylake X make heavy use of AVX512 intrinsics instead, but they simply fall back to the old Haswell kernels if the compiler is not up to it.
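That compile-time-fallback pattern can be sketched roughly like this (hypothetical function name, not the actual OpenBLAS kernels): the AVX512 path is only compiled when the compiler itself advertises AVX512F support, so older compilers silently get a plain-C body.

```c
#include <stddef.h>

/* Hypothetical saxpy kernel: y[i] += alpha * x[i].
 * The AVX512 path exists only when the compiler defines __AVX512F__;
 * older compilers fall back to the scalar loop below. */
#ifdef __AVX512F__
#include <immintrin.h>
static void saxpy_body(size_t n, float alpha, const float *x, float *y)
{
    size_t i = 0;
    __m512 va = _mm512_set1_ps(alpha);   /* broadcast alpha to 16 lanes */
    for (; i + 16 <= n; i += 16) {
        __m512 vx = _mm512_loadu_ps(x + i);
        __m512 vy = _mm512_loadu_ps(y + i);
        _mm512_storeu_ps(y + i, _mm512_fmadd_ps(va, vx, vy));
    }
    for (; i < n; i++)                   /* scalar tail */
        y[i] += alpha * x[i];
}
#else
static void saxpy_body(size_t n, float alpha, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}
#endif
```

Both branches compute the same result; only the one matching the compiler's capabilities is ever built.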

@rguenth

rguenth commented Jan 17, 2019 via email

@brada4
Contributor

brada4 commented Jan 17, 2019

If you manage to fine-tune C code to do better than the assembly, a contribution is more than welcome.
The Intel C compiler will plainly ignore most GCC pragmas; clang ignores somewhat fewer.
Also, we should not break currently sold/supported LTS OS distributions, e.g. Red Hat 5, Ubuntu 12, etc.
The inline factor can be simulated by inlining C statements too, which is also done in some microarch-specific kernels.
You show generated assembly that has roughly the same instructions, but real processors have execution resources that can stall for a few cycles if instructions are not well interleaved. Say a CPU can do four float multiplications and one memory load at once: the hand-written assembly will sequence loads into half of the registers, do the multiplications on the previously loaded set, then pre-load the next half, so it never stalls. The same instructions in a different order might stall before a multiplication, for example.
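The interleaving point is easiest to see on a reduction, where a single accumulator serializes every iteration on the FMA latency. A pure-C illustration (a sketch, not OpenBLAS code): splitting the sum into independent accumulators creates parallel dependency chains, which is what the hand-written kernels arrange explicitly with register halves.

```c
#include <stddef.h>

/* Single accumulator: each addition depends on the previous one,
 * so the loop runs at the latency of one multiply-add per element. */
static float sdot_naive(size_t n, const float *x, const float *y)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}

/* Two independent accumulators: the two chains can execute in
 * parallel, overlapping loads and multiplies across iterations. */
static float sdot_interleaved(size_t n, const float *x, const float *y)
{
    float s0 = 0.0f, s1 = 0.0f;
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        s0 += x[i] * y[i];
        s1 += x[i + 1] * y[i + 1];
    }
    if (i < n)                  /* odd-length tail */
        s0 += x[i] * y[i];
    return s0 + s1;
}
```

Note that a compiler may not do this transformation on its own under strict floating-point rules, because it reorders the additions.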

@marxin
Contributor Author

marxin commented Jan 18, 2019

I fully accept that you want OpenBLAS to build on legacy systems with old compiler versions.
I just wanted to mention that glibc is moving away from assembly implementations of some math routines and instead defines macros that drive the compilation. One can watch a nice presentation from Szabolcs:
https://www.youtube.com/watch?v=IovnxqE5GBQ

Maybe he would also be interested in this thread (@nsz-arm).
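The macro-driven approach mentioned above can be sketched roughly like this (a hypothetical knob, not glibc's actual macros): one generic C kernel is parameterized by a build-time macro that each target's makefile sets, instead of maintaining a separate assembly file per architecture.

```c
#include <stddef.h>

/* Hypothetical build-time knob: a target's makefile would define
 * its preferred unroll factor, e.g. -DKERNEL_UNROLL=8, instead of
 * shipping its own hand-written assembly kernel. */
#ifndef KERNEL_UNROLL
#define KERNEL_UNROLL 4
#endif

static void saxpy_generic(size_t n, float alpha, const float *x, float *y)
{
    size_t i = 0;
    /* The inner loop has a compile-time constant trip count, so the
     * compiler fully unrolls it and can vectorize the result. */
    for (; i + KERNEL_UNROLL <= n; i += KERNEL_UNROLL)
        for (size_t j = 0; j < KERNEL_UNROLL; j++)
            y[i + j] += alpha * x[i + j];
    for (; i < n; i++)          /* remainder elements */
        y[i] += alpha * x[i];
}
```

The same source then compiles into different code on each target, with the tuning expressed as data rather than as assembly.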

@nsz-arm

nsz-arm commented Jan 18, 2019 via email

@brada4
Contributor

brada4 commented Jan 18, 2019

I don't see how a passive description of an ideal (and non-existent) compiler sums up as a measurable improvement.

@martin-frbg
Collaborator

@brada4 you may want to look up their names (hint: gcc developers)

@brada4
Contributor

brada4 commented Jan 18, 2019

Well, an ideal compiler would not need BLAS; one would just write quadruple loops and it would give a perfect result in minimal time.

@fenrus75
Contributor

Also, please look at the Skylake X versions of some of those; I've been transcribing them into C (with intrinsics). These versions generally work on multiple generations of hardware.

@brada4
Contributor

brada4 commented Jan 19, 2019

One problem I see is that compilers are distributed with a very basic default -march and no -mtune=generic, so the compiler is never told that, for the common case, it actually should unroll into longer (like 4x) chains, which would not noticeably hurt that very basic architecture.

Say, Intel compilers will generate multiple unrolled code paths and dispatch to the best one at runtime - sort of what OpenBLAS does by hand, but at runtime.
Clang will do much heavier arithmetic/symbolic optimisations (but the Fortran part, where those would matter more, is quite immature).
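The runtime-dispatch idea looks roughly like this in GNU C (a sketch only; OpenBLAS's actual DYNAMIC_ARCH dispatch goes through its own driver tables, and the variant names here are hypothetical): a kernel pointer is chosen once, based on what the running CPU actually supports.

```c
#include <stddef.h>

/* Two hypothetical kernel variants; a real library would compile the
 * AVX2 one in a separate translation unit built with -mavx2. Here
 * both paths use the same scalar body so the sketch runs anywhere. */
static void saxpy_scalar(size_t n, float alpha, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

typedef void (*saxpy_fn)(size_t, float, const float *, float *);

/* Choose the best variant once at startup - what Intel's multi-path
 * code generation (or glibc's ifunc mechanism) does automatically. */
static saxpy_fn pick_saxpy(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    if (__builtin_cpu_supports("avx2"))
        return saxpy_scalar;    /* stand-in for an AVX2 kernel */
#endif
    return saxpy_scalar;        /* portable fallback */
}
```

The check runs once; after that, every call goes straight through the selected function pointer with no per-call CPU detection.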
