
Question: why is assembly language so heavily used #1968


Closed
marxin opened this issue Jan 17, 2019 · 11 comments

Comments

@marxin
Contributor

marxin commented Jan 17, 2019

It's a question that I noticed while reducing #1964.

So if I see correctly, the problematic code can be rewritten in C as follows:

typedef long BLASLONG;

void
__attribute__((noipa))
saxpy_kernel_16( BLASLONG n, float * restrict x, float * restrict y, float alpha)
{
  for (BLASLONG i = 0; i < n; i++)
    y[i] += alpha * x[i];
}

void saxpy_k(BLASLONG n, float da, float * restrict x, BLASLONG inc_x, float * restrict y, BLASLONG inc_y)
{
  if (inc_x == 1 && inc_y == 1)
    saxpy_kernel_16 (n, x, y, da);
  else
  {
    for (BLASLONG i = 0; i < n; i++)
      y[i * inc_y] += da * x[i * inc_x];
  }
}

which generates vectorized code for both functions (testing GCC 9):
gcc blast2.c -c -S -O2 -march=haswell -fdump-tree-all-details -ftree-vectorize

	.file	"blast2.c"
	.text
	.p2align 4
	.globl	saxpy_kernel_16
	.type	saxpy_kernel_16, @function
saxpy_kernel_16:
.LFB0:
	.cfi_startproc
	testq	%rdi, %rdi
	jle	.L11
	leaq	-1(%rdi), %rax
	cmpq	$6, %rax
	jbe	.L7
	movq	%rdi, %rcx
	vbroadcastss	%xmm0, %ymm2
	xorl	%eax, %eax
	shrq	$3, %rcx
	salq	$5, %rcx
	.p2align 4,,10
	.p2align 3
.L4:
	vmovups	(%rsi,%rax), %ymm1
	vfmadd213ps	(%rdx,%rax), %ymm2, %ymm1
	vmovups	%ymm1, (%rdx,%rax)
	addq	$32, %rax
	cmpq	%rcx, %rax
	jne	.L4
	movq	%rdi, %rax
	andq	$-8, %rax
	testb	$7, %dil
	je	.L13
	vzeroupper
	.p2align 4,,10
	.p2align 3
.L6:
	vmovss	(%rsi,%rax,4), %xmm1
	vfmadd213ss	(%rdx,%rax,4), %xmm0, %xmm1
	vmovss	%xmm1, (%rdx,%rax,4)
	incq	%rax
	cmpq	%rax, %rdi
	jg	.L6
.L11:
	ret
	.p2align 4,,10
	.p2align 3
.L13:
	vzeroupper
	ret
.L7:
	xorl	%eax, %eax
	jmp	.L6
	.cfi_endproc
.LFE0:
	.size	saxpy_kernel_16, .-saxpy_kernel_16
	.p2align 4
	.globl	saxpy_k
	.type	saxpy_k, @function
saxpy_k:
.LFB1:
	.cfi_startproc
	movq	%rdx, %rax
	movq	%rcx, %rdx
	cmpq	$1, %rax
	jne	.L21
	cmpq	$1, %r8
	je	.L15
.L21:
	testq	%rdi, %rdi
	jle	.L29
	leaq	0(,%rax,4), %rcx
	salq	$2, %r8
	xorl	%eax, %eax
	.p2align 4,,10
	.p2align 3
.L20:
	vmovss	(%rsi), %xmm1
	vfmadd213ss	(%rdx), %xmm0, %xmm1
	incq	%rax
	addq	%r8, %rsi
	vmovss	%xmm1, (%rdx)
	addq	%rcx, %rdx
	cmpq	%rax, %rdi
	jne	.L20
	ret
	.p2align 4,,10
	.p2align 3
.L15:
	subq	$8, %rsp
	.cfi_def_cfa_offset 16
	call	saxpy_kernel_16
	addq	$8, %rsp
	.cfi_def_cfa_offset 8
	ret
	.p2align 4,,10
	.p2align 3
.L29:
	ret
	.cfi_endproc
.LFE1:
	.size	saxpy_k, .-saxpy_k
	.ident	"GCC: (GNU) 9.0.0 20190116 (experimental)"
	.section	.note.GNU-stack,"",@progbits

So my question is: why have you been using arch-specific assembly code so much?
Thanks

@marxin
Contributor Author

marxin commented Jan 17, 2019

@martin-frbg
Collaborator

Most of the code in question was written at a time when gcc 4.x or even 3.x was current, and it is also expected to compile with a wide range of compilers and versions. (Not everybody - especially users of compute clusters at research institutions - can be expected to update their compiler, so some are still stuck with whatever their version of e.g. RHEL contained.) The recent code contributions for Intel Skylake X make heavy use of AVX512 intrinsics instead, but they simply fall back to the old Haswell kernels if the compiler is not up to it.
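That compile-time-fallback pattern can be sketched roughly like this (hypothetical function name, not the actual OpenBLAS kernels): the AVX512 path is only compiled when the compiler itself advertises AVX512F support, so older compilers silently get a plain-C body.

```c
#include <stddef.h>

/* Hypothetical saxpy kernel: y[i] += alpha * x[i].
 * The AVX512 path exists only when the compiler defines __AVX512F__;
 * older compilers fall back to the scalar loop below. */
#ifdef __AVX512F__
#include <immintrin.h>
static void saxpy_body(size_t n, float alpha, const float *x, float *y)
{
    size_t i = 0;
    __m512 va = _mm512_set1_ps(alpha);   /* broadcast alpha to 16 lanes */
    for (; i + 16 <= n; i += 16) {
        __m512 vx = _mm512_loadu_ps(x + i);
        __m512 vy = _mm512_loadu_ps(y + i);
        _mm512_storeu_ps(y + i, _mm512_fmadd_ps(va, vx, vy));
    }
    for (; i < n; i++)                   /* scalar tail */
        y[i] += alpha * x[i];
}
#else
static void saxpy_body(size_t n, float alpha, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}
#endif
```

Both branches compute the same result; only the one matching the compiler's capabilities is ever built.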

@rguenth

rguenth commented Jan 17, 2019 via email

@brada4
Contributor

brada4 commented Jan 17, 2019

If you manage to fine-tune C code to do better than the assembly, a contribution is more than welcome.
The Intel C compiler will plainly ignore most GCC pragmas; clang ignores somewhat fewer.
Also, we should not break currently sold/supported LTS OS distributions, e.g. Red Hat 5, Ubuntu 12, etc.
The inline factor can be simulated by inlining C statements too, which is also done in some microarch-specific kernels.
You show generated assembly that has roughly the same instructions, but real processors have execution resources that can stall for a few cycles if instructions are not well interleaved. Say a CPU can do four float multiplications and one memory load at once: the hand-written assembly will sequence loads into half of the registers, do the multiplications on the previously loaded set, then pre-load the next half, so it never stalls. The same instructions in a different order might stall before a multiplication, for example.
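The interleaving point is easiest to see on a reduction, where a single accumulator serializes every iteration on the FMA latency. A pure-C illustration (a sketch, not OpenBLAS code): splitting the sum into independent accumulators creates parallel dependency chains, which is what the hand-written kernels arrange explicitly with register halves.

```c
#include <stddef.h>

/* Single accumulator: each addition depends on the previous one,
 * so the loop runs at the latency of one multiply-add per element. */
static float sdot_naive(size_t n, const float *x, const float *y)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}

/* Two independent accumulators: the two chains can execute in
 * parallel, overlapping loads and multiplies across iterations. */
static float sdot_interleaved(size_t n, const float *x, const float *y)
{
    float s0 = 0.0f, s1 = 0.0f;
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        s0 += x[i] * y[i];
        s1 += x[i + 1] * y[i + 1];
    }
    if (i < n)                  /* odd-length tail */
        s0 += x[i] * y[i];
    return s0 + s1;
}
```

Note that a compiler may not do this transformation on its own under strict floating-point rules, because it reorders the additions.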

@marxin
Contributor Author

marxin commented Jan 18, 2019

I fully accept that you want OpenBLAS to build on legacy systems with old compiler versions.
I just wanted to mention that glibc is moving away from assembly implementations of some math routines and instead defines macros that drive the compilation. One can watch a nice presentation from Szabolcs:
https://www.youtube.com/watch?v=IovnxqE5GBQ

Maybe he would also be interested in this thread (@nsz-arm).
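The macro-driven approach mentioned above can be sketched roughly like this (a hypothetical knob, not glibc's actual macros): one generic C kernel is parameterized by a build-time macro that each target's makefile sets, instead of maintaining a separate assembly file per architecture.

```c
#include <stddef.h>

/* Hypothetical build-time knob: a target's makefile would define
 * its preferred unroll factor, e.g. -DKERNEL_UNROLL=8, instead of
 * shipping its own hand-written assembly kernel. */
#ifndef KERNEL_UNROLL
#define KERNEL_UNROLL 4
#endif

static void saxpy_generic(size_t n, float alpha, const float *x, float *y)
{
    size_t i = 0;
    /* The inner loop has a compile-time constant trip count, so the
     * compiler fully unrolls it and can vectorize the result. */
    for (; i + KERNEL_UNROLL <= n; i += KERNEL_UNROLL)
        for (size_t j = 0; j < KERNEL_UNROLL; j++)
            y[i + j] += alpha * x[i + j];
    for (; i < n; i++)          /* remainder elements */
        y[i] += alpha * x[i];
}
```

The same source then compiles into different code on each target, with the tuning expressed as data rather than as assembly.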

@nsz-arm

nsz-arm commented Jan 18, 2019 via email

@brada4
Contributor

brada4 commented Jan 18, 2019

I don't see how a passive description of an ideal (and non-existent) compiler sums up as a measurable improvement.

@martin-frbg
Collaborator

@brada4 you may want to look up their names (hint: gcc developers)

@brada4
Contributor

brada4 commented Jan 18, 2019

Well, an ideal compiler would not need BLAS; one would just write quadruple loops and it would give a perfect result in minimal time.

@fenrus75
Contributor

Also, please look at the Skylake X versions of some of those; I've been transcribing them into C (with intrinsics). These versions generally work on multiple generations of hardware.

@brada4
Contributor

brada4 commented Jan 19, 2019

One problem I see is that compilers are distributed with a very basic default -march and no -mtune=generic, so the compiler is never told that, for the common case, it actually should unroll into longer (like 4x) chains, which would not noticeably hurt that very basic architecture.

Say, Intel compilers will generate multiple unrolled code paths and dispatch to the best one at runtime - sort of what OpenBLAS does by hand, but at runtime.
Clang will do much heavier arithmetic/symbolic optimisations (but the Fortran part, where those would matter more, is quite immature).
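The runtime-dispatch idea looks roughly like this in GNU C (a sketch only; OpenBLAS's actual DYNAMIC_ARCH dispatch goes through its own driver tables, and the variant names here are hypothetical): a kernel pointer is chosen once, based on what the running CPU actually supports.

```c
#include <stddef.h>

/* Two hypothetical kernel variants; a real library would compile the
 * AVX2 one in a separate translation unit built with -mavx2. Here
 * both paths use the same scalar body so the sketch runs anywhere. */
static void saxpy_scalar(size_t n, float alpha, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

typedef void (*saxpy_fn)(size_t, float, const float *, float *);

/* Choose the best variant once at startup - what Intel's multi-path
 * code generation (or glibc's ifunc mechanism) does automatically. */
static saxpy_fn pick_saxpy(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    if (__builtin_cpu_supports("avx2"))
        return saxpy_scalar;    /* stand-in for an AVX2 kernel */
#endif
    return saxpy_scalar;        /* portable fallback */
}
```

The check runs once; after that, every call goes straight through the selected function pointer with no per-call CPU detection.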
