Question: why is assembly language so heavily used #1968
Comments
Most of the code in question was written at a time when gcc 4.x or even 3.x was current, and is also expected to compile with a wide range of compilers and versions. (Not everybody - especially users of compute clusters at research institutions - can be expected to update their compiler, so some are still |
I suppose the asm()s are there because in _16 we know that
n % 16 == 0? Possibly micro-arch dependent "optimal"
unrolling is also performed here. There's now #pragma GCC unroll,
which triggers unrolling (on RTL), so if you want to unroll
the vectorized loop you can put
#pragma GCC unroll 4
before the loop. That n % 16 is zero is most easily communicated
by iterating up to n' * 16 instead of n.
I do expect assembly to eventually be the best-performing
variant, but then I'd expect it to be written in .s files
rather than as inline assembly.
Using vector intrinsics might be another option.
All changes require work, of course, so fixing up the
bad inline asm()s is top priority.
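The unrolling suggestion above can be sketched in plain C. This is only an illustration, not OpenBLAS code: the kernel name `daxpy_blocks16` and the parameter `n16` (the number of 16-element blocks) are hypothetical. The loop bound is written as `n16 * 16` so the compiler can see that the trip count is a multiple of 16, and `#pragma GCC unroll 4` asks GCC for four-fold unrolling. Compilers that do not recognize the pragma will typically ignore it, possibly with a warning.

```c
#include <stddef.h>

/* Hypothetical kernel: y[i] += alpha * x[i] over n16 * 16 elements.
 * Passing the block count instead of n tells the compiler that the
 * element count is divisible by 16, so no remainder loop is needed. */
void daxpy_blocks16(size_t n16, double alpha,
                    const double *restrict x, double *restrict y)
{
    const size_t n = n16 * 16;

    /* Request 4x unrolling of the loop that follows (GCC-specific). */
#pragma GCC unroll 4
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}
```

Whether the result matches a hand-written kernel depends on the compiler version and the `-march`/`-O` flags in use, so it would need to be benchmarked per target.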
|
If you manage to fine-tune C code to do better than the assembly, a contribution is more than welcome. |
I fully accept that you want the project to build on legacy systems with old compiler versions. @nsz-arm may also be interested in this thread. |
note that loop-heavy matrix computations may depend on a
smart compiler more than scalar fp code which is already
close to the target isa.
but i do think that keeping asm up to date for new cpus
is bigger effort in long term than improving the compiler.
if hand written asm can beat the compiler by significant
margin then either the compiler or the language needs to
improve.
(as for asm, i now prefer inline asm to .s for a function,
to let the compiler generate the right prologue/epilogue
for various extensions like security hardening or profiling
instrumentation, but yes inline asm constraints are hard.)
|
I don't see how a passive description of an ideal (and nonexistent) compiler adds up to a measurable improvement. |
@brada4 you may want to look up their names (hint: gcc developers) |
Well, an ideal compiler would not need BLAS: you would just write quadruple loops and get a perfect result in minimal time. |
Also, please look at the skylakex versions of some of those; I've been transcribing them into C (with intrinsics). These versions generally work on multiple generations of hardware. |
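For readers unfamiliar with the intrinsics style mentioned here, a minimal sketch follows. It is not taken from the skylakex kernels, which presumably target wider vector extensions. It uses SSE2 only because that is the baseline on x86-64, and it falls back to a scalar loop elsewhere. The function name `daxpy_intrin` is hypothetical.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Hypothetical kernel: y[i] += alpha * x[i] for i in [0, n). */
void daxpy_intrin(size_t n, double alpha, const double *x, double *y)
{
    size_t i = 0;
#if defined(__SSE2__)
    const __m128d va = _mm_set1_pd(alpha);        /* broadcast alpha */
    for (; i + 2 <= n; i += 2) {                  /* two doubles per step */
        __m128d vx = _mm_loadu_pd(x + i);         /* unaligned loads */
        __m128d vy = _mm_loadu_pd(y + i);
        vy = _mm_add_pd(vy, _mm_mul_pd(va, vx));
        _mm_storeu_pd(y + i, vy);
    }
#endif
    /* Scalar tail; also the whole loop when SSE2 is unavailable. */
    for (; i < n; i++)
        y[i] += alpha * x[i];
}
```

Unlike inline assembly, the compiler still handles register allocation and scheduling here, which is one reason the same source can serve several hardware generations.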
One problem I see is that compilers are distributed with -march=verybasic and no -mtune=generic, so the compiler is never told that in the common case it should unroll into longer (say 4x) chains, which would not noticeably hurt that very basic architecture. Intel compilers, say, would generate multiple unrolled code paths and dispatch to the best one at runtime, sort of what OpenBLAS does by hand. |
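The "several code paths, pick one at runtime" idea can be sketched by hand in C. This is an assumption-laden illustration, not how OpenBLAS or any particular compiler implements dispatch: all function names are hypothetical, and the rule "use the unrolled path when AVX2 is reported" is an arbitrary stand-in for a real tuning decision. `__builtin_cpu_init` and `__builtin_cpu_supports` are GCC/Clang builtins for x86, so they are guarded and a baseline path is used everywhere else.

```c
#include <stddef.h>

typedef void (*dscal_fn)(size_t n, double alpha, double *x);

/* Baseline path: plain loop, safe on any target. */
static void dscal_basic(size_t n, double alpha, double *x)
{
    for (size_t i = 0; i < n; i++)
        x[i] *= alpha;
}

/* Alternative path: body unrolled 4x by hand, plus a scalar tail. */
static void dscal_unroll4(size_t n, double alpha, double *x)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        x[i]     *= alpha;
        x[i + 1] *= alpha;
        x[i + 2] *= alpha;
        x[i + 3] *= alpha;
    }
    for (; i < n; i++)
        x[i] *= alpha;
}

/* Choose an implementation from CPU features (illustrative rule only). */
static dscal_fn select_dscal(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return dscal_unroll4;
#endif
    return dscal_basic;
}

/* Public entry point: resolve once, then call through the pointer.
 * Note: this lazy initialization is not thread-safe as written. */
void dscal(size_t n, double alpha, double *x)
{
    static dscal_fn impl;   /* NULL until the first call */
    if (!impl)
        impl = select_dscal();
    impl(n, alpha, x);
}
```

Both paths compute the same result, so the choice only affects speed; a real library would key the decision on measured performance per microarchitecture.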
This is a question that came up while I was reducing #1964.
So if I see correctly, the problematic code can be rewritten in C as follows:
which generates vectorized code for both functions (testing GCC 9):
gcc blast2.c -c -S -O2 -march=haswell -fdump-tree-all-details -ftree-vectorize
So my question is why have you been using arch-specific assembly code so much?
Thanks