COPY memory alignment SIGSEGV bug #1137
Segfault happens on Nehalem as well; copy_sse2.S appears to be original code from libgoto. I do not have time today to investigate further - maybe there is something unusual about the testcase if it only comes up now, or maybe the assembly "only" needs the proper .align in the proper place?
For the record, gdb puts the fault at line 216 of copy_sse2.S, which is one of the movaps instructions.
I guess the problem is a violation of the Intel ABI, pp. 13-14:
@brada4: Is turning off SSE even for properly aligned memory really a good idea in view of performance? I would be happy with the following resolution:
SSE2 is omnipresent on x86_64 and is emitted by any compiler, and the _copy (or any Level 1 BLAS) routines are completely memory-bound. Even glibc has AVX memory copy routines nowadays; in the borderline case the SSE assembly is probably even slower than memcpy (memory-bound, i.e. it spends 2x more CPU cycles in the same time).
@martin-frbg what makes me wonder is that Sandy Bridge is not affected.
Wouldn't the test case above (with offset=8, of course) allow benchmarking SSE vs. the compiler-picked implementation rather than handwaving?
Most kernels assume aligned input anyway. But that does not mean they need to crash in unaligned cases.
Maybe it is just the cblas interface that should ensure proper alignment before calling the kernel function (but that would seem to create the weird situation of using a potentially inefficient copy implementation just to drive the hand-optimized one)? At least playing with the .align statements already present in copy_sse2.S does not appear to help, while replacing the two affected movaps instructions at label L15 with movups does. Still no benchmarking to see which version wins.
My preliminary tests suggest that replacing the movaps instructions of copy_sse2.S with movups has an almost negligible impact on Nehalem at least, while the plain C implementation from arm/copy.c is significantly slower (320 vs. 500 seconds for the aligned version of check_copy.c with nm=300 and an outer loop that does 10 million invocations of the loop on "offset").
From various sources, it appears that movups performs identically to movaps when accessing aligned memory on Intel targets since Nehalem. For AMD processors - which I do not have available - I only find this expressly mentioned in the Software Optimization Guide for the "Family 16h" processors, but not for 15h (Bulldozer et al.).
Why not check memory alignment on each call and fail with a meaningful error message on misalignment instead of a SIGSEGV? Checking alignment costs just one bitwise test of the low address bits against a mask.
Have to defer to @xianyi and other core developers on this, but to me this sounds expensive even if it is "just" a bitwise test on every call.
The other option is a SIGSEGV, so it seems worth the cost. There are other checks on the input; why is this one singled out for exclusion?
All kernels consider inputs somewhat aligned, probably to more than the size of the variable itself.
@mattip back then I had hoped for additional comments, but the fact that the topic went dormant seems to suggest that most callers ensure that their data is aligned, even if "only" for performance. On the other hand, my misgivings about potential performance impacts from using movups were related to hardware that was already old back then, and would be even more outdated two years on.
Though only in a comment, Intel claims the unaligned instructions are performance-neutral.
#1137 (comment) - the open question is (or was) whether this is the case on older (and specifically older AMD) CPUs as well.
That is, whether using the U(naligned) instructions on old processors is detrimental in the A(ligned) cases?
Exactly. No need to worry about the future performance of the U cases when they just blow up right now.
Why do people stick to those old things? The affected processors (CORE2, OPTERON, OPTERON_SSE3 and the oldest half of ATOM) are quite on par with a Raspberry Pi 3 for numerical computations... (with the remark about the electricity bill)
No idea about the Core2 generation (perhaps some poor undergrads), but Atom is still valid in its current generation and could be used for pattern detection in camera setups or the like. At least I see no need to "kill" the old targets, or to degrade their performance even further just to cater to some unusual and on the whole undesirable special case. (Are there any examples where unaligned data can/will show up and not be easily caught by some program before it passes it to BLAS, or is this more of an academic exercise?)
Hello, I hit this bug today on a recent Whiskey Lake processor. To elaborate on the comments above, I would like to give my point of view as an OpenBLAS user.

It seems to me that calling BLAS routines with unaligned data is not a mere theoretical exercise: it can happen in practice if you allocate a large matrix/vector in the first place and then perform BLAS operations on sub-blocks. Moreover, I think that a significant part of BLAS library users are unaware of memory alignment issues; they may move away from OpenBLAS if faced with crashes while they use it in compliance with the standard interface. Finally, even those who are aware of the problem will be reluctant to modify their code to take into account specific requirements of OpenBLAS when they can use alternative BLAS libraries. Thus I think that memory alignment issues should be handled on the BLAS library side rather than in the application code.

Programming in OpenBLAS is beyond the scope of my duties and skills, but I am happy to contribute to solving this bug by testing. Here is my system information:
Thank you.
@toulorge-cenaero got any backtrace? In principle, even with your theoretical exercise the values will be aligned as in the beginning, unless you resort to numerology by shifting floats by bytes. Not sure about the BLAS library users you refer to - those using MKL are advised to use MKL's internal allocator if they are unable to get aligned allocations from the operating system.
@brada4 Here is the OpenBLAS part of the backtrace when run with my application:
When I run the minimal example provided in the first comment, I get the fault at line 216 of copy_sse2.S as mentioned previously.
I am not sure we understand each other. Let's imagine that I allocate memory for a 4x5 float matrix at a pointer
As explained above, I do not think that my problem is related to the allocation itself. I was not referring to any specific BLAS library: my point is rather that, BLAS being a standard interface, it seems to me that most users expect BLAS libraries to be interchangeable so that their code is easily ported to different systems. Please do not get me wrong: I am happy and very grateful that I can use OpenBLAS on my system. Nevertheless, I am reluctant to handle requirements of specific BLAS implementations in my application beyond those of the standard interface.
It is trivial to replace the offending movaps instructions with movups.
MOVUPS takes more cycles than MOVAPS on old processors even in the aligned case (2 vs. 5), but it is still sustainable as long as CPUs are clocked well above RAM.
It is common sense to align a 4-byte float to 4 bytes; yours is not, and glibc memcpy() will crash the same way.
@martin-frbg does it look viable to check alignment in the sse copy and fall back to the C code in the unlikely() case?
@martin-frbg Your solution allows me to correctly run both the check_copy.c example and my application. I am not so surprised that my application runs smoothly after fixing DCOPY: if there were a similar problem in the BLAS Level 2 and Level 3 routines that we mostly use, I guess somebody would have found it much earlier. Thank you for your help; I hope that this fix or a similar one will be included in the next release of OpenBLAS. @brada4 The data is obviously 4-byte aligned here... The problem is the 16-byte alignment (i.e. the width of 4 floats) required by the SSE instructions, whereas the sub-vector or sub-matrix that we want to copy does not necessarily start at an index that is a multiple of 4 within the larger allocated vector or matrix.
The iamax_sse2 routines have both code paths implemented; the check for aligned or unaligned access is done e.g. on lines 94-100.
@MigMuc thanks. Interestingly, copy_sse2.S does already have some concept (and handling) of (un)ALIGNED_ACCESS, though this attribute does not appear to be defined for HASWELL and similar "recent" kernels in either l1param.h or l2param.h. (Which leads to the check being performed on Y rather than X, if I read it correctly - but also, if I read it correctly, it would appear that those CPUs where unaligned accesses would incur a performance penalty already define ALIGNED_ACCESS and hence hit the other code path.)
Currently I am trying to write an avx2 version of the copy routines for Haswell and Zen CPUs, using the dscal.c implementation as a starting point. When compiling I get some issues with the immintrinsics, which I handle the same way as dscal_microk_skylakex-2.c does. I cannot find where in the Makefiles the -mavx2 switch is set for ZEN.
Makefile.x86_64 should be the place for this - but as (I think) none of the relevant kernels uses immintrinsics on ZEN, it has not come up as a problem so far.
Minimal example check_copy.c, compiled via

Error occurs for offset == 1. Changing the increment of offset to offset += 8 makes the code run fine.

System:

I know it is not a good idea to have the vectors not aligned at 8-byte address boundaries. However, my expectation would be that the code does not crash in this case.