-
Notifications
You must be signed in to change notification settings - Fork 695
Alignment / undefined behavior for loads & stores #30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Let me try to summarize this, to check if I understand you:
Is that right? If so, I like it. |
Remember, the polyfill is just JS so we can do whatever we want and use this flexibility to avoid adding a wart to WebAssembly that would live on in perpetuity. For example, we could have a non-standard (and thus ignored by native impls) "polyfill" section of the module that listed info that is only useful to the polyfill. I already have an example of this: the exact length of the decoded asm.js chars (so that a right-size buffer can be malloc'd up front by the polyfill). To address the issue at hand, we could extend the polyfill section to store a sorted list of offsets of unaligned-safe loads/stores which the polyfill would then use to easily generate byte-loads-and-ors. |
@kripken Precisely. My ideal solution is to go from two opcodes:
to four:
Where the aligned/unaligned hints are promises to the VM/runtime, much like they are on x86. In practice once the polyfill is no longer in use, the aligned/unaligned opcodes will be effectively equivalent. IIRC the aligned-only restrictions on JS are in service of janky architectures like (old) ARM, so we definitely shouldn't shackle ourselves to them, especially in a scenario like this where it can negatively impact production code and hinder adoption. |
@lukewagner I agree with you on principle, but TBH I can't imagine an uglier wart than 'unaligned loads/stores are undefined behavior, sometimes'. Generating a sorted list of offsets sounds extremely complicated compared to two new opcodes, especially given that the opcodes can effectively be aliases in a native implementation. That table will get pretty big for applications with many unaligned ops... |
The proposed semantics are "unaligned loads/stores work" (see V1.md#heap). The polyfill just doesn't implement the spec correctly (although, as Alon's PR for the high-level-goals points out, it could). |
@lukewagner: I am not sure this would be limited to the polyfill. There is a reason LLVM IR has that information on each load and store (even the amount of alignment, e.g. a 4-byte load could be 2-byte aligned). Aren't there CPU instructions that are faster if pointers are aligned at runtime, and others that are slower, but that don't care between aligned and unaligned loads and stores? I believe @sunfishcode was telling me about those, but also possible I misunderstood. If I did get that right, it seems like this information in wasm would allow us to emit better code in the native impl, not just the polyfill. |
It's my understanding that A quick search showed discussions as recently as two months ago about the performance impact of unaligned operations vs aligned ones (like https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017 ). I can believe that we eventually would regret having aligned/unaligned operations in this spec, but the cost of them is pretty minimal - they become aliases, so we're just eating two opcodes, and they're very simple aliases. Since we intend to have a table that maps named opcodes to integers, applications that don't use the unaligned stores won't pay any actual cost for their existence. Essentially, the burden of implementing these lies in the polyfill, along with some implementation effort in a compiler if you decide to use them. If you don't care about making unaligned ops work consistently, your compiler can just always emit aligned stores. |
Yes, movaps/movups discussion came up (a lot) in the context of SIMD.js. To wit, Dart and Odin both emit movups now so having a separate loadUnaligned wouldn't buy anything unless there was a strong assumption that emitters of loadAligned really expected the load to be aligned (b/c movaps means trap handler on unaligned access). That being said, Dan reminded me that the old-ARM-fault-on-unaligned case is symmetric with the polyfill case: you don't want to emit byte-loads because this is suboptimal on x86/new-ARM but you don't want to run 100x slower on old-ARM. So that at least gives us a non-polyfill use case (even if it also is diminishing over time) which makes me more positive on having a loadUnaligned. (On a bikeshed-y note: since the assumed-aligned loads wouldn't throw on unaligned, I think it'd be better to just name them 'load'.) |
Yeah, I don't have any particular attachment to 'loadAligned' or anything, just the distinction. I don't think there's any reason for them to have different semantics on native - both loadAligned and loadUnaligned generating movups is fine. My intent here is focused on behavior, not performance - using movups everywhere isn't observably different from using movaps for aligned loads & movups for unaligned loads, as long as you do it correctly. Having the webasm failure case for 'you did an aligned load from an unaligned address' be undefined behavior (masking, etc) instead of an invalid instruction trap or whatever is pretty sensible too, I think. The performance argument re: ARM is interesting though. I was under the impression old-ARM wasn't considered important anymore, but maybe it's shipping in more modern devices than I realized. |
Just to be really clear: the proposal was never to have undefined behavior on unaligned access in the spec and, even with proposed new (load|store)Unaligned opcodes, if C++ code falls into UB which causes Emscripten to emit a load (aligned) opcodes, they will still be silently masked by the polyfill. I think the best way to summarize the addition here is that we've added an op for the cases where you know you want to do unaligned access but you don't want to sacrifice perf on x86/new-ARM by emitting the byte loads that Emscripten does now. Basically, we push this choice from the developer-side compiler into the client-side compiler. I already have a todo to update the heap section (move stuff from V1 into AstSemantics now that it exists, include discussions about out-of-bounds), so I can add a cset in the PR to make this change if that sounds good to you. |
That sounds good. Maybe the root of our disconnect here is that I'm assuming the polyfill is intended to be spec-compliant. It sounds like you're saying that UB on unaligned access was never intended to be spec-compliant, but the intent was for the polyfill to do that? In which case, updating the heap section of V1/astsemantics sounds sufficient, and adding an op for this should be more than enough to make it possible to get guaranteed correct behavior in both polyfill & native. |
Yes, as Alon also pointed out, it was non-obvious that the polyfill's goals weren't to be 100% correct wrt the spec, but just "good enough to work for most use, as well as asm.js works today" misalignment/oob aren't really considered "working today". Alon's PR (which I'll merge after others have had a chance to comment) should fix that by explicitly including such text in the high-level goals. |
FWIW I don't think we should target pre-v7 ARM. Alignment shouldn't be an issue in v7/v8 when the kernel is properly configured. |
How about sparc and power hardware? |
Endianness would be bigger issues there ;-) |
I seem to remember seeing some advertisements that indicate Imagination is trying to push MIPS now, too: |
I like the plan discussed here. We shouldn't pessimize our favorite platforms to support armv6 and similar processors, but in this case it looks like it'll be quite easy and painless, so I think it's worth doing. Also, SIMD use cases for unaligned accesses are mostly distinct from scalar use cases, so we should have a separate discussion when it comes to adding SIMD (post-v1 as it's currently scheduled). |
Is there any benefit to doing the full LLVM thing where e.g. a 4-byte load can be 2-byte aligned, or just have "fully aligned" vs "fully unaligned" for each size? |
Another potential use case for this I was wondering about (but no evidence yet; I can ping Intel/ARM contacts) is whether various new low-power chips (say x86 Quark) will "support" unaligned loads/stores, but very slowly, much slower than byte-loads + or. With the unending push for smaller devices (IoT... sorry, I had to say it), it does seem likely we'll regress on unaligned loads just as we did back when mobile started. Anyhow, argument by itself wouldn't stand, but it does serve to "pile on". |
Agreed with @kripken: at this point it makes sense to let any load/store specify an alignment. If the load/store lied then it gets UB (which will be platform dependent). If no alignment is specified then we assume natural alignment (and again UB on under-alignment). |
Whoa there, I don't think we want any UB in the spec. The worst that happens if you lie about alignment is your loads/stores go way slower (b/c take faults or other slow paths). There is the separate issue that the polyfill will mask your aligned loads, but this is being treated as a performance cheat in the polyfill (one that can be turned off, at the cost of running slower), not UB in the spec. |
The benefit at the WebAssembly level of specifying a specific alignment would be on platforms that require alignment and on the asm.js polyfill that e.g. 2-byte alignment on a 4-byte load can be implemented by 2 2-byte loads, rather than 4 1-byte loads. That's a nice benefit, though it's not huge. On the other hand, the cost is just a few more variants of loads to define, which is a mild burden, but a pretty small one beyond what we already have just by having the basic unaligned access constructs. I'm ok either way. |
How common is it to have a non-natural-word-aligned load/store in LLVM? Does Emscripten currently optimize such loads/stores (e.g., into 2 2-bytes)? |
I agree that we should avoid UB like the plague. There is a philosophical On Tue, May 5, 2015 at 5:02 PM, Luke Wagner [email protected]
|
Yes, Emscripten will use the highest alignment LLVM says is ok to use, which can be more than 1 but less than the full size of the load/store. I don't know if it's common. I suppose it depends on how often people use packed structs... |
It's hard to have a strong opinion (on this sub-natural alignment sub-issue): we have the general goal of minimalism on the one hand and the simplicity/obviousness of the ops on the other hand and pretty mild use cases. |
How slow of an implementation is acceptable? A faulting memory access can be handled with a signal handler by the implementation :-) If we ship That's why I would rather have developer-supplied alignment specification, with implementation-defined behavior when the developer lied. We can decide to align to a natural boundary (some HW does this automtically), or catch and handle the fault on HW that faults, and on other HW it'll just be slower. I think mandating one or the other will lead to regret as we deploy wasm on other ISAs. |
@jfbastien Yes, on the few archs that fault, the intention is that the faulting access is handled by a signal handler. The only difference between aligned/unaligned is (from a spec pov) non-normative: the unaligned shouldn't go 100x slower if you feed it unaligned addresses. |
That would also be doable with developer-specified alignment on every load/store. Furthermore (this one is for @titzer) having better alignment information is wonderful for auto-vectorization (which is definitely not something we aim to do any time soon). |
I'm not against specifying alignment; it might actually be really useful to On Tue, May 5, 2015 at 6:49 PM, JF Bastien [email protected] wrote:
|
I'd go with implementation-defined for now, with a list of acceptable effects. Tentatively:
We can use this to later constrain what's an acceptable implementation! Leave it open for now, and once we have implementation experience and developer feedback we'll know what the right thing to do is. |
@titzer That is the central question, but I think the answer has to be "punish performance". Why? Because if we say "punish semantics" that means "throw". That means that on platforms that have no problem with unaligned loads/stores (x86, new ARM), we'd have to add extra guards unless we specified the behavior to be non-deterministic (which sounds bad). With "punish performance", you only get the perf cliff if you perform an unaligned load with a load-aligned and the "fix" is to use load-unaligned. The devtools profiler should point a big red arrow right to this and this is C++ UB territory anyway, so I don't anticipate this being a real problem in practice. |
@jfbastien Shrinking the set of allowable behaviors is not a backwards compatible operation. I think we should start with fully deterministic since it's not even clear if the platforms that fault on unaligned loads matter. |
It is backward compatible for developers: they were told one of these could occur, so they can't rely on a specific one occurring. |
Ben has already stated why this isn't simply the case. Also, I think we have an (implicit, though perhaps it should be made explicit) high-level goal of minimizing non-determinacy. Given that this alignment issue isn't even an issue on all major platforms, this seems like a clear case for fully-deterministic. |
My original point is that punishing semantics, with either "undefined On Tue, May 5, 2015 at 7:15 PM, Luke Wagner [email protected]
|
IIUC I'm proposing to embrace implementation-defined, you don't want that, but the current proposal has it (developer says access is aligned, and it isn't). Is that correct? We should have a wider discussion about implementation-defined behavior, so I'll be happy if we circle back to this specific issue, but I definitely want to understand what you're proposing we do! |
Again: the current proposal does not have implementation-defined behavior: load and loadUnaligned would be semantically identical (performance is not part of the specification). We can have sections explaining perf cliffs etc, but they'd be non-normative. |
OK. I'm not sure I want us to standardize these separate aligned/unaligned opcodes then. If they're the same then they should be the same. It's almost like adding branch or inlining hints. |
There are use cases which they address (I won't restate them again here), so it would be more effective to argue why the use cases don't matter or how there is a better solution. |
#34 added alignment attributes to the load and store operands. I believe that addresses the concern here, so we can close this issue. Feel free to file new issues if there are further concerns. |
Related to #23 , the current proposed bytecode has a single load op and a single store op.
The load/store ops as proposed accept addresses in bytes (not elements).
Discussion with @lukewagner and @kripken in IRC suggests that the currently intended behavior is for aligned loads/stores to work, and for unaligned loads/stores to have undefined behavior, most likely one of the following:
As proposed this would mean one of two things for the polyfill:
I believe this is unacceptable. I think we need a reasonable solution for this problem, so that the vast majority of applications can work in the polyfill and have acceptable performance. In the long run people will be running wasm applications via a native VM, at which point this won't matter, but it's my belief that the polyfill will stick around for years so it will still be important that these basic scenarios work acceptably.
My preferred solution would be to have 'aligned' and 'unaligned' load/store operations, similarly to
movups
/movaps
on x86 and mirroring what is currently expressible in JS (HEAPU32[i]
vsHEAPU32[i >> 2]
vs(memcpy(temp, i, 4), temp[0])
). This would allow compilers to produce code that will be functional in all cases, and efficient in scenarios where unaligned and aligned accesses are equivalent. In scenarios where they are not, this 'safe' code would still run, at some performance cost. In cases where performance is desirable, a compiler can generate aligned accesses that would have undefined behavior in edge cases, as we have in the current C/C++ spec.Essentially this would allow the high performance of native C/C++ to be maintained by generating fast, unsafe 'aligned' operations, which matches the current behavior of emscripten and is conformant to the spec. Simultaneously it would allow languages without this UB black hole, like C#, to generate code that uses aligned stores in most cases and unaligned stores in the places where it needs them.
There are some alternate solutions here to consider, for example:
The proposed behavior isn't a divergence from current JS, since unaligned accesses against typed arrays don't exist. On the other hand, there are enough scenarios where unaligned accesses matter (file IO, byte-wise unpacking - like the proposed binary format - and GPU buffers among others) that I think we should handle it with care.
The text was updated successfully, but these errors were encountered: