Alignment will probably require implementation-defined behavior #105
Comments
Is there any documentation available on these ARM architectures? I'm interested in learning more. |
Me too. Specifically, I wonder if those ARM implementations just silently do the rounding (that would be exactly what JS typed arrays do, ironically :) ? Or do they trap? |
I seem to remember your proposal roughly being the consensus from prior discussions. Obligatory aligned/unaligned distinction, with unaligned operations Always Working but possibly being slow, and aligned-with-unaligned-address being potentially undefined seems good to me, albeit a little gross. That distinction is already really important for the polyfill to be remotely usable without breaking applications that do unaligned loads/stores. The last time I shipped ARM code (on a particular handheld console), it trapped on unaligned accesses in some scenarios (non-32-bit load/store) and was Just Slow in other cases. I think in some cases you can configure the behavior, so it might depend on the OS/host application and not just the hardware. |
I thought we had agreed to have explicit alignment to a specific byte number (not just true/unknown). The rest is what I recall: if the program lied then implementation-defined behavior occurs. I wouldn't spec the sanitizers: they can either be done by the developer-side compiler, or by the implementation (maybe behind a flag). I see sanitizers as tools that should "just work", so there's no need to spec them. The ARM specs aren't accessible publicly, but you can get the PDF for free by registering. This behavior, IIRC, is pre-ARMv7 and in some R and M profile CPUs. Most ARM CPUs sold in consumer devices recently are ARMv7 A profile, or ARMv8, but it would be nice for Web Assembly to work on these other CPUs which are often used in smaller IoT devices (you know we want Web Assembly to be IoT compliant!!!). |
Here's a link to a section in the ARM architecture reference manual: On Tue, Jun 2, 2015 at 7:02 PM, Dan Gohman [email protected] wrote:
|
By my reading of the documentation: ARMv5 and earlier have the alignment-rounding problem. ARMv6 has multiple configuration modes. The "Legacy" mode behaves like ARMv5. However, many popular ARMv6 implementations, such as Linux on Raspberry Pi, seem to use one of the newer modes that don't have the problem. In ARMv7 and ARMv8, the documentation I have says that the "Legacy" configuration mode is no longer present, and they don't have the problem. Assuming I didn't miss anything, this appears to come down to a question of the limits of portability (#38). Is ARMv5 or ARMv6-in-legacy-mode worth supporting, at the cost of weakening the spec wrt alignment? |
Thanks for summarizing this! ARMv5 is pretty old. I think we'd have to have a super good argument in its favor if we wanted to complicate the spec with it. -Fil
|
For us, only ARMv7 THUMB/THUMB2 matter. Of course we aren't in a vacuum so I'm fine making concessions where necessary, but it doesn't sound like ARMv5/legacy mode is important enough to weaken the spec. |
Good catch, Dan. I also just verified that the arm64 specification only requires alignment V8 cares about architectures in roughly this order: X64, ia32, arm, arm64, I'll do some digging into those few at the end and see if there are any On Tue, Jun 2, 2015 at 9:52 PM, Dan Gohman [email protected] wrote:
|
I've checked with some MIPS and PPC experts and the result is this: no problem on PPC (should be Intel-fast), and MIPS cores trap to the kernel for emulation, but chips are coming that just do it in hardware. So it looks like we're all good if we make the reasonable decision to ignore 10-year-old ARM cores. I'll double check with the folks at ARM, though. |
@titzer it's not just older ARM cores: it's low-power / embedded ones too. I've talked to folks running node.js on tiny chips inside lightbulbs; do we care about this type of user? To what degree? I'm probably OK saying: we expect fully compliant Web Assembly implementations to have behavior X, but some not-too-compliant implementations could do Y. I'd rather not ban this behavior outright because I think the use case matters. It would be nice to have a compliance suite, and implementations can list how they diverge from the spec. When it's "benign" divergences like this I think it's fine. |
Would those older ARM cores and tiny low-power embedded chips have larger divergences from "normal" behavior than the polyfill will? Given wasm code that properly annotates the alignment of loads and stores (never says they are aligned when they aren't), both those chips and the polyfill will perform properly, is my understanding correct? |
On Thu, Jun 4, 2015 at 8:10 PM, Alon Zakai [email protected] wrote:
Cores that trap will go to the kernel and the user program only pays when
|
@jfbastien Can you be more specific about which models of ARM cores these are? I've checked ARMv7-R and ARMv7-M documentation and both are ok here. |
Looks like ARMv6-M is good too. |
@titzer: not sure I follow? If a load/store is marked as aligned, then it doesn't need to pay any cost, does it? The VM can emit an aligned access, and if the code lied and it turns out unaligned, it's ok that it drops the lower bits - just like the polyfill does. And if the load/store is marked as unaligned, then a slow path would be taken, definitely paying a cost, but likewise, around the same as the polyfill pays. And in practice we hope little code would be marked as unaligned, so both polyfill and older/smaller CPUs would be ok. I feel like the older/smaller CPU case is very similar to the polyfill, overall. Am I missing something? |
On Thu, Jun 4, 2015 at 8:51 PM, Alon Zakai [email protected] wrote:
|
I still don't understand why a claimed-aligned access would require a mask. Why not just emit an access without a mask, on these old/small CPUs? (It might silently drop some bits, but that's what the mask would have done anyhow?) |
On Thu, Jun 4, 2015 at 9:03 PM, Alon Zakai [email protected] wrote:
|
We specifically don't want to be bound by present-day limitations of JS semantics in the long term, so we don't want to get too accustomed to saying "the polyfill did XYZ, so it's ok if other implementations do that too". |
@titzer: Yes, but that is exactly as in the polyfill, and we allow it, don't we? I may have a big misunderstanding here. I was under the impression that if one lied about alignment, claiming it was aligned when it wasn't, then we said that was not fully specified. And the polyfill would then be free to do the "wrong" thing by dropping the lower bits, thus letting it remain fast (otherwise, each load would need to support the case of it being unaligned). In practice, this is fine because the compiler should know what is aligned and what might not be, and we can mark the rare loads which might not be, as unaligned. But 99% of them would be aligned, and fast in the polyfill, and correct in the polyfill. Did I get that wrong? Are we not saying that claiming alignment but lying leads to implementation-defined behavior? |
@sunfishcode: I 100% agree. I wasn't saying that the polyfill does it so it's fine. I am saying that I understood what the polyfill did to be fine because of reason X, and that reason X is valid in itself, and it looks like X applies to old/weak CPUs too. Unless I have misunderstood X all this time? |
Just in case it wasn't clear from start, the goal here was: 1.) If you promise an access is aligned, and it is, you pay nothing, not On Thu, Jun 4, 2015 at 9:22 PM, Dan Gohman [email protected] wrote:
|
@titzer: Yes! :) And is not (2) covered by emitting a load without a mask on those old/small CPUs? You get a forcibly aligned result, which is one of the options you listed. That's all I've been saying here: aligned loads/stores do not need masks in the polyfill nor on old/small CPUs, assuming those CPUs just ignore the lower bits. So both can be fast on aligned code, and also correct if actually aligned, so they are quite similar in that respect. (edit: by "masks in the polyfill" i mean "written in the JS code". While of course the VM must emit a mask, because it is JS and has precise semantics. But if the underlying CPU were a weak/old one which itself drops the lower bits and force-aligns, then the VM could actually avoid that, as if the hardware were specialized for typed arrays being aligned ;) |
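To make the "forcibly aligned result" concrete, here is a minimal Python sketch of the two behaviours being compared in this part of the thread. It is only an illustration under stated assumptions: the function names are hypothetical, memory is modelled as a little-endian byte string, and the masked load stands in for both the polyfill's typed-array indexing and a CPU that ignores the low address bits.

```python
import struct

def load_i32_masked(mem: bytes, addr: int) -> int:
    """Word load that silently drops the low two address bits
    (hypothetical model of a force-aligning access)."""
    aligned = addr & ~3  # round down to a 4-byte boundary
    return struct.unpack_from("<I", mem, aligned)[0]

def load_i32_exact(mem: bytes, addr: int) -> int:
    """Word load that honours the full address ("Just Works")."""
    return struct.unpack_from("<I", mem, addr)[0]

mem = bytes(range(16))

# Aligned address: both loads agree, so a truthful alignment claim costs nothing.
same = load_i32_masked(mem, 4) == load_i32_exact(mem, 4)

# Misaligned address: the masked load reads the word at address 4 instead of 5.
masked_value = load_i32_masked(mem, 5)
exact_value = load_i32_exact(mem, 5)
```

The two functions only differ when the address is actually misaligned, which is exactly the case where the access was annotated as aligned but "lied".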
On Thu, Jun 4, 2015 at 9:30 PM, Alon Zakai [email protected] wrote:
—
|
The other side here is that we have yet to actually name a CPU here which we really care about which actually needs implementation-defined behavior. Unless this changes, it'd be great to just stick with our current rules, which don't have the implementation-defined behavior part. |
@titzer: Ok, good, now I think we are on the same page. Given
Then in practice, what difference does it make if Is there a practical, concrete benefit to not calling this implementation-defined behavior? |
Every bit of implementation-specific behavior we add is an opportunity for applications to behave differently across different implementations. I'm not opposed to all implementation-specific behavior, but it'd be nice if someone could name something more interesting than ARMv5 before we accept it here. |
Actually a second round with MIPS folks was less promising. Apparently some I'm not clear on why we want an alignment annotation if it doesn't make any On Thu, Jun 4, 2015 at 9:57 PM, Dan Gohman [email protected] wrote:
|
The worry on these platforms is that regular accesses either need to be split up into byte accesses and then merged, or signal handling must be used. This isn't a "pay for what you use" approach to performance: you may have no unaligned accesses and performance will suffer, or you'll need to use a signal handler which folks have said they don't want to mandate. See the Linux MIPS docs for details. |
@jfbastien Yes, but what is the nondeterminism buying us in those cases? If you have to branch on misaligned access anyway then you can just as well implement Just Works as something else nondeterministic. The only case I can see nondeterminism buying something is for auto-aligning platforms which would not otherwise have to branch. Is this the MIPS use case? |
... and that is just from the performance perspective. From the perspective of "I want apps that run on other platforms correctly to also run on my auto-aligning platform correctly", then you don't want to be the one oddball platform that auto-aligns; of course apps are going to randomly break for you. That's why I was saying above (and iiuc @pizlonator was also saying) that, even if nondeterminism was a choice, I'd still want to implement Just Works semantics just to minimize bustage. |
Do we have data on what the penalty for misaligned-accesses-do-weird-things platforms will be, if we require misaligned accesses to just work, but then also roll up our sleeves and actually optimize that case? I’ve been pondering this a bit. If you have profiling that tells you what the low bits of a pointer tend to look like, then you can emit optimized code that is biased for either aligned or misaligned, and you could even speculate that the pointer was already aligned which allows you to blow away repeated alignment checks on that pointer - and probably alignment checks on most pointers derived from that one, if the derivatives are just “ptr + C” where C is a multiple of the appropriate word size. Since we probably do not have such data, it seems we have the following to choose from, and the following mitigations in a subsequent version if the performance isn’t good enough: I prefer (1) because it’s the most forward-looking. I like (2) more than (3) because undef has a high likelihood of causing confusion for developers. -Filip
|
I agree with what's said above; nondeterminism in anything other than trapping-or-not doesn't help much because it just converts applications that were slow on said architectures to applications that behave wrong on the same architectures. I still believe "it's nondeterministic whether misaligned accesses trap" (misaligned means dynamic alignment is less than static alignment) is worth considering if we can't do "everything always just works". Implementations on MIPS/etc. might then choose to have two modes, "fast" (traps) and "slow" (branches). "fast" could be the default, and when a program traps (which should be rare), the implementation could, for example, automatically restart the program, blacklisting it to "slow" mode thereafter. Blessing this in the spec means that spec conformance can remain something which is done by default. And this approach would mean that there's no mandate to catch and handle signals, and it would permit "pay for what you use", addressing two of @jfbastien's concerns above. ARMv5 would just have to do "slow" mode, but there's a fair amount of agreement here that ARMv5 is old and not worth complicating the spec for. |
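The "fast"/"slow" two-mode idea described above could be sketched roughly as follows. This is a toy Python model, not an engine design: `MisalignedTrap`, `load_i32`, `run_with_fallback` and the `slow_list` set are all hypothetical names, the trap is simulated with an explicit check (real hardware would fault without one), and restarting is only sound here because the example program has no visible side effects before it traps.

```python
import struct

class MisalignedTrap(Exception):
    """Simulated hardware fault on a misaligned access."""

def load_i32(mem: bytes, addr: int, static_align: int, mode: str) -> int:
    if addr % static_align != 0:  # dynamic alignment < static alignment
        if mode == "fast":
            raise MisalignedTrap(addr)
        # "slow" mode: branch to a path that always works.
        return int.from_bytes(mem[addr:addr + 4], "little")
    return struct.unpack_from("<I", mem, addr)[0]

def run_with_fallback(program, mem: bytes, slow_list: set):
    """Run in fast mode by default; on a trap, restart in slow mode
    and remember the program so later runs start slow."""
    mode = "slow" if program in slow_list else "fast"
    try:
        return program(mem, mode), mode
    except MisalignedTrap:
        slow_list.add(program)
        return program(mem, "slow"), "slow"

def lying_program(mem: bytes, mode: str) -> int:
    # Claims 4-byte alignment but loads from address 1.
    return load_i32(mem, 1, 4, mode)

mem = bytes(range(16))
slow_list = set()
first = run_with_fallback(lying_program, mem, slow_list)
second = run_with_fallback(lying_program, mem, slow_list)
```

In this model the first run traps and is restarted, while the second run goes straight to slow mode; a program that never lies about alignment never leaves fast mode.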
The other important implementation that does masking (i.e. forcible On Tue, Jul 28, 2015 at 4:36 AM, Dan Gohman [email protected]
|
There is a plan for the polyfill. It's a little awkward, but it's an attempt at a practical strategy to break with JS semantics in certain key areas. If an implementor is thinking "the polyfill masks addresses, so why shouldn't I do it too?", we'll remind them that any time the polyfill's alignment masking actually affects anything, then the program doesn't work right under the polyfill. "Program doesn't work right" isn't something that we anticipate implementors should need to emulate [0]. [0] And we aren't worried about programs coming to depend on the polyfill semantics either, because we already know that popular native wasm implementations won't be masking. |
@titzer asked me to comment here; I work at MIPS/Imgtec on V8. As discussed above, existing MIPS cores trap on unaligned accesses. Any remotely modern kernel will fix up the unaligned load/store (same result as x86). It just works, but these accesses are slow. Newer cores (in development) will support unaligned accesses in hardware. Of course, code that claims [aligned=true] but lies could tank performance. Detection and deoptimization to safe accesses would be trivial with a signal handler (though we have avoided those due to concerns with sandboxing, etc.). There are pure software methods discussed by others above. So MIPS does not introduce nondeterminism, and the performance impact of 'Just Work when misaligned' can be mitigated over time. The debug-mode dev tool support would be excellent. |
@paul99 Just to be clear, though: on all the MIPS archs you're considering, it is possible to trap on unaligned access in user-mode? That would make the MIPS case equivalent to the slow-ARM case we've already been considering. @sunfishcode's comment suggests a hybrid solution that doesn't require any semantically-visible modes: the engine optimistically compiles with trap-on-misaligned and, after a significant number of traps, recompiles into branching (dynamically, swapping out on-stack or, if nothing else, between turns of the event loop). |
@lukewagner IIUC you don't need to trap in usermode, the kernel traps and fixes up the access for you and usermode goes on without knowing about this. |
@jfbastien I realize that, but I was asking if it was possible since that enables several of the things we've been talking about. |
On Linux MIPS, according to the docs linked to above, a process can easily choose which it wants. |
@lukewagner Yes, it's possible to disable the kernel fixups, and then install a signal handler to catch the alignment errors in user-mode. (As @sunfishcode just said :) |
Ok, thanks. So given all the above, I'm still not seeing how nondeterministic fault-on-misaligned access would help out MIPS here. What is the desired codegen? |
I may be missing your point, but I don't see any nondeterminism here. Code known to be unaligned could use byte accesses and construct the larger words. Code presumed to be aligned but with rare unaligned accesses would just work. Code with frequent unaligned accesses would also work, but would have terrible performance. I would like to detect that case and fall back to byte accesses. I don't see this as essential for MVP, but desirable in the longer term, as we see how rare or common these unaligned accesses are. I think this mostly agrees with Dan's #105 (comment) |
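As an illustration of "use byte accesses and construct the larger words", here is a small Python sketch. The helper names are hypothetical and little-endian byte order is an assumption; a real code generator would emit the equivalent byte loads, shifts and ORs in machine code.

```python
def load_u32_bytewise(mem, addr: int) -> int:
    """Assemble a 32-bit little-endian word from four single-byte loads,
    so no access is ever wider than its alignment."""
    b0, b1, b2, b3 = mem[addr], mem[addr + 1], mem[addr + 2], mem[addr + 3]
    return b0 | (b1 << 8) | (b2 << 16) | (b3 << 24)

def store_u32_bytewise(mem: bytearray, addr: int, value: int) -> None:
    """Split a 32-bit word into four single-byte stores."""
    for i in range(4):
        mem[addr + i] = (value >> (8 * i)) & 0xFF

mem = bytearray(8)
store_u32_bytewise(mem, 3, 0xDEADBEEF)   # address 3 is not 4-byte aligned
round_trip = load_u32_bytewise(mem, 3)
```

This path is correct for any address but costs several instructions per access, which is why it is only attractive for accesses annotated (or detected) as unaligned.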
@paul99 The big question being discussed in this issue is whether we should weaken the specified semantics of loads/stores from always Just Working to possibly faulting (i.e., the wasm app crashes). It sounds like that's not what you're asking for, though, which is good. |
Actually, it's possible I misread you. When you say "I would like to detect that case and fall back to byte accesses.", I assumed you meant "dynamically and transparently". That is, you'd somehow (user-mode signal handler? perf counter?) detect a lot of this misaligned access going on and then generate a new version of code that uses byte accesses and swap in this new code. Is that what you meant? |
Yes, this is exactly what I meant. From #105 (comment) there is statement "Obligatory aligned/unaligned distinction, with unaligned operations Always Working but possibly being slow". The aligned/unaligned distinction seems very valuable for MIPS, where knowing ahead that accesses will be unaligned will let us generate reasonable code for that case, and only the 'promised aligned but lied' case would give us the perf hit (which we could dynamically detect and generate replacement code). |
@paul99 Great, then it also sounds like you're happy with the current state of the design. Are there any more outstanding reasons to consider weakening the semantics of misaligned loads from Just Working or can we close this issue? |
@lukewagner Now I see where we were misunderstanding each other. In the spec I see no mention that all accesses require alignment to be specified. The discussions here seemed to all require a promise of the alignment intent, and then various methods for handling actual alignment differing from the promised alignment. Having such an attribute would be super helpful in generating the right code to start with for data that is a priori known to be unaligned, and I had presumed that would exist. Then the exceptions to that can be dealt with dynamically/transparently. All cases would just work, but the places that would tank performance would be far fewer. But if there is no distinction on alignment, I would think it would be uniformly ignored, and no one would even notice the small performance hit on Intel, for example. |
@paul99 The first sentence of the alignment section says "Each linear memory access operation also has an immediate positive integer power of 2 alignment attribute." Is this not what you mean? |
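For readers following along, the quoted rule (a positive power-of-2 alignment attribute) and the thread's definition of "misaligned" (dynamic alignment less than static alignment) can be written down as two small checks. This is an illustrative Python sketch with hypothetical function names, not text from the spec.

```python
def is_valid_alignment(align: int) -> bool:
    """True if align is a positive integer power of 2."""
    return align > 0 and (align & (align - 1)) == 0

def is_misaligned(addr: int, align: int) -> bool:
    """True if the address does not honour the declared alignment."""
    if not is_valid_alignment(align):
        raise ValueError("alignment must be a positive power of 2")
    return (addr & (align - 1)) != 0

checks = [
    is_valid_alignment(4),   # 4 is a power of 2
    is_valid_alignment(3),   # 3 is not
    is_misaligned(6, 4),     # 6 is not a multiple of 4
    is_misaligned(8, 4),     # 8 is a multiple of 4
    is_misaligned(7, 1),     # alignment 1 can never be violated
]
```

An access declared with alignment 1 makes no promise at all, which corresponds to the "unaligned/unknown" case discussed earlier in the thread.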
@lukewagner My apologies, I previously misread the first paragraph. I now see that the alignment attribute is precisely what I was looking for. The whole section sounds good to me as written. Sorry for the churn! |
No problem, it's good to talk through these issues to make sure we understand. Think we can close this issue @titzer? |
Hi, @titzer asked me to comment for the ARM architecture; I work for ARM on V8. The current proposal works on all modern ARM cores: they either support unaligned accesses or provide mechanisms to enable emulation (with OS support). The beginning of this thread touches on the behaviour of older ARM cores. Those cores (ARMv5 and earlier) have a peculiar behaviour which would introduce nondeterminism with this proposal. For example, for a word load, if the address is not aligned, the loaded value is rotated right by 8 times the value of bits[1:0] of the address. And those cores can be either little or big endian. All those cores were superseded a while ago, so it may be acceptable to ignore them. Regards, |
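The rotation behaviour described for ARMv5-era word loads can be modelled in a few lines. This Python sketch is a simplified illustration only: `armv5_ldr` is a hypothetical name, it assumes a little-endian configuration, and it ignores configuration bits that can change the behaviour on real cores.

```python
def armv5_ldr(mem: bytes, addr: int) -> int:
    """Model of a legacy unaligned word load: read the aligned word,
    then rotate it right by 8 * addr[1:0] bits."""
    aligned = addr & ~3
    word = int.from_bytes(mem[aligned:aligned + 4], "little")
    rot = 8 * (addr & 3)
    return ((word >> rot) | (word << (32 - rot))) & 0xFFFFFFFF

mem = bytes([0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88])

aligned_value = armv5_ldr(mem, 0)   # plain aligned load
rotated_value = armv5_ldr(mem, 1)   # same word, rotated right by 8 bits
true_unaligned = int.from_bytes(mem[1:5], "little")  # what "Just Works" would return
```

Note that the rotated result matches neither the forcibly aligned word nor the true unaligned word, which is why this behaviour is harder to accommodate than simple masking.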
Thanks for that information! So this also suggests we don't need any changes from what's in the spec. @titzer any other issues to consider before closing? |
No, it sounds good. I'm OK with keeping the high road and going for that I'll probably do some experiments when we've got some example workloads, On Thu, Aug 6, 2015 at 7:13 PM, Luke Wagner [email protected]
|
It seems that some ARM implementations may ignore the low order bits of unaligned memory accesses and thus round down to the next aligned address. That would mean that every access that the engine cannot prove is properly aligned would need a dynamic check (since these processors won't cause a hardware fault). That may be too slow or too much code.
Would it be reasonable to spec aligned/unaligned accesses thusly?
For both kinds of accesses we could specify a sanitizer mode that will trap on Load/Store[aligned=true](actually not aligned) and profile or warn on Load/Store[aligned=unknown](actually not aligned).
The above would allow the engine to omit checks for the [aligned=true] case, accepting whatever the hardware does, but still require it to emit checks for [aligned=unknown] on these processors.