Maximum alignment? #217


Closed
jfbastien opened this issue Jan 26, 2016 · 34 comments

@jfbastien
Member

What's the maximum alignment for load/store?

We should probably specify something, e.g. log2(address_space_bytes), and test for it.
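A minimal sketch of that range check (the function name and the wasm32 bound are assumptions for illustration, not spec text): since the binary format stores the alignment as a log2 exponent, the check is just a bound on the exponent.

```python
# Hypothetical range check: cap the alignment exponent at
# log2(address_space_bytes), i.e. 32 for a 4 GiB wasm32 address space.
MAX_ALIGN_EXPONENT = 32

def validate_align_exponent(align_log2: int) -> bool:
    """Reject out-of-range alignment immediates during validation."""
    return 0 <= align_log2 <= MAX_ALIGN_EXPONENT

assert validate_align_exponent(2)       # e.g. a naturally aligned i32 access
assert not validate_align_exponent(33)  # exceeds the address space
```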

@mbodart

mbodart commented Jan 26, 2016

By "test for it", I assume you mean just a range check on the alignment value.

If I'm interpreting AstSemantics.md correctly, the alignment is just a performance hint, not a "trappable" guarantee. Is this a correct interpretation?

@jfbastien
Member Author

Correct, check it in the implementation, and have corresponding tests in the spec test suite.

Your interpretation is correct.

@ghost

ghost commented Jan 26, 2016

This came up before somewhere, and I recall some support for the page size being the maximum. If the minimum is one then this could be encoded in 4 bits. But perhaps 256 is also a reasonable limit, encoded in 3 bits, or even 16 encoded in 2 bits. The v8 encoding only has one bit for the alignment - the natural alignment flag.

@kg
Contributor

kg commented Jan 26, 2016

Don't bother worrying about bitpacking. Every stream compressor will do it for you. :-)

@ghost

ghost commented Jan 26, 2016

@kg That's not my experience. It helps the compressor to pack and avoid unnecessary redundancy. Even brotli only looks back one or two bytes. We should decide on the limits: the lower limit seems to be one, and the upper limit at most the page size. We are only interested in powers of 2, so only encode the power of two. That gives four bits at most. We might be able to do even better given knowledge of the operation's natural alignment - brotli only looks at 6 prior bits and might not even notice the difference between the opcodes, so it cannot focus its codes.

@ghost

ghost commented Jan 26, 2016

Sorry, if the lower limit is one, that is 2^0, so four bits only gets up to 2^15, which is half of the wasm page size. The other suggestions have a similar off-by-one.

@kg
Contributor

kg commented Jan 26, 2016

> @kg That's not my experience. It helps the compressor to pack and avoid unnecessary redundancy. Even brotli only looks back one or two bytes. We should decide on the limits, the lower limit seems to be one and the upper limit at most the page size. We are only interested in powers of 2, so only encode the power of two. The gives fours bits at most. We might be able to do even better given knowledge of the operations natural alignment - brotli only looks at 6 prior bits and might not even notice the difference between the opcodes to focus the codes.

I tested this specific question and talked to one of the designers of Brotli. Once you split streams, the alignment values will be a stream of 1 (or 4) byte values which can be arithmetic coded or otherwise represented very efficiently. If we bitpack the alignment into 3-5 bits with other unrelated data in the other bits, what we're actually doing is destroying some of the potential for the compressor to efficiently encode either chunk of data.

Encoding the alignment as a power of two is a perfectly reasonable idea, but if you cap the alignment at some arbitrary threshold so you can pack bits, you may regret it. Vectors >256bytes in size are an inevitability, as are user-defined structures of size >256bytes.

@ghost

ghost commented Jan 26, 2016

@kg I really doubt that is correct, but we can settle it with a test. Here's why: most of the alignments will be the natural alignment. In a separate stream they will jump around and there is no context. Whereas inline, there is the context of the operator, which could predict the natural alignment most of the time; most of the time it would be just one bit. What we might be able to do is optimize the encoding so that 0 means the natural alignment; then, when packed in a separate stream, it would compress very well.

@kg
Contributor

kg commented Jan 26, 2016

> @kg I really doubt that is correct, but we can settle it in a test.

I tested it. It's in js-astcompressor, and you can also see some general things to this effect in the JSZap paper. If you are confused I would be happy to explain it to you in depth offline.

> In a separate stream they will jump around and there is no context. Whereas in inline there is the context of the operator which could predict the natural alignment most of the time, most of the time it would be just one bit.

This does bring up an important detail. If your stream splitting is coarse-grained, like 'here is all the uint8s', you're correct that the values will be harder to predict. That's not the right way to do stream splitting (though it is, in fact, better than nothing). You want to split streams semantically, so your individual streams are things like 'opcodes', 'int8 constant immediates' and 'memory load alignments'.
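The semantic splitting described above can be sketched as a lossless transform (all names and the opcode values here are invented for illustration); each per-field stream can then be handed to the compressor on its own:

```python
def split_streams(records):
    """Split interleaved (opcode, align) records into semantic streams,
    so each stream holds one kind of value and compresses independently."""
    opcodes = bytes(op for op, _ in records)
    aligns = bytes(al for _, al in records)
    return opcodes, aligns

def merge_streams(opcodes, aligns):
    """Inverse transform: reinterleave the per-field streams."""
    return list(zip(opcodes, aligns))

# Hypothetical (opcode, log2-alignment) pairs
records = [(0x28, 2), (0x29, 3), (0x28, 2), (0x2C, 0)]
ops, als = split_streams(records)
assert merge_streams(ops, als) == records  # round trip is exact
```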

@ghost

ghost commented Jan 26, 2016

@kg I think I understand your point, but have not explored it. Here's a test.

  • A: split all the load/store alignments out into a separate binary file, one byte for each alignment (e.g. hex dump 04 04 04 04 04 04 ...), taken from the zlib benchmark. Compressed to the max with brotli, this gives 393 bytes.
  • B: the same, but with natural alignments written as zero and non-natural ones as-is, one byte for each. The same compression gives 13 bytes.

The context helps. Another option would be to keep the one bit flag, in the v8 encoding, and have a separate stream for the non-natural cases, and in that case since most of the alignments are natural this side stream would be very small.
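The A/B experiment above can be reproduced in sketch form with the standard library's zlib (the comment used brotli; the corpus here is synthetic, with mixed natural alignments and about 1% under-aligned accesses, so only the relative sizes are meaningful):

```python
import random
import zlib

random.seed(42)

# Synthetic corpus: each access has a natural alignment of 1, 2, 4, or 8
# bytes, and ~1% of accesses are under-aligned relative to natural.
natural = [random.choice([1, 2, 4, 8]) for _ in range(100_000)]
actual = [1 if (n > 1 and random.random() < 0.01) else n for n in natural]

stream_a = bytes(actual)                            # A: raw alignment bytes
stream_b = bytes(0 if a == n else a
                 for a, n in zip(actual, natural))  # B: natural coded as zero

size_a = len(zlib.compress(stream_a, 9))
size_b = len(zlib.compress(stream_b, 9))
assert size_b < size_a  # the natural-as-zero stream compresses far better
```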

@mbodart

mbodart commented Jan 26, 2016

Is there any data to show that natural alignment is common?

Yes, data is typically naturally aligned when allocated. But given a pointer parameter, say "int *p", a C/C++ compiler often cannot guarantee that p is well aligned, as it most likely hasn't seen its allocation. Are we expecting C/C++ compilers to optimistically mark references with natural alignment, and only specify lesser alignment when that is clear to them?

Given that alignment is only a hint, its utility is not clear to me.

@sunfishcode
Member

We do expect C/C++ compilers to optimistically mark references with natural alignment, even when they haven't seen the allocation, and only specify lesser alignments when that is clear to them. Natural alignment is the overwhelming majority.

@kg
Contributor

kg commented Jan 26, 2016

@mbodart

> Given that alignment is only a hint, its utility is not clear to me.

The importance of the alignment hint is largely for the polyfill, but it matters on some native architectures too: If we want to do unaligned reads/writes in asm.js we have to jump through some obscene hoops to make them work. So if we don't know the alignment status of any memory operations, we eat a performance hit on all memory traffic.

The alignment hint allows a compiler to say 'it's okay to break me if I lie about my alignment' in the common case so that you only get bad performance in the spots where it's needed.

I agree with @sunfishcode that natural alignment is probably the overwhelming majority in practice. I do think it's the case that some compilers will not trivially be able to flag loads/stores as aligned without doing a full global analysis, though...

@mbodart

mbodart commented Jan 26, 2016

I think we're in agreement, as long as when you say "it's okay to break me" you really mean "it's okay to lessen my performance, but not crash".

I'm not so concerned about the MVP. But when SIMD comes along, the current alignment definition does not allow an engine to choose movaps instead of movups (on x86), as movaps would fault if unaligned. This may be moot, as movups performs well on more recent CPUs when the data is aligned.

@sunfishcode
Member

I believe wasm's current alignment hints do permit implementations to use movaps for SIMD, provided they're capable of catching the hardware trap and fixing things up.

@ghost

ghost commented Mar 27, 2016

This seems to have been resolved in the binary format, which has a maximum alignment equal to the access's natural alignment. Or is it still open to discussion to increase this maximum?

@lukewagner
Member

Agreed.

@sunfishcode
Member

sunfishcode commented Apr 22, 2016

This is not yet implemented in ml-proto. For example, memory.wast tests that (module (memory 0) (func (i32.load align=8 (i32.const 0)))) validates.
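The missing check can be sketched as follows (the opcode table and function names are illustrative, not ml-proto's actual code). The memory.wast case above, i32.load align=8, would be rejected because 8 exceeds i32's natural 4-byte alignment:

```python
# log2 of each access's natural alignment (illustrative subset of opcodes)
NATURAL_EXPONENT = {
    "i32.load": 2, "i64.load": 3, "f32.load": 2, "f64.load": 3,
    "i32.load8_s": 0, "i32.load16_u": 1,
}

def check_alignment(op: str, align: int) -> bool:
    """Alignment must be a power of two no greater than the access width."""
    is_pow2 = align > 0 and (align & (align - 1)) == 0
    return is_pow2 and align <= (1 << NATURAL_EXPONENT[op])

assert check_alignment("i32.load", 4)
assert not check_alignment("i32.load", 8)  # the case memory.wast currently accepts
```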

@sunfishcode sunfishcode reopened this Apr 22, 2016
@sunfishcode sunfishcode added this to the MVP milestone Apr 22, 2016
@dschuff
Member

dschuff commented Apr 22, 2016

Actually, would it make sense to at least allow say 4 or 8, even for smaller-sized memory references? I could totally imagine there being plenty of e.g. byte-sized loads with alignments of 4, and I don't see any reason to pessimize those. Is there some particular reason the binary format doesn't allow that?

@dschuff
Member

dschuff commented Apr 22, 2016

(By the way I noticed that @sunfishcode committed an LLVM patch limiting the alignment on loads and stores to natural alignment. I guess that was just so that its output could be properly encoded with the 0xb format today?)

@flagxor
Member

flagxor commented Apr 22, 2016

Was size savings the only argument for capping at natural alignment?
Seems a shame to throw away information we have in LLVM.
Though I admit implementations probably won't get around to using it for a bit.
Dan, Ben, others, is there any particular reason not to allow up to page size?

@titzer
Contributor

titzer commented Apr 25, 2016

As a hint, it can't do any damage.

@jfbastien
Member Author

Could we allow alignment hints on load / store up to page size?

@kripken
Member

kripken commented Apr 25, 2016

Could that be encoded in a way that doesn't increase the binary size noticeably?

@jfbastien
Member Author

If it's frequent enough to matter then macro-compression will pick it up. If a user is OK dropping hints to lose potential perf then it can be stripped away.

The hint is useful in guiding speculative optimizations by the JIT (test and recompile, otherwise main code assumes alignment is correct). You can still do that speculation without the hint, but why throw away information LLVM has?

@titzer
Contributor

titzer commented Apr 25, 2016

> If it's frequent enough to matter then macro-compression will pick it up. If a user is OK dropping hints to lose potential perf then it can be stripped away.
>
> The hint is useful in guiding speculative optimizations by the JIT (test and recompile, otherwise main code assumes alignment is correct). You can still do that speculation without the hint, but why throw away information LLVM has?

The question is whether encoding page size alignment (a 3-byte LEB) is worth it, versus just specifying natural alignment.

@kripken
Member

kripken commented Apr 25, 2016

> If it's frequent enough to matter then macro-compression will pick it up.

Sure, but perhaps to the detriment of other things the compression could have reduced. I think this is worth measuring.

@sunfishcode
Member

What kinds of optimizations would supernatural alignment hints enable? LLVM already does the main alignment-related optimizations, such as merging adjacent memory accesses into larger accesses.

@jfbastien
Member Author

> What kinds of optimizations would supernatural alignment hints enable? LLVM already does the main alignment-related optimizations, such as merging adjacent memory accesses into larger accesses.

Depends on the architecture. The basic one is to speculate that accesses are aligned: vectorize or merge some operations, while avoiding the scalar entry/exit loops as well as the isn't-aligned loop. Test, and fall back to the interpreter on failure; recompile if that happens too often. Less code is executed dynamically and less is generated statically. That has paid off in my experience.

@sunfishcode
Member

sunfishcode commented Apr 26, 2016

@jfbastien LLVM already does that optimization. If it has a bunch of sequential memory accesses and it knows the alignment of the whole group, and it's not too much work to marshal the data around for a merged access, it'll merge them itself.

@jfbastien
Member Author

> @jfbastien LLVM already does that optimization. If it has a bunch of sequential memory accesses and it knows the alignment of the whole group, and it's not too much work to marshal the data around for a merged access, it'll merge them itself.

I don't think we're talking of the same thing: I'm talking about speculative optimizations which lead to recompilation on speculation failure. One of the big costs of speculation is when the optimizer is over-eager and assumes things that aren't true, that's where the extra info would pay off.

@sunfishcode
Member

LLVM's own alignment information is never speculative in nature. At the LLVM level, it's full-octopus UB if the alignment isn't correct.

In general, wasm's alignment hint works well for these kinds of "highly confident" cases. If we change wasm's alignment hint to be used for more speculative use cases, such as humans decorating their own code, then VMs will have to be more conservative with it, which would weaken it for the high-confidence use cases.

@jfbastien
Member Author

> LLVM's own alignment information is never speculative in nature. At the LLVM level, it's full-octopus UB if the alignment isn't correct.

Indeed, but wasm can't take that for granted. Using that information for speculation will therefore always be correct, unless the developer gives incorrect information.

> In general, wasm's alignment hint works well for these kinds of "highly confident" cases. If we change wasm's alignment hint to be used for more speculative use cases, such as humans decorating their own code, then VMs will have to be more conservative with it, which would weaken it for the high-confidence use cases.

That's not what I'm suggesting: I'm saying that the VM can speculate because the developer hinted that the alignment value was correct. These types of high-confidence cases are exactly where we want to avoid throwing away information.

@sunfishcode
Member

If the producer has high confidence, it can just do the optimization itself. LLVM already does this. Other optimizing compilers commonly do such things. It wouldn't be too difficult to implement this in a relatively lightweight JIT library. Doing this on the producer side reduces time, complexity, and unpredictable performance on the consumer side, so it's worth encouraging.

While speculative optimizations with bailouts and recompilations may be easy to support in some of today's JS-engine-based wasm implementations, there are advantages to building wasm in a way that minimizes its dependence on them to achieve good performance. Greater alignment hints to support greater speculative optimization is one path we might take, but from a broader perspective there are other paths as well.

@lukewagner
Member

lukewagner commented Apr 27, 2016

> The question is whether encoding page size alignment (a 3-byte LEB) is worth it, versus just specifying natural alignment.

@titzer Since alignment in flags is currently encoded as log2(alignment), 64k-alignment would still fit in 1 byte.

One thing that bugs me about the current design is that the number of reserved bits depends on the load/store width. That seems bug-prone for everyone involved; it means that if we, e.g., use the following bits for ordering annotations, they start at a different bit for i32.load than for i64.load. Simply reserving a fixed number of bits for the max alignment we think we might ever need seems simpler. 2 bits only buys 8-byte alignment, which is already too small for i32x4; 3 bits bumps that way up to 128-byte alignment, but that is "only" 2x bigger than AVX-512, so it still seems shortsighted. 4 bits provides 32k-alignment, which seems like truly enough alignment for anyone. So how about just fixing alignment to 4 bits in flags?
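A sketch of that fixed-width layout (the helper names are invented; wasm does store log2(alignment) in the flags immediate, but the 4-bit reservation here is the proposal, not the current format): with 4 bits the exponent runs 0..15, i.e. up to 32 KiB alignment, and any later flag bits start at bit 4 regardless of access width.

```python
ALIGN_BITS = 4
ALIGN_MASK = (1 << ALIGN_BITS) - 1  # low 4 bits: log2(alignment), up to 2**15 = 32 KiB

def encode_flags(align_log2: int, other_flags: int = 0) -> int:
    """Pack log2(alignment) into a fixed low-bit field of the flags immediate."""
    assert 0 <= align_log2 <= ALIGN_MASK
    return (other_flags << ALIGN_BITS) | align_log2

def decode_flags(flags: int):
    """Recover (align_log2, other_flags); later bits sit at a fixed position."""
    return flags & ALIGN_MASK, flags >> ALIGN_BITS

assert decode_flags(encode_flags(4, other_flags=1)) == (4, 1)
assert encode_flags(15) == 15  # max exponent: 2**15 = 32k-alignment
```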

(I'm pretty dubious about autovectorization opportunities in wasm, though. For one thing, I think it benefits the whole platform if wasm engines aren't trying to do fancy optimizations since these add cost (time and predictability) and, once the race begins, engines can be forced into a situation where they end up hurting the common case for the benefit of benchmarketting; we've seen this with JS. Instead, we should strive to expose the underlying mechanism (viz., SIMD ops) so that wasm generators can be explicit about what they want and not rely on magic. So I'd wish we could strike autovectorization as an argument for super-natural alignment (or anything else, for that matter).)

@ghost

ghost commented Apr 27, 2016

@lukewagner Four bits was my suggestion way above too. Could you please also consider assigning the natural alignment the value zero, to help compression?

@lukewagner
Member

If we can fix immediates in the opcode table (we're working on a prototype impl so we can make a real proposal soon), there won't be any immediate in the common case, so it seems like we won't have to worry about compression effects of the immediate.

@rossberg
Member

Check now implemented.
