Update to newest Rust #14
After optimisation, LLVM converts the IR program into a directed acyclic graph (DAG), where the nodes are the instructions, with subnodes being their operands, and so on. It then tries to match this DAG against the patterns specified in the target description files. This error is simply saying that after instruction selection, LLVM was not able to find a match for the DAG.

The fix should be to change the pattern on the unsigned multiplication instruction so that this DAG matches. Of course, that only fixes the unsigned integer situation; the same pattern needs to be added to the signed multiplication instruction. I'll commit it now.

Note that I also have an unpushed branch of LLVM trunk. It sounds like you're also working on that - shall I push it?
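For concreteness, here is a minimal, hypothetical Rust snippet (not from this thread) of the sort that exercises this lowering path: a 16-bit multiply is wider than the AVR's 8-bit registers, so instruction selection must find a target pattern for the resulting `mul` DAG node, and fails with an error like the one above if no pattern covers it.

```rust
// Hypothetical reproduction: the 16-bit multiply below produces a `mul`
// DAG node that instruction selection must match against the target's
// multiplication patterns; if no pattern covers this node, llc reports
// a "cannot select"-style error like the one discussed above.
#[no_mangle]
pub extern "C" fn mul16(a: u16, b: u16) -> u16 {
    a.wrapping_mul(b)
}
```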
Thanks for the great explanation. While I have an OK understanding of general compiler workings, I don't have much concrete experience. Being pointed to specific files helps me see where and how some of the changes could be made. I don't want to just report issues; I'd like to at least try to help fix them too 😉.

I do have a local merged branch, but I don't see any problem in continuing to merge in new stuff you push. My hope is that I can help weed out some more cases like this as you are in the process of getting things merged into LLVM proper.
I don't know what I'm doing yet, but I can see some patterns and deviations from those patterns... Why is …?

And one more: why do …?

Ok, …
LLVM places pretty much all of its target descriptions inside TableGen files, a DSL which compiles down to C++ files with the `llvm-tblgen` tool.

For whatever reason, the Atmel engineers gave the other multiplication instructions a slightly different encoding (there are several cases of this, sadly), so we need to work around it. In case you want to have a look, here is the instruction set manual.
Yeah, I was looking at the datasheet to start with, which has an abbreviated view of the instructions and doesn't make note of the restrictions on source registers. I've since found the same detailed reference you posted. :-)
From memory, they always place their product in the registers `R1:R0`.

If I'm reading the docs correctly, I think all 6 …
Ah, you're right. It's probably missing because we don't currently generate the other instructions yet (no patterns); it still should be there, though.
Which probably explains why only …

Yeah.
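Stepping back for a moment, here is a hedged sketch of what the `MUL` discussion above looks like from the Rust side: an 8-bit widening multiply is exactly the shape the hardware instruction provides (how the fork actually lowers this is my assumption).

```rust
// Sketch (assumes the AVR backend lowers this to a single hardware MUL):
// the CPU multiplies two 8-bit registers and leaves the 16-bit product in
// the fixed register pair R1:R0, which the compiler must then move out
// and, per the avr-gcc convention of keeping R1 zeroed, clear afterwards.
pub fn widening_mul(a: u8, b: u8) -> u16 {
    (a as u16) * (b as u16)
}
```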
Just to make sure, you haven't yet pushed this, correct? If you have, I'm not sure where; I'm checking the branch list to see when it pops up.
No, I haven't pushed this. I'm hitting an assertion error which I'm working on fixing.
Yeah, I'm getting the same problem. I've just posted on the llvm-dev mailing list about it.
I'm following along there as best I can ;-)

A bit of bright news: I got an LED to light up at a 100% / 50% / 1% duty cycle. Notably, that doesn't require any math to accomplish. In the meanwhile, I've copied out large parts of … However, if I keep all the code in one file, it compiles fine...
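In case it helps anyone following along, here is a hedged, bare-metal sketch of the no-math duty-cycle trick. The register addresses are for an ATmega328P and are an assumption on my part; check your part's datasheet.

```rust
#![no_std]
#![no_main]

use core::ptr::{read_volatile, write_volatile};

// Assumed memory-mapped I/O addresses for an ATmega328P; other parts differ.
const DDRB: *mut u8 = 0x24 as *mut u8;
const PORTB: *mut u8 = 0x25 as *mut u8;
const LED: u8 = 1 << 5; // PB5, the on-board LED on many boards

// Crude busy-wait; the volatile read keeps the loop from being optimized away.
fn delay() {
    for _ in 0..10_000u16 {
        unsafe { read_volatile(PORTB) };
    }
}

#[no_mangle]
pub extern "C" fn main() -> ! {
    unsafe { write_volatile(DDRB, LED) }; // configure the LED pin as output
    loop {
        unsafe { write_volatile(PORTB, LED) }; // LED on
        delay();
        unsafe { write_volatile(PORTB, 0) }; // LED off: equal delays ≈ 50% duty
        delay();
    }
}

#[panic_handler]
fn panic(_info: &core::panic::PanicInfo) -> ! {
    loop {}
}
```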
Looks really good :) Yes, that's a strange error... Feel free to create an issue for that, because we're doing something wrong. If you post code, make sure to pass … Basically, we are able to pass the generated IR file from Rust straight into the LLVM tools.
Ah great, I knew there was a way to use the LLVM IR to finish the compilation, but didn't know what it was. And …
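For anyone else wondering, the workflow presumably looks something like this. The flags are standard `rustc`/`llc` options, but the AVR piece assumes the fork's out-of-tree target, so treat it as a sketch:

```sh
# Emit LLVM IR from rustc instead of compiling all the way down.
rustc --emit=llvm-ir -O lib.rs    # writes lib.ll
# Hand the IR to the fork's llc to finish code generation for AVR.
llc -march=avr lib.ll -o lib.s    # writes AVR assembly
```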
Well, … So that probably means I'm using it incorrectly.
I just remembered that bugpoint doesn't actually check the output of the …
Yup. I was letting it fail on any failure, not just the one I wanted. Now I've got a handle on it and submitted. I guess I should try to find the other failure too.
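For reference, the way to make bugpoint key on one specific failure is a custom test script. The `-compile-custom`/`-compile-command` flags come from LLVM's bugpoint documentation, but the script contents and the matched message here are illustrative assumptions:

```sh
# Only the failure we care about counts as "interesting": the script exits
# non-zero when the specific ISel error appears, so bugpoint ignores
# unrelated crashes while reducing the test case.
cat > interesting.sh <<'EOF'
#!/bin/sh
if llc -march=avr "$@" 2>&1 | grep -q 'Cannot select'; then
    exit 1   # the bug reproduced: keep shrinking this input
fi
exit 0       # some other failure (or success): not interesting
EOF
chmod +x interesting.sh

bugpoint -compile-custom -compile-command=./interesting.sh reduced.ll
```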
Done.
And I went ahead and opened another issue for the original multiplication problem.
I've just updated LLVM to master (it requires updating LLVM; I left a …)
@shepmaster can you give me the IR for the original multiplication issue of this thread (preferably bugpoint'ed, but no worries)? If you look at …
Yup, not a problem. I merged …
Good point. I've updated that issue.
I went ahead and updated. While LLVM itself seemed to compile successfully, I am getting issues with the Rust wrapper around LLVM. So that should be exciting to try and fix: …

I expect I'll have to dig through the LLVM changes to see what needs to be modified.
Closing due to this being super stale.
Add small-copy optimization for copy_from_slice

## Summary

During benchmarking, I found that one of my programs spent between 5 and 10 percent of the time doing memmoves. Ultimately I tracked these down to single-byte slices being copied with a memcopy. Doing a manual copy if the slice contains only one element can speed things up significantly. For my program, this reduced the running time by 20%.

## Background

I am optimizing a program that relies heavily on reading a single byte at a time. To avoid IO overhead, I read all data into a vector once, and then I use a `Cursor` around that vector to read from. During profiling, I noticed that `__memmove_avx_unaligned_erms` was hot, taking up 7.3% of the running time. It turns out that these were caused by calls to `Cursor::read()`, which calls `<&[u8] as Read>::read()`, which calls `&[T]::copy_from_slice()`, which calls `ptr::copy_nonoverlapping()`. This one is implemented as a memcopy. Copying a single byte with a memcopy is very wasteful, because (at least on my platform) it involves calling `memcpy` in libc. This is an indirect call when libc is linked dynamically, and furthermore `memcpy` is optimized for copying large amounts of data at the cost of a bit of overhead for small copies.

## Benchmarks

Before I made this change, `perf` reported the following for my program. I only included the relevant functions, and how they rank. (This is on a different machine than where I ran the original benchmarks. It has an older CPU, so `__memmove_sse2_unaligned_erms` is called instead of `__memmove_avx_unaligned_erms`.)

```
#3  5.47%  bench_decode  libc-2.24.so  [.] __memmove_sse2_unaligned_erms
#5  1.67%  bench_decode  libc-2.24.so  [.] memcpy@GLIBC_2.2.5
#6  1.51%  bench_decode  bench_decode  [.] memcpy@plt
```

`memcpy` is eating up 8.65% of the total running time, and the overhead of dispatching to a specialized fast copy function (`memcpy@GLIBC` showing up) is clearly visible. The price of dynamic linking (`memcpy@plt` showing up) is visible too.

After this change, this is what `perf` reports:

```
#5   0.33%  bench_decode  libc-2.24.so  [.] __memmove_sse2_unaligned_erms
#14  0.01%  bench_decode  libc-2.24.so  [.] memcpy@GLIBC_2.2.5
```

Now only 0.34% of the running time is spent on memcopies. The dynamic linking overhead is not significant at all any more. To add some more data, my program generates timing results for the operation in its main loop. These are the timings before and after the change:

| Time before | Time after | After/Before |
|---------------|---------------|--------------|
| 29.8 ± 0.8 ns | 23.6 ± 0.5 ns | 0.79 ± 0.03 |

The time is basically the total running time divided by a constant; the actual numbers are not important. This change reduced the total running time by 21% (much more than the original 9% spent on memmoves, likely because the CPU is stalling a lot less because data dependencies are more transparent). Of course YMMV and for most programs this will not matter at all. But when it does, the gains can be significant!

## Alternatives

* At first I implemented this in `io::Cursor`. I moved it to `&[T]::copy_from_slice()` instead, but this might be too intrusive, especially because it applies to all `T`, not just `u8`. To restrict this to `io::Read`, `<&[u8] as Read>::read()` is probably the best place.
* I tried copying bytes in a loop up to 64 or 8 bytes before calling `Read::read`, but both resulted in about a 20% slowdown instead of speedup.
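A minimal sketch of the optimization described above (the shape is assumed; the actual patch lives inside the standard library's `copy_from_slice` internals):

```rust
use std::ptr;

// Sketch of the small-copy fast path: a one-element copy becomes a plain
// assignment, while larger copies still take the optimized memcpy path.
fn copy_from_slice_small<T: Copy>(dst: &mut [T], src: &[T]) {
    assert_eq!(dst.len(), src.len(), "slice lengths must match");
    if dst.len() == 1 {
        // Avoids the call into libc's memcpy for the common 1-byte read.
        dst[0] = src[0];
    } else {
        unsafe {
            ptr::copy_nonoverlapping(src.as_ptr(), dst.as_mut_ptr(), dst.len());
        }
    }
}

fn main() {
    let src = [42u8];
    let mut dst = [0u8];
    copy_from_slice_small(&mut dst, &src);
    assert_eq!(dst[0], 42);
}
```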
Since I'm playing with this, I figured I'd document some of what I've been seeing.
I've merged rust/rust@1447ce78fbd65a629f228ec8731a5cddc076a15c into the `avr-support` branch. This entailed a few rounds of merge conflict resolution, as well as updating compiler-rt and the LLVM fork. The compiler seems to build, but I've hit some issues with compiling libcore.

Specifically, LLVM hit a few assertions when `SelectionDAG::computeKnownBits` calls `APInt::trunc`, `APInt::sext`, and `APInt::zext`, because it's trying to convert a 16-bit number to a 16-bit number. These functions all expect to convert to a different size; there are methods like `zextOrSelf` that allow the same size. I hacked around that by returning `*this` in all three methods (what could go wrong, right?).

The next attempt failed at an LLVM issue, which I don't have the LLVM understanding to parse yet :-)
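Backing up to the `APInt` hack for a second: purely as an illustration (in Rust, since the actual methods are C++ inside LLVM), the workaround amounts to letting a width-changing conversion tolerate a same-width request instead of asserting:

```rust
// Illustrative analogue of the APInt::trunc hack: same-width requests
// return the value unchanged (the "return *this" workaround) instead of
// tripping the "must change width" assertion.
fn trunc(value: u128, from_width: u32, to_width: u32) -> u128 {
    if to_width == from_width {
        return value; // nothing to do: this is what the hack adds
    }
    assert!(to_width < from_width, "trunc must narrow the value");
    value & ((1u128 << to_width) - 1) // keep the low `to_width` bits
}

fn main() {
    // 16-bit -> 16-bit, the case that fired the assertion in LLVM:
    assert_eq!(trunc(0xBEEF, 16, 16), 0xBEEF);
    assert_eq!(trunc(0xBEEF, 16, 8), 0xEF);
}
```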
I'm going to try some other directions to see if I can get an executable.