Update to newest Rust #14

Closed
shepmaster opened this issue Jan 16, 2016 · 31 comments

@shepmaster
Member

Since I'm playing with this, I figured I'd document some of what I've been seeing.

I've merged rust/rust@1447ce78fbd65a629f228ec8731a5cddc076a15c into the avr-support branch. This entailed a few rounds of merge conflict resolution, as well as updating compiler-rt and our LLVM fork. The compiler seems to build, but I've hit some issues compiling libcore:

./x86_64-apple-darwin/stage1/bin/rustc -C opt-level=2 -Z no-landing-pads --target avr-atmel-none -g ../src/libcore/lib.rs --out-dir libcore-avr/

Specifically, LLVM hit a few assertions when SelectionDAG::computeKnownBits calls APInt::trunc, APInt::sext, and APInt::zext because it's trying to convert a 16-bit number to a 16-bit number. These functions all expect to convert to a different size, and there are methods like zextOrSelf that allow the same size. I hacked around that by returning *this in all three methods (what could go wrong, right?).

The next attempt failed at an LLVM issue:

LLVM ERROR: Cannot select: t35: i8 = mulhs t2, t4
  t2: i8,ch = CopyFromReg t0, Register:i8 %vreg27
    t1: i8 = Register %vreg27
  t4: i8,ch = CopyFromReg t0, Register:i8 %vreg25
    t3: i8 = Register %vreg25
In function: _ZN3num14from_str_radix20h7837196669301488416E

Which I don't have the LLVM understanding to parse yet :-)

I'm going to try some other directions to see if I can get an executable.

@dylanmckay
Member

After optimisation, LLVM converts the IR program into a Directed Acyclic Graph (DAG), where each node is an instruction, its child nodes are its operands, and so on.

The instruction sequence:

%0 = sub i32 1, 5
%1 = mul i32 7, 8
%2 = add i32 %0, %1

Becomes

        sub i32 1, 5
    /
add
    \
        mul i32 7, 8

Or, as it is written in LLVM, (add (sub i32 1, 5), (mul i32 7, 8)).

It then tries to match this DAG against the patterns specified in AVRInstrInfo.td. If a match is not found, the nodes are "expanded" (for example, into two 16-bit add instructions) or "custom lowered" using a hook. It keeps doing this until a match is found, and we then have our mapping from IR instructions to AVR instructions.
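
To make the "expansion" idea concrete, here is a purely illustrative Rust sketch (not LLVM code) of the kind of rewrite legalization performs: an operation the machine lacks, say a 16-bit add on a target with only 8-bit adds, is expressed in terms of operations it does have.

```rust
// Illustrative only: "expanding" a 16-bit add into two 8-bit adds plus a
// carry, the kind of rewrite done when no instruction pattern matches.
fn add16_via_8bit_adds(a: u16, b: u16) -> u16 {
    let (lo, carry) = (a as u8).overflowing_add(b as u8);
    let hi = ((a >> 8) as u8)
        .wrapping_add((b >> 8) as u8)
        .wrapping_add(carry as u8);
    ((hi as u16) << 8) | lo as u16
}
```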

This error is simply saying that during instruction selection, LLVM was not able to find a match for the DAG

mulhs t2, t4

Where

  t2: i8,ch = CopyFromReg t0, Register:i8 %vreg27
    t1: i8 = Register %vreg27

  t4: i8,ch = CopyFromReg t0, Register:i8 %vreg25
    t3: i8 = Register %vreg25

Which can be read as

mulhs COPYFROMREG %vreg27, COPYFROMREG %vreg25

COPYFROMREG is one of the node types specified in this file. There are node types for every IR instruction, plus more machine-specific operations (such as multiplying two numbers and returning a product with twice the precision, which is the behaviour of AVR's mul instructions). It would have been the expansion process that converted the mul instruction into UMUL_LOHI, and then expanded that into mulhs ... because there were no matches.
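
To make those node semantics concrete, here is a small Rust illustration (the function names are mine, purely for explanation): umul_lohi is an 8-bit multiply yielding the full 16-bit product split into low and high bytes, the widening behaviour described above, while mulhs yields only the high byte of the signed product.

```rust
// Illustrative only: the values computed by `umul_lohi` and `mulhs` on i8.
fn umul_lohi_u8(a: u8, b: u8) -> (u8, u8) {
    let product = u16::from(a) * u16::from(b);
    (product as u8, (product >> 8) as u8) // (low byte, high byte)
}

fn mulhs_i8(a: i8, b: i8) -> i8 {
    ((i16::from(a) * i16::from(b)) >> 8) as i8 // high byte of the signed product
}
```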

Basically, this error occurs because the front end generates an instruction sequence for which we haven't yet defined a pattern (or a custom lowering hook): mulhs COPYFROMREG %vreg27, COPYFROMREG %vreg25. It's simply a code path that AVR-LLVM hasn't implemented yet.

@dylanmckay
Member

We want to match UMUL_LOHI inside the pattern for the AVR mul instruction here, and the only pattern defined for that instruction is commented out, which will be the source of our problem.

The fix should be to change the smullohi node ID to umul_lohi. I don't believe the smullohi node exists any longer, which is probably why it is commented out. It was probably renamed to umul_lohi at some point.

Of course, that only fixes the unsigned case; the same pattern needs to be added to the signed multiplication instruction muls, but with smul_lohi instead.

I'll commit it now. Note that I also have an unpushed branch of LLVM trunk. It sounds like you're also working on that - shall I push it?

@shepmaster
Member Author

Thanks for the great explanation. While I have an OK understanding of general compiler workings, I don't have much concrete experience. Being pointed to specific files helps me see where and how some of the changes could be made. I don't want to just report issues; I'd like to at least try to help fix them too 😉.

I'll commit it now. Note that I also have an unpushed branch of LLVM trunk. It sounds like you're also working on that - shall I push it?

I do have a local merged branch, but I don't see any problem in continuing to merge in new stuff you push.

My hope is that I can help weed out some more cases like this as you are in the process of getting things merged into LLVM proper.

@shepmaster
Member Author

I don't know what I'm doing yet, but I can see some patterns and deviations from those patterns...

Why is MULRdRr an FRdRr, but all the other *MUL* instructions FMUL2RdRr or FFMULRdRr? Likewise, why are only MULRdRr and MULSRdRr inside the usesCustomInserter block?

@shepmaster
Member Author

And one more: why do smullohi and umullohi get special entries in TargetSelectionDAG, but signed x unsigned and the fractional ones don't?

@shepmaster
Member Author

Ok, mul, muls, and mulsu all support differing sets of registers. mul supports all of them; the others have hard-coded values in those bit positions of the opcode, which explains the difference there.

@dylanmckay
Member

LLVM places pretty much all of its target descriptions inside TableGen files, a DSL that compiles down to C++ code in files with the .inc extension. Conveniently, it supports inheritance.

FRdRr is the format of most instructions which take two 8-bit GPRs. MULRdRr is one of these.

For whatever reason, the Atmel engineers gave the other multiplication instructions a slightly different encoding (there are several cases of this, sadly), so we need to work around it.

In case you want to have a look, here is the instruction set manual.

@shepmaster
Member Author

here is the instruction set manual.

Yeah, I was looking at the datasheet to start with, which has an abbreviated view of the instructions and doesn't make note of the restrictions on source registers. I've since found the same detailed reference you posted. :-)

@dylanmckay
Member

Why are only MULRdRr and MULSRdRr inside the usesCustomInserter block?

From memory, they always place their product in the register pair r1:r0. r1 is expected to be zero (per the calling convention, I think), so the custom inserter clears it after the product has been copied out.

The other multiplication instructions allow you to specify the output register, I believe. (Wrong; see below.)

@shepmaster
Member Author

The other multiplication instructions allow you to specify the output register I believe.

If I'm reading the docs correctly, I think all 6 *mul* instructions place the results in r1:r0.

@dylanmckay
Member

If I'm reading the docs correctly, I think all 6 mul instructions place the results in r1:r0.

Ah, you're right. It's probably missing because we don't currently generate the other instructions (no patterns yet); it still should be there, though.

@shepmaster
Member Author

we don't currently generate the other instructions yet (no patterns),

Which probably explains why only smullohi and umullohi are in TargetSelectionDAG — the others can't be generated!

@dylanmckay
Member

Yeah TargetSelectionDAG nodes are pretty much added whenever someone needs one.

@shepmaster
Member Author

I'll commit it now. Note that I also have an unpushed branch of LLVM trunk. It sounds like you're also working on that - shall I push it?

Just to make sure, you haven't yet pushed this, correct? If you have, I'm not sure where — I'm checking the branch list to see when it pops up.

@dylanmckay
Member

No, I haven't pushed this. I'm hitting an assertion error which I'm working on.

@shepmaster
Member Author

an assertion error

Assertion failed: (Ops.size() >= NumSrcResults && "Didn't provide enough results"), function EmitResultCode, by any chance? I see that SDTIntBinHiLoOp is defined as taking two arguments and returning two results, and I see a bunch of code for the normal divrem methods, but I haven't yet worked out where the missing code for mullohi should go.

@dylanmckay
Member

Yeah I'm getting the same problem. I've just posted on the llvm-dev mailing list.

@shepmaster
Member Author

I've just posted on the llvm-dev mailing list.

I'm following along there as best I can ;-)

A bit of bright news: I got an LED to light up on a 100% / 50% / 1% duty cycle. Notably, that doesn't require any math to accomplish.

In the meanwhile, I've copied out large parts of libcore directly into my code and been able to set up a timer interrupt handler in Rust that toggles the LED every ~64ms. This has the interesting problem that if I try to extract the subset of libcore into another crate, I get this error:

Impossible reg-to-reg copy
UNREACHABLE executed at avr-rust/src/llvm/lib/Target/AVR/AVRInstrInfo.cpp:62!

However, if I keep all the code in one file, it compiles fine...

@dylanmckay
Member

Looks really good :)

Yes, that's a strange error...

Feel free to create an issue for that, because we're doing something wrong. If you post code, make sure to pass --emit=llvm-ir to rustc. An interesting tool included in LLVM is bugpoint, which can automatically minimize an IR assembly test case (in some cases). There are some good blog posts about it.

Basically, we can pass the IR file generated by Rust into the llc tool, which is equivalent to compiling it directly with rustc.

@shepmaster
Member Author

Ah great, I knew there was a way to use the LLVM IR to finish the compilation, but didn't know what it was. And bugpoint sounds useful, as I'm sure there's a whole bunch of noise in this code with 1/3 of libcore in it.

@dylanmckay
Member

llc -march=avr -mcpu=atmega328p -filetype=obj <in> -o <out> should work. Note that you can also pass -filetype=asm to view AVR assembly.

bugpoint sometimes works really well, but sometimes it does pretty much nothing. If you set up a bash script to simply call out to llc, you can pass it into bugpoint and it'll keep minimizing, while ensuring the code it creates still generates the same error.

@shepmaster
Member Author

Well, bugpoint seems to have reduced to a different error:

Assertion failed: (InVals.size() == Ins.size() && "LowerFormalArguments didn't emit the correct number of values!")

So that probably means I'm using it incorrectly.

@dylanmckay
Member

I just remembered that bugpoint doesn't actually check the output of the script it runs, just whether it passed or failed. You're likely using it correctly, and now have found two bugs ;)


@shepmaster
Member Author

I'm using it incorrectly.

Yup. I was letting it fail on any failure, not just the one I wanted. Now I've got a handle on it and submitted. I guess I should try to find the other failure too.

@shepmaster
Member Author

find the other failure too.

Done.

@shepmaster
Member Author

And I went ahead and opened another issue for the original multiplication issue.

@dylanmckay
Member

I've just updated LLVM to master (this requires updating LLVM, clang, and compiler-rt, and ensuring they all work together).

I left an avr-rust branch on avr-llvm/llvm which points to the avr-support branch you were using previously, although the specific commit is likely stored in your .gitmodules file, so it shouldn't be a problem.

@dylanmckay
Member

@shepmaster can you give me the IR for the original multiplication issue of this thread (preferably bugpoint'ed, but no worries if not)? If you look at test/CodeGen/AVR/mul.ll (which passes), we currently do support multiplication, so there must be another underlying issue.

@shepmaster
Member Author

so it shouldn't be a problem.

Yup, not a problem. I merged avr-support and avr-rust-support locally to get the newest stuff, plus some hacks of my own. Whenever there's progress I should be able to do the same again. Or did you want me to try updating and see if errors had changed?

the IR for the original multiplication

Good point. I've updated that issue.

@shepmaster
Member Author

I've just updated LLVM to master

I went ahead and updated. While LLVM itself seemed to compile successfully, I'm getting errors in the Rust wrapper around LLVM. That should be exciting to try to fix:

avr-rust/src/rustllvm/RustWrapper.cpp:351:26: error: no matching member function for call to 'createFunction'
    return wrap(Builder->createFunction(
                ~~~~~~~~~^~~~~~~~~~~~~~
avr-rust//src//llvm//include/llvm/IR/DIBuilder.h:520:19: note: candidate function not viable: requires at most 13 arguments, but 14 were provided
    DISubprogram *createFunction(DIScope *Scope, StringRef Name,
                  ^
avr-rust//src//llvm//include/llvm/IR/DIBuilder.h:540:19: note: candidate function not viable: requires at most 13 arguments, but 14 were provided
    DISubprogram *createFunction(DIScopeRef Scope, StringRef Name,
                  ^
avr-rust/src/rustllvm/RustWrapper.cpp:834:17: error: no member named 'LinkModules' in 'llvm::Linker'
    if (Linker::LinkModules(Dst, Src->get(), [&](const DiagnosticInfo &DI) { DI.print(DP); })) {
        ~~~~~~~~^

I expect I'll have to dig through the LLVM changes to see what needs to be modified.

@dylanmckay
Member

Closing due to this being super stale

shepmaster pushed a commit that referenced this issue Dec 1, 2016
Add small-copy optimization for copy_from_slice

## Summary

During benchmarking, I found that one of my programs spent between 5 and 10 percent of the time doing memmoves. Ultimately I tracked these down to single-byte slices being copied with a memcopy. Doing a manual copy if the slice contains only one element can speed things up significantly. For my program, this reduced the running time by 20%.
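
A minimal sketch of the idea (written here as a free function with hypothetical names; the actual libcore change may differ in detail):

```rust
use std::ptr;

// Sketch of the small-copy fast path: for a single element, a plain
// assignment avoids the overhead of dispatching to memcpy.
fn copy_from_slice_fast<T: Copy>(dst: &mut [T], src: &[T]) {
    assert_eq!(dst.len(), src.len(), "source and destination must have equal length");
    if dst.len() == 1 {
        dst[0] = src[0];
    } else {
        unsafe {
            ptr::copy_nonoverlapping(src.as_ptr(), dst.as_mut_ptr(), dst.len());
        }
    }
}
```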

## Background

I am optimizing a program that relies heavily on reading a single byte at a time. To avoid IO overhead, I read all data into a vector once, and then I use a `Cursor` around that vector to read from. During profiling, I noticed that `__memmove_avx_unaligned_erms` was hot, taking up 7.3% of the running time. It turns out that these were caused by calls to `Cursor::read()`, which calls `<&[u8] as Read>::read()`, which calls `&[T]::copy_from_slice()`, which calls `ptr::copy_nonoverlapping()`. This one is implemented as a memcopy. Copying a single byte with a memcopy is very wasteful, because (at least on my platform) it involves calling `memcpy` in libc. This is an indirect call when libc is linked dynamically, and furthermore `memcpy` is optimized for copying large amounts of data at the cost of a bit of overhead for small copies.
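
For reference, a self-contained example of the access pattern described above (illustrative names; not the benchmarked program):

```rust
use std::io::{Cursor, Read};

// Read one byte at a time from an in-memory buffer. Each `read` call goes
// through `<&[u8] as Read>::read`, which copies via `copy_from_slice`.
fn sum_bytes(data: Vec<u8>) -> u64 {
    let mut cursor = Cursor::new(data);
    let mut byte = [0u8; 1];
    let mut sum = 0u64;
    while cursor.read(&mut byte).unwrap() == 1 {
        sum += u64::from(byte[0]);
    }
    sum
}
```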

## Benchmarks

Before I made this change, `perf` reported the following for my program. I only included the relevant functions, and how they rank. (This is on a different machine than where I ran the original benchmarks. It has an older CPU, so `__memmove_sse2_unaligned_erms` is called instead of `__memmove_avx_unaligned_erms`.)

```
#3   5.47%  bench_decode  libc-2.24.so      [.] __memmove_sse2_unaligned_erms
#5   1.67%  bench_decode  libc-2.24.so      [.] memcpy@GLIBC_2.2.5
#6   1.51%  bench_decode  bench_decode      [.] memcpy@plt
```

`memcpy` is eating up 8.65% of the total running time, and the overhead of dispatching to a specialized fast copy function (`memcpy@GLIBC` showing up) is clearly visible. The price of dynamic linking (`memcpy@plt` showing up) is visible too.

After this change, this is what `perf` reports:

```
#5   0.33%  bench_decode  libc-2.24.so      [.] __memmove_sse2_unaligned_erms
#14  0.01%  bench_decode  libc-2.24.so      [.] memcpy@GLIBC_2.2.5
```

Now only 0.34% of the running time is spent on memcopies. The dynamic linking overhead is not significant at all any more.

To add some more data, my program generates timing results for the operation in its main loop. These are the timings before and after the change:

| Time before   | Time after    | After/Before |
|---------------|---------------|--------------|
| 29.8 ± 0.8 ns | 23.6 ± 0.5 ns |  0.79 ± 0.03 |

The time is basically the total running time divided by a constant; the actual numbers are not important. This change reduced the total running time by 21% (much more than the original 9% spent on memmoves, likely because the CPU is stalling a lot less because data dependencies are more transparent). Of course YMMV and for most programs this will not matter at all. But when it does, the gains can be significant!

## Alternatives

* At first I implemented this in `io::Cursor`. I moved it to `&[T]::copy_from_slice()` instead, but this might be too intrusive, especially because it applies to all `T`, not just `u8`. To restrict this to `io::Read`, `<&[u8] as Read>::read()` is probably the best place.
* I tried copying bytes in a loop up to 64 or 8 bytes before calling `Read::read`, but both resulted in about a 20% slowdown instead of speedup.