Skip to content

[MacroFusion][RISCV] Make commutable instructions fusible as possible #85214

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wangpc-pp opened this issue Mar 14, 2024 · 4 comments
Closed

Comments

@wangpc-pp
Copy link
Contributor

wangpc-pp commented Mar 14, 2024

In #82738, the user raised an issue about commutable instructions.
I have created a PR (#82751) to fix it, but it comes another problem: what if the μ-arch can fuse the instruction pair only if rs1 matches?
For example, for slli+add fusion (see https://godbolt.org/z/xo6xEoPKe):

slli r0, r0, 1
add r0, r0, r1 // fusible

slli r0, r0, 1
add r0, r1, r0 // not fusible

The μ-arch may not check rs2, but add is commutable, we can't guarantee that r0 will always be in rs1.
In InstCombine, we have InstCombinerImpl::SimplifyAssociativeOrCommutative:

/// This performs a few simplifications for operators that are associative or
/// commutative:
///
///  Commutative operators:
///
///  1. Order operands such that they are listed from right (least complex) to
///     left (most complex).  This puts constants before unary operators before
///     binary operators.

But do we need or have a way in backend to change the operands order of second instruction when it's commutable?
cc @dtcxzyw @topperc

@dtcxzyw
Copy link
Member

dtcxzyw commented Mar 14, 2024

@dtcxzyw
Copy link
Member

dtcxzyw commented Mar 14, 2024

the user raised a issue about commutable instuctions.

TBH it would be easy to support it in the hardware. I am not sure whether it is a real issue if the user is not a RTL design engineer.

@poemonsense
Copy link

A few comments, personally as a XiangShan developer:

  1. For communtable operations, checking both rs1 and rs2 brings minor overhead to the uarch (one more 5-bit comparison, plus an OR logic for the fusion detection, and one more Mux for either the operand index or value). However, the minor overhead may bring performance improvements if the compiler does schedule the operand to both rs1 and rs2. Also, the decode logic generally isn't the timing/area bottleneck for most RISC CPUs. So I think most uarch designers would choose to check both rs1 and rs2 for this tradeoff between potential performance and hardware complexity.

  2. If every compiler schedules the operand to a fixed position, either rs1 or rs2, it then becomes unnecessary for the hardware uarch to check both rs1 and rs2. I'm not sure whether this is the case now. However, if the compiler does schedule it statically in rs1 or rs2, probably the compiler community should let the hardware community know to avoid over-design in uarch.

  3. For non-commutative operations, the overhead may be a little bit more significant. But I don't think this would be the normal cases, where rs1 and rs2 can both expected to be fused, because these operations normally have their meanfully usages in real use cases, such as address computation.

@wangpc-pp
Copy link
Contributor Author

wangpc-pp commented Mar 14, 2024

Thanks for your valuable reply!
This issue came to my mind when I was implementing #82751, and actually, I don't know if there is a real hardware implementation that doesn't check rs2 (kick me if there is).

However, if the compiler does schedule it statically in rs1 or rs2

No, the compiler can't control it all. So, there is no need to worry about over-design. :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants