clang: _mm512_reduce_add_ps lowers to LLVM IR that does not reflect correct reduce order #82813
@llvm/issue-subscribers-backend-x86 Author: Ralf Jung (RalfJung)
This
```C
#include <immintrin.h>
float foo(__m512 x) {
return _mm512_reduce_add_ps(x);
}
```
[produces](https://godbolt.org/z/qera4378s)
```llvm
define dso_local noundef float @foo(float vector[16])(<16 x float> noundef %x) local_unnamed_addr #0 {
entry:
%0 = tail call reassoc noundef float @llvm.vector.reduce.fadd.v16f32(float -0.000000e+00, <16 x float> %x)
ret float %0
}
```
According to the [LangRef](https://llvm.org/docs/LangRef.html#fast-math-flags), the `reassoc` here means that the addition may happen in *any* order, which is not what [Intel documents](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm512_reduce_add_ps&expand=133&ig_expand=5303) -- they specify a particular, "tree-like" order.
Even worse, we can chain two of these operations:
```C
#include <immintrin.h>
float foo(__m512 x) {
  float xr = _mm512_reduce_add_ps(x);
  __m512 y = _mm512_set_ps(
    xr, 1.8, 9.3, 0.0, 2.5, 0.0, 6.7, 9.0,
    0.0, 1.8, 9.3, 0.0, 2.5, 0.0, 6.7, 9.0
  );
  return _mm512_reduce_add_ps(y);
}
```
Now the second addition may be arbitrarily re-associated with the first one. As far as I understand, there's nothing about `reassoc` that constrains the re-association to only happen "inside" a single operation (and indeed, as a fast-math flag it is explicitly intended to apply when multiple subsequent operations are all `reassoc`). `_mm512_reduce_add_ps` should probably either use a vendor-specific intrinsic, or LLVM IR needs a version of `vector.reduce.fadd` that explicitly specifies the "tree-like" reduction order documented by Intel. |
Presumably this applies to How about IR intrinsics Oh, there's also |
Do you get an unexpected result compared to the document? We lower this intrinsic in "tree-like" order in the backend when |
I haven't checked. It doesn't matter. LLVM IR is an abstraction in its own right with its own semantics, and a lot of non-trivial infrastructure relying on those semantics. The clang frontend generates LLVM IR that does not properly capture its intent. It may happen to be the case that the backend currently does the right thing and no optimization pass disturbs these simple examples, but that's a fragile situation to rely on. It's akin to generating LLVM IR with UB and arguing "yeah but current optimizations don't exploit this UB so it's fine". For instance, an LLVM IR pass could just remove |
The In a word, allowing targets to choose their most efficient lowering with |
We don't want the "most efficient" lowering, we want the exact lowering Intel documents. Maybe that's the same today, but who knows what happens in the future. Your argument also relies on fast-math passes never using So, overall it sounds like you are suggesting to overload |
Looking back at #46850 when this was last discussed, the Intel docs then described the reduction as a scalar expansion:
but it now describes it as this which matches the llvm expansion:
|
Wait, Intel changed their spec? oO |
I'm not sure (it was 3 years ago!), I'm going off my comments in the previous issue. |
We switched the implementation from an x86-specific intrinsic to the LLVM reduction intrinsic at the time, but the behavior never changed. The document didn't describe the "tree-like" reduction precisely, so we updated the doc too.
They do not conflict. "Most efficient" means different lowerings for different targets; for X86, it is fixed at the moment. It's true we may introduce new instructions with different orders in the future, and they may share the same interface. But that is only a hypothetical future problem.
I don't see the flag generating unexpected results across multiple "reduce" operations, though
No, I didn't mean it has a second meaning. We use |
You are not.
As I said above, that doesn't matter. It just means today's optimizations don't do this. But the point of LLVM IR having a LangRef is that we can add more optimizations without having to review every frontend and backend again. We only have to make sure the optimizations are allowed according to the LangRef. And the LangRef says that re-associating multiple calls to I don't think Alive supports enough SIMD and fast-math to be useful here, but if it did, it would say that optimizing the program to arbitrarily reorder the constants I put in |
Again, they do not conflict. X86 always choosing the tree-shaped order conforms to the
I can't find it in the LangRef. Can you point me to it?
I assume you may expect some transformation like from
to
I don't think it can happen in theory. Note, the |
There's a conflict between the frontend and LLVM IR LangRef. The backend is fine and irrelevant here, any discussion of what the backend does is moot. Unless the LangRef says so, the backend can't "reach through" directly to the frontend. It is important that optimizations are aware of any place where such a "reach through" happens, since they have to then be careful with how they do their transformations. LangRef says that
When I do
|
I don't think it's reasonable for any fast math flag to be set from purely the context of coming from a target intrinsic like this |
It looks like this is where the confusion is coming from. The FMF flags have distinct scopes for different flags. E.g., I don't think we changed the meaning of
The IR-associated fast-math flags are fine-grained. They are intended to be used locally. A few of these flags being set doesn't mean the whole IR was compiled under a fast-math option.
We have a few target intrinsics used in this way already. I don't think there's any problem, given that fast math flags have fine-grained granularity on these instructions. |
This isn't really accurate. The flag is for that instruction only. Traditionally the usage of reassoc in practice would infect neighboring instructions, but #71277 starts moving towards fixing this.
And I already take issue with those, they should all be removed. |
Which part of the documentation do you base this on? If it's not in the documentation then either way the LangRef needs to be updated. (Though I agree with @arsenm that ideally fast-math flags would just not be used here.) Even if
```C
#include <immintrin.h>
float foo() {
  __m512 y = _mm512_set_ps(
    42.42, 1.8, 9.3, 0.0, 2.5, 0.0, 6.7, 9.0,
    0.0, 1.8, 9.3, 0.0, 2.5, 0.0, 6.7, 9.0
  );
  return _mm512_reduce_add_ps(y);
}
``` |
Do you mean #71277 or another one? IIUC, #71277 is a bug in InstCombine. I don't see what the problem is with using fast math flags in target intrinsics.
That's my understanding. @arsenm rephrased it more accurately. I agree that's something we should improve in the documentation. But I'm afraid I cannot give a precise description.
I don't think the LangRef is a good place to describe target-specific behavior.
Both parts 1 and 3 are target-specific code; we can guarantee it won't be broken by other code.
I don't think anything here is unclear enough to need extra documentation, except for
It cannot happen given |
Yes, the bug is permissive fast math flag handling. It only considers reassoc on one instruction and allows it to infect its neighbors.
The builtin does not mean "go fast" it means give me the behavior of this instruction. This is not license for any relaxed handling. |
In fact, the |
This does not mean that _mm512_reduce_add_ps implies it can use a relaxed order by default |
_mm512_reduce_add_ps as a target intrinsic can have specific reduction order; |
This is an overly clever recycling of the reassoc bit. Other code relying on treating it as a normal FMF flag will break. What is supposed to happen when a transform tries to reassociate the reduction with another operation? More importantly, fast math flags can be freely dropped. Relying on this to get a target-specific lowering makes it semantically load-bearing, which is not OK |
(Strangely Github does not seem to send email notifications for @phoebewang's messages... is anyone else having that same problem? Their earlier posts triggered notifications just fine, but the recent ones did not. I am still getting notifications for other people in this thread. Very strange.)
Any target-specific behavior that optimizations must be aware of must be in the LangRef. I agree that it is not good when that happens, LangRef and LLVM IR in general should have target-independent behavior. That's why there shouldn't be target-specific behavior to begin with. But you are the one suggesting that But of course the better fix, IMO, is to not have target-specific behavior in the first place and to not use |
I'm not strongly objecting to turning the reassoc bit into a boolean parameter, but I still think the reassoc semantics perfectly match the functionality of
We don't have a definition of reassociated reduction. We use
That's the opposite of my memory. I recall we treated drops of fast math flags as bugs and fixed them several years ago. Not sure if there are still unsolved issues.
That's not true. I explained we use the natural semantics of |
This is buggy by design, and these bugs will recur. This is special casing increasing complexity. It's a deeper truth that FMF flags are droppable optimization hints than this forgettable note on a specific intrinsic |
We have a definition, it's even in the LangRef: it means the addition can happen in any order.
You claimed that but your claim doesn't match up with the docs.
No. It means that the program has the semantics "pick an arbitrary order". There's a big difference between
The LangRef doesn't say that the target can pick an order, the LangRef says that the order is not preserved. The meaning of this is unambiguous: any LLVM IR pass may change the order any way it pleases.
It is not logically possible to say that there are target-specific semantics in the frontend and backend but not in the middle. Semantics flow from frontend to IR and from IR to backend. Anything that's not captured in the IR is lost and cannot be recovered. If I (the frontend) give a friend of mine (the middle-end) a bunch of numbers and say "please add them in any order", and then that friend shuffles the numbers and then goes to their friend (the backend) and says "please add them in any order", and then the backend adds them in tree order -- then my friend perfectly satisfied the request that I made. I can't later say "oh but when I said 'in any order' I meant 'in the same order as what the backend would do'". If that's what I want, I have to say it! |
Trying to make things more concrete: Consider a snippet of the form
Do you agree to one of these statements? Both? If you agree to both you must agree that there is a bug. If you disagree with one then I would be curious on which basis. |
I hold the opposite view. It's unambiguously not allowed. Two reasons:
But only limited to a few operations and within the operation's own scope.
That's based on your assumption that the middle-end can shuffle it. I have proved it is not allowed at the beginning.
To be specific, I'm saying packaging
I'm not the initial author of the intrinsic, but I understand the goal is to provide a fast implementation of reduction operations. The order was never an important factor. One proof is that we even described it in sequential order in the document for years.
Calling the whole FMF design buggy is beyond the topic we are discussing here. |
I strongly disagree this is proven not allowed. My interpretation of the langref is reassoc on the vector.reduce implies the lowered form of the intrinsic is a sequence of fadds, each with the reassoc set. It is correct to apply this transformation at any point, and the IR is not simply a vehicle for delivering
This is absolutely not how the IR works. The IR has a meaning standalone, independent of any frontend usage or backend lowering decisions. This is a fundamental aspect of the IR in a modular compiler.
Yes, precisely. Optimization hints are not semantically enforceable
This is not a request for an optimized reduction, it's a request for the instruction by name I insist several points:
|
I stand by my points.
Do you mean Let's consider two scenarios:
You cannot make We either need to introduce a mechanism to distinguish them, or not support 2) at all. Note, turning 2) into 1), or even turning 1) into sequential
This is a single vector type within a single intrinsic. How can a transformation change element order without extra operations like shuffles? How a It turns out it's a safe vehicle, given we proved transformations can do nothing with it.
If we go back to the |
The docs say it is allowed. I don't know what else there is to say here. Why do you think you can just ignore the docs?
No I am not. The
They make things fast by allowing the middle-end to arbitrarily reorder arithmetic operations.
They affect only the operations that carry the flags. I can have other float ops in the same function that are guaranteed to still have the standard IEEE semantics. @arsenm wrote
Absolutely. That seems to be the fundamental misunderstanding here. @phoebewang does not accept the IR as an abstraction, a language in its own right. However, treating it as such is the only way to keep an ecosystem like LLVM together. That's why the LangRef exists in the first place. It is the only authority in a question like this. And it is unambiguous in this instance. @phoebewang, you keep making claims that are entirely unsubstantiated by the LangRef. I don't know why you think that would be a valid argument. It is not. I don't think a change to the LangRef that changes But it also doesn't seem like this discussion is going anywhere, we're just repeating the same statements over and over. I hope an intrinsic/flag for "tree reduction" can be added, then Rust will use it -- whether clang picks that up is up to them. |
It happens to me this time :)
Alright, let's literally interpret the LangRef. In the Fast-Math Flags section, it says:
It's unambiguous that the middle-end can do optimizations among only these 10 operations, including call. For the call operation, it says:
It's not unambiguous that the middle-end can do optimizations on the call instruction itself, or on (identifying and optimizing) the functions/intrinsics it calls. So it looks to me the question is just based on unsubstantiated claims from the LangRef that the middle-end can do I don't know if in practice there are such optimizations doing the latter, but it's not important. We just need to list these intrinsics together with such operations to make the doc clear. That said, we don't have to declare
The middle-end is not the only consumer of fast math flags. They are consumed a lot by the backend too.
That should have been an independent thing, and I'm not opposed to it. |
reassoc means reassoc for any FPMathOperator, which is open to interpretation of what reassoc means for every operation. The LangRef does not spell out what it means for every single opcode. The fact that reassoc tries to call out a specific behavior in this one case is problematic. The LangRef can hand wave about this case, but that doesn't mean it's a good or workable design. I think the reduce reassoc flag treatment just needs to be dropped. It's incompatible with the design of fast math flags. It doesn't compose well with neighboring instructions after lowering, and cannot be semantics bearing. These intrinsics would need to grow an additional operand for the reduction ordering or similar (which could then have different interpretations of what reassoc+ordering type interact during lowering) |
It just defers ambiguous interpretations to the backends in this way. There's no blacklist mechanism to stop emitting a
So backends now face the puzzle of how to honor I think we should review the semantics of fast math flags by how they are being used. Given that both the middle-end and the backend are consumers, I think they should have two basic semantics for the middle-end:
We don't need to distinguish them for OTOH, some called functions/intrinsics do need some "action related" flags, e.g., In a word, I think we cannot assume the middle-end can arbitrarily consume fast math flags assigned to |
They have to have one semantic meaning. Special casing the meaning is simply a bad design
No. There is nothing special about call. It's simply an alternative representation of an IR operation. It cannot have special rules for common flags
They aren't transparent, the entire point of intrinsics is they have compiler known behavior. Trying to special case call simply breaks fast math flags. They must have uniform interpretation |
It's specific to the characteristics of fast math flags and call instructions. We can explain it in the LangRef. And speaking of bad design, it is already bad to mix "value related" flags with "action related" ones in the same group. I think it would be clearer if we had two groups of flags defined separately.
It's special when combining call with fast math flags. It's undocumented, but I'll reaffirm my previous interpretations:
As I explained above, the exceptions result from LLVM lacking a mechanism to represent a scope of multiple If we interpret fast math flags this way, it is natural Furthermore, it implies the middle-end only has limited permission to optimize fast math flags, considering that general middle-end optimizations work across instructions. This is not to say the middle-end won't do optimizations like transforming
The third bullet looks controversial, but I have proved it is guaranteed in theory.
I interpreted it from the backend's perspective. By transparent, I mean the middle-end either bypasses it or otherwise transforms it into a different one. So when the backend sees an intrinsic, it must have been transparent to the middle-end. Another reason is that call means all intrinsics follow the same rule, but that is not true for fast math flags. Most intrinsics don't support fast math flags; besides, some intrinsics like |