
clang: _mm512_reduce_add_ps lowers to LLVM IR that does not reflect correct reduce order #82813


Open
RalfJung opened this issue Feb 23, 2024 · 37 comments


@RalfJung
Contributor

This

#include <immintrin.h>
float foo(__m512 x) {
    return _mm512_reduce_add_ps(x);
}

produces

define dso_local noundef float @foo(float vector[16])(<16 x float> noundef %x) local_unnamed_addr #0 {
entry:
  %0 = tail call reassoc noundef float @llvm.vector.reduce.fadd.v16f32(float -0.000000e+00, <16 x float> %x)
  ret float %0
}

According to the LangRef, the reassoc here means that the addition may happen in any order, which is not what Intel documents -- they specify a particular, "tree-like" order.
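For reference, here is a minimal C sketch of that tree-like order, transcribed from the Intel pseudocode RKSimon quotes further down in this thread. A plain float[16] stands in for the __m512 value; this only illustrates the documented association order, not clang's actual lowering.

#include <string.h>

/* Tree-like order per Intel's REDUCE_ADD pseudocode: repeatedly fold the
   upper half of the active lanes into the lower half. */
static float reduce_add_tree16(const float lanes[16]) {
    float tmp[16];
    memcpy(tmp, lanes, sizeof(tmp));
    for (int len = 8; len >= 1; len /= 2)
        for (int j = 0; j < len; ++j)
            tmp[j] = tmp[j] + tmp[j + len];
    return tmp[0];
}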

Even worse, we can chain two of these operations:

#include <immintrin.h>
float foo(__m512 x) {
    float xr = _mm512_reduce_add_ps(x);
    __m512 y = _mm512_set_ps(
        xr, 1.8, 9.3, 0.0, 2.5, 0.0, 6.7, 9.0,
        0.0, 1.8, 9.3, 0.0, 2.5, 0.0, 6.7, 9.0
    );
    return _mm512_reduce_add_ps(y);
}

Now the second addition may be arbitrarily re-associated with the first one. As far as I understand, there's nothing about reassoc that constrains the re-association to only happen "inside" a single operation (and indeed, as a fast-math flag it is explicitly intended to apply when multiple subsequent operations are all reassoc).
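To make that concrete, here is a hypothetical C rendering (ignoring the -0.0 start values and other signed-zero subtleties) of one result a reassoc-respecting optimizer could legally produce for the chained example: the lanes of x and the constants from _mm512_set_ps summed in a single arbitrary, interleaved order. This is only an illustration of what the IR semantics permit, not something clang emits today.

#include <immintrin.h>

/* Hypothetical: one of the many evaluation orders the reassoc semantics
   would allow for the chained example above. */
float foo_any_order(__m512 x) {
    const float consts[15] = { 1.8f, 9.3f, 0.0f, 2.5f, 0.0f, 6.7f, 9.0f,
                               0.0f, 1.8f, 9.3f, 0.0f, 2.5f, 0.0f, 6.7f, 9.0f };
    float lanes[16];
    _mm512_storeu_ps(lanes, x);
    float acc = 0.0f;
    for (int i = 0; i < 15; ++i)
        acc += lanes[i] + consts[i];   /* interleave addends from both reductions */
    return acc + lanes[15];
}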

_mm512_reduce_add_ps should probably either use a vendor-specific intrinsic, or LLVM IR needs a version of vector.reduce.fadd that explicitly specifies the "tree-like" reduction order documented by Intel.

@github-actions bot added the clang label (Clang issues not falling into any other category) on Feb 23, 2024
@dtcxzyw added the backend:X86, vectorizers, and floating-point (Floating-point math) labels and removed the clang label on Feb 23, 2024
@llvmbot
Member

llvmbot commented Feb 23, 2024

@llvm/issue-subscribers-backend-x86

Author: Ralf Jung (RalfJung)


@anematode
Contributor

anematode commented Feb 23, 2024

Presumably this applies to mul as well, and all masked variants thereof.

How about IR intrinsics reduce_tree.(fadd|fmul).v[n]f(16|32|64)? I could see this behavior being useful for other arches in the future, since this is the lowest-latency method of reducing floats. Also, under appropriate fast math flags, reduce_tree could become a plain reduce with reassoc.

Oh, there's also reduce_min and reduce_max intrinsics that I suppose aren't commutative either due to NaN inputs. So that would be 4 new IR intrinsics.

@phoebewang
Contributor

Do you get an unexpected result compared to the document? We lower this intrinsic in "tree-like" order in the backend when reassoc is set, so we intentionally always generate this flag for _mm512_reduce_add_ps.

@RalfJung
Contributor Author

RalfJung commented Feb 24, 2024

Do you get unexpected result against the document?

I haven't checked. It doesn't matter. LLVM IR is an abstraction in its own right with its own semantics, and a lot of non-trivial infrastructure relying on those semantics. The clang frontend generates LLVM IR that does not properly capture its intent. It may happen to be the case that the backend currently does the right thing and no optimization pass disturbs these simple examples, but that's a fragile situation to rely on. It's akin to generating LLVM IR with UB and arguing "yeah but current optimizations don't exploit this UB so it's fine".

For instance, an LLVM IR pass could just remove reassoc. That would be an entirely correct transformation according to the LangRef. Just because currently no pass does this, doesn't mean no pass will do it in the future.

@phoebewang
Contributor

Do you get unexpected result against the document?

I haven't checked. It doesn't matter. LLVM IR is an abstraction in its own right with its own semantics, and a lot of non-trivial infrastructure relying on those semantics. The clang frontend generates LLVM IR that does not properly capture its intent. It may happen to be the case that the backend currently does the right thing and no optimization pass disturbs these simple examples, but that's a fragile situation to rely on. It's akin to generating LLVM IR with UB and arguing "yeah but current optimizations don't exploit this UB so it's fine".

For instance, an LLVM IR pass could just remove reassoc. That would be an entirely correct transformation according to the LangRef. Just because currently no pass does this, doesn't mean no pass will do it in the future.

The reassoc flag has a special meaning for the reduce intrinsics. IIRC, it is intended to let the backend choose the best-performing lowering, because different targets have different association methods. A tree-like association performs better on X86 but may not on other targets.

In a word, allowing targets to choose their most efficient lowering via reassoc is one of the design goals of the reduce intrinsics. And it is impossible for a pass to remove reassoc if we believe passes never perform negative optimizations, though we could solve the problem by turning the reassoc flag into a bool parameter of the reduce intrinsics if necessary.

@RalfJung
Contributor Author

RalfJung commented Feb 24, 2024

In a word, allowing targets to choose their most efficient lowering with reassoc is one of the design goals of reduce intrinsics.

We don't want the "most efficient" lowering, we want the exact lowering Intel documents. Maybe that's the same today but who knows what happens in the future. And reassoc doesn't document that it will definitely do any particular lowering.

Your argument also relies on fast-math passes never using reassoc to actually re-associate multiple of these "reduce" operations, which they would totally be allowed to do.

So, overall it sounds like you are suggesting to overload reassoc to give it a second meaning, quite distinct from the meaning documented in LangRef. Why is that preferable over having a dedicated flag that avoids any confusion due to overloading?

@RKSimon
Collaborator

RKSimon commented Feb 24, 2024

Looking back at #46850 when this was last discussed, the Intel docs then described the reduction as a scalar expansion:

float _mm512_reduce_add_ps (__m512 a)

dst[31:0] := 0.0
FOR j := 0 to 15
i := j*32
dst[31:0] := dst[31:0] + a[i+31:i]
ENDFOR

but they now describe it as follows, which matches the LLVM expansion:

DEFINE REDUCE_ADD(src, len) {
	IF len == 2
		RETURN src[31:0] + src[63:32]
	FI
	len := len / 2
	FOR j:= 0 to (len-1)
		i := j*32
		src[i+31:i] := src[i+31:i] + src[i+32*len+31:i+32*len]
	ENDFOR
	RETURN REDUCE_ADD(src[32*len-1:0], len)
}
dst[31:0] := REDUCE_ADD(a, 16)

@RalfJung
Contributor Author

Wait, Intel changed their spec? oO

@RKSimon
Collaborator

RKSimon commented Feb 24, 2024

I'm not sure (it was 3 years ago!), I'm going off my comments in the previous issue.

@phoebewang
Contributor

We switched the implementation from an X86-specific intrinsic to the LLVM reduction intrinsic at the time, but the behavior never changed. The document didn't describe the "tree-like" reduction precisely, so we updated the doc too.

We don't want the "most efficient" lowering, we want the exact lowering Intel documents. Maybe that's the same today but who knows what happens in the future. And reassoc doesn't document that it will definitely do any particular lowering.

They don't conflict. "Most efficient" means a different lowering for different targets; for X86, it is constant at the moment. It's true we may introduce new instructions with different orders in the future, and they may share the same interface, but that is only a hypothetical future problem.

Your argument also relies on fast-math passes never using reassoc to actually re-associate multiple of these "reduce" operations, which they would totally be allowed to do.

I don't see the flag generating unexpected results with multiple "reduce" operations, from O0 through Ofast: https://godbolt.org/z/c3WWrG965

So, overall it sounds like you are suggesting to overload reassoc to give it a second meaning, quite distinct from the meaning documented in LangRef. Why is that preferable over having a dedicated flag that avoids any confusion due to overloading?

No, I didn't mean it has a second meaning. We use reassoc exactly according to its definition. That's why different targets can choose different lowerings. I'm not forcing a "tree-like" order onto reassoc either. I mean we don't necessarily need to introduce a dedicated flag, given that we have a constant "tree-like" order with reassoc on X86.

@RalfJung
Contributor Author

RalfJung commented Feb 25, 2024

We use reassoc exactly according to its definition.

You are not.

reassoc generally means "pick any order", with "any" meaning the backend can unilaterally change what it does tomorrow without consulting anyone else. It's very clearly documented in the LangRef. Now you are saying reassoc on a particular intrinsic has a different meaning from anywhere else, where it means "on x86, specifically tree-shaped".

I don't see the flag generating unexpected results with multiple "reduce" operations, from O0 through Ofast: https://godbolt.org/z/c3WWrG965

As I said above, that doesn't matter. It just means today's optimizations don't do this. But the point of LLVM IR having a LangRef is that we can add more optimizations without having to review every frontend and backend again. We only have to make sure the optimizations are allowed according to the LangRef. And the LangRef says that re-associating multiple calls to reassoc @llvm.vector.reduce.fadd is allowed. It is mere coincidence that today no optimization pass kicks in on your example.

I don't think Alive supports enough SIMD and fast-math to be useful here, but if it did, it would say that optimizing the program to arbitrarily reorder the constants I put in _mm512_set_ps would be legal. It would even allow swapping some of these constants with lanes of x. There's no reason LLVM would do such a transformation in this concrete case, but this is just an example, and it's impossible to predict what happens in more complicated real code.

@phoebewang
Contributor

reassoc generally means "pick any order", with "any" meaning the backend can unilaterally change what it does tomorrow without consulting anyone else. It's very clearly documented in the LangRef. Now you are saying reassoc on a particular intrinsic has a different meaning from anywhere else, where it means "on x86, specifically tree-shaped".

Again, they don't conflict. X86 always choosing the tree-shaped order conforms to the reassoc semantics in IR. And a backend is always free to choose a particular lowering for a particular intrinsic as long as it conforms to the intrinsic's semantics. The promise comes from the X86 backend rather than from changing anything in the semantics of reassoc in IR.

And the LangRef says that re-associating multiple calls to reassoc @llvm.vector.reduce.fadd is allowed.

I can't find that in the LangRef. Can you point me to it?

I don't think Alive supports enough SIMD and fast-math to be useful here, but if it did, it would say that optimizing the program to arbitrarily reorder the constants I put in _mm512_set_ps would be legal. It would even allow swapping some of these constants with lanes of x. There's no reason LLVM would do such a transformation in this concrete case, but this is just an example, and it's impossible to predict what happens in more complicated real code.

I assume you may expect some transformation like from

  %0 = call reassoc float @llvm.vector.reduce.fadd.v16f32(float -0.000000e+00, <16 x float> %x)
  %vecinit15.i = insertelement <16 x float> <float 9.000000e+00, float 0x401ACCCCC0000000, ..., float poison>, float %0, i64 15
  %1 = call reassoc float @llvm.vector.reduce.fadd.v16f32(float -0.000000e+00, <16 x float> %vecinit15.i)

to

  %0 = shufflevector <16 x float> %x, <16 x float> <float 9.000000e+00, float 0x401ACCCCC0000000, ..., float poison>, <32 x i32> <...>
  %1 = call reassoc float @llvm.vector.reduce.fadd.v32f32(float -0.000000e+00, <32 x float> %0)

I don't think it can happen, even in theory. Note that reassoc is a local flag on the single llvm.vector.reduce.fadd.v16f32 intrinsic; it is not a global flag on all three expressions. We lack a way to represent a reassoc relationship from the first reduction's result to the second reduction operation in IR. And even if that gets added in the future, there would have to be a way to distinguish a local reassoc from a global one. For the former, we can still preserve the particular order within one intrinsic; the latter would only be used in fast-math scenarios, where the order already doesn't matter.

@RalfJung
Contributor Author

RalfJung commented Feb 25, 2024

Again, they don't conflict. X86 [...]

There's a conflict between the frontend and the LLVM IR LangRef. The backend is fine and irrelevant here; any discussion of what the backend does is moot. Unless the LangRef says so, the backend can't "reach through" directly to the frontend. It is important that optimizations are aware of any place where such a "reach through" happens, since they then have to be careful with how they do their transformations.

LangRef says that reassoc means "Allow reassociation transformations for floating-point instructions. This may dramatically change results in floating-point". This is the contract that all parties using LLVM IR are bound by. This means addition of the elements in the vector can happen in any order. That is not the intent of _mm512_reduce_add_ps, therefore something needs to change:

  • either clang needs to generate different IR
  • or the LangRef needs to change to provide stronger guarantees when reassoc is applied to vector.reduce operations

I don't think it can happen, even in theory. Note that reassoc is a local flag on the single llvm.vector.reduce.fadd.v16f32 intrinsic; it is not a global flag on all three expressions.

When I do reassoc on normal scalar floating-point addition, it's also a per-operation "local" flag, and yet it controls how multiple of these operations following after another work. That's how reassoc works: when a value flows from one operation directly to another, and both have reassoc, then the computation may be re-associated.
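As a scalar analogue (a hypothetical helper, purely for illustration): if both additions below carry reassoc, for instance because the function is compiled with reassociation enabled, the compiler may evaluate a + (b + c) instead of (a + b) + c, precisely because the value flows from one reassoc operation into the next.

/* Hypothetical helper: with reassociation permitted on both additions,
   (a + b) + c may legally be evaluated as a + (b + c). */
float sum3(float a, float b, float c) {
    float t = a + b;   /* first addition  */
    return t + c;      /* second addition */
}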

the latter would only be used in fast-math scenarios, where the order already doesn't matter.

reassoc is inherently a fast-math flag. That's the scenario it is made for, no matter where in the IR it is used. At least, I haven't found any docs that would indicate anything else.

@arsenm
Contributor

arsenm commented Feb 25, 2024

I don't think it's reasonable for any fast math flag to be set from purely the context of coming from a target intrinsic like this

@phoebewang
Contributor

phoebewang commented Feb 25, 2024

When I do reassoc on normal scalar floating-point addition, it's also a per-operation "local" flag, and yet it controls how multiple of these operations following after another work. That's how reassoc works: when a value flows from one operation directly to another, and both have reassoc, then the computation may be re-associated.

It looks like this is where the confusion is coming from. The FMF flags have distinct scopes: e.g., arcp and afn are "local" in scope, while contract and reassoc are typically used across instructions. But that is because LLVM lacks a way to represent a scope over multiple fadd, fmul, etc. operations. The vector.reduce operations, by contrast, have an inherent scope within the vector, so they cannot be used the same way as fadd, fmul, etc.

I don't think we changed the meaning of reassoc on vector.reduce operations, so we don't need to change LangRef for it. Instead, we need to change the meaning of reassoc on vector.reduce operations if we want the opposite behavior, i.e., reassociate across multiple reduction operations.

reassoc is inherently a fast-math flag. That's the scenario it is made for, no matter where in the IR it is used.

The fast-math flags attached to IR instructions are fine-grained. They are intended to be used locally. A few of these flags do not mean the whole IR is compiled under a fast-math option.

I don't think it's reasonable for any fast math flag to be set from purely the context of coming from a target intrinsic like this

We already have a few target intrinsics used this way. I don't think there's any problem, given that fast-math flags are fine-grained on these instructions.

@arsenm
Contributor

arsenm commented Feb 26, 2024

It looks like this is where the confusion is coming from. The FMF flags have distinct scopes: e.g., arcp and afn are "local" in scope, while contract and reassoc are typically used across instructions.

This isn't really accurate. The flag is for that instruction only. Traditionally the usage of reassoc in practice would infect neighboring instructions, but #71277 starts moving towards fixing this.

We already have a few target intrinsics used this way. I don't think there's any problem, given that fast-math flags are fine-grained on these instructions.

And I already take issue with those, they should all be removed.

@RalfJung
Contributor Author

RalfJung commented Feb 26, 2024

It looks like this is where the confusion is coming from. The FMF flags have distinct scopes: e.g., arcp and afn are "local" in scope, while contract and reassoc are typically used across instructions. But that is because LLVM lacks a way to represent a scope over multiple fadd, fmul, etc. operations. The vector.reduce operations, by contrast, have an inherent scope within the vector, so they cannot be used the same way as fadd, fmul, etc.

Which part of the documentation do you base this on? If it's not in the documentation then either way the LangRef needs to be updated. (Though I agree with @arsenm that ideally fast-math flags would just not be used here.)

Even if reassoc somehow is "local" in scope for vector.reduce, there's still a further change needed: LangRef would have to say that on X86, reassoc vector.reduce will definitely use tree-shaped reduction. This is required to reflect _mm512_reduce_add_ps semantics. Without that, LLVM passes may e.g. arbitrarily reorder the elements of the vector in

#include <immintrin.h>
float foo() {
    __m512 y = _mm512_set_ps(
        42.42, 1.8, 9.3, 0.0, 2.5, 0.0, 6.7, 9.0,
        0.0, 1.8, 9.3, 0.0, 2.5, 0.0, 6.7, 9.0
    );
    return _mm512_reduce_add_ps(y);
}

@phoebewang
Contributor

phoebewang commented Feb 26, 2024

And I already take issue with those, they should all be removed.

Do you mean #71277 or another one? IIUC, #71277 is a bug in InstCombine. I don't see what the problem is with using fast-math flags in target intrinsics.

Which part of the documentation do you base this on? If it's not in the documentation then either way the LangRef needs to be updated.

That's my understanding; @arsenm rephrased it more accurately. I agree that's something we should improve in the documentation, but I'm afraid I cannot give a precise description.

Even if reassoc somehow is "local" in scope for vector.reduce, there's still a further change needed: LangRef would have to say that on X86, reassoc vector.reduce will definitely use tree-shaped reduction. This is required to reflect _mm512_reduce_add_ps semantics.

I don't think the LangRef is a good place to describe target-specific behavior.
Actually, the semantics of _mm512_reduce_add_ps are quite clear. We can divide it into 3 parts:

  • _mm512_reduce_add_ps to reassoc vector.reduce in FE;
  • reassoc vector.reduce from FE to BE;
  • Tree-shaped lowering of reassoc vector.reduce in X86 BE;

Parts 1 and 3 are target-specific code, so we can guarantee they won't be broken by other code.
I also addressed your concern that part 2 may be twisted by middle-end transformations. That won't come true because:

  • A pass shouldn't remove reassoc in any case;
  • reassoc applies to the associated instruction only;

I don't think there is anything unclear here that needs extra documentation, except for the reassoc semantics.

Without that, LLVM passes may e.g. arbitrarily reorder the elements of the vector in

It cannot happen, given that reassoc applies to the associated instruction only.

@arsenm
Contributor

arsenm commented Feb 29, 2024

And I already take issue with those, they should all be removed.

Do you mean #71277 or another one? IIUC, #71277 is a bug in InstCombine.

Yes, the bug is permissive fast math flag handling. It only considers reassoc on one instruction and allows it to infect its neighbors.

I don't see what the problem is with using fast-math flags in target intrinsics.

The builtin does not mean "go fast"; it means "give me the behavior of this instruction". This is not a license for relaxed handling.

@phoebewang
Contributor

The builtin does not mean "go fast"; it means "give me the behavior of this instruction". This is not a license for relaxed handling.

In fact, the vector.reduce.f* builtins do have two behaviors: one in sequential order and one in relaxed order, controlled by reassoc. It is documented in the LangRef: https://llvm.org/docs/LangRef.html#llvm-vector-reduce-fadd-intrinsic

@arsenm
Contributor

arsenm commented Feb 29, 2024

The builtin does not mean "go fast"; it means "give me the behavior of this instruction". This is not a license for relaxed handling.

In fact, the vector.reduce.f* builtins do have two behaviors: one in sequential order and one in relaxed order, controlled by reassoc. It is documented in the LangRef: https://llvm.org/docs/LangRef.html#llvm-vector-reduce-fadd-intrinsic

This does not mean that _mm512_reduce_add_ps implies it can use a relaxed order by default

@phoebewang
Contributor

The builtin does not mean "go fast"; it means "give me the behavior of this instruction". This is not a license for relaxed handling.

In fact, the vector.reduce.f* builtins do have two behaviors: one in sequential order and one in relaxed order, controlled by reassoc. It is documented in the LangRef: https://llvm.org/docs/LangRef.html#llvm-vector-reduce-fadd-intrinsic

This does not mean that _mm512_reduce_add_ps implies it can use a relaxed order by default

_mm512_reduce_add_ps as a target intrinsic can have a specific reduction order;
vector.reduce.fadd as a target-independent builtin provides both a sequential order and a relaxed order;
So what's the problem if we always map _mm512_reduce_add_ps to one of those orders?

@arsenm
Contributor

arsenm commented Feb 29, 2024

_mm512_reduce_add_ps as a target intrinsic can have a specific reduction order; vector.reduce.fadd as a target-independent builtin provides both a sequential order and a relaxed order; So what's the problem if we always map _mm512_reduce_add_ps to one of those orders?

This is an overly clever recycling of the reassoc bit. Other code relying on treating it as a normal FMF flag will break. What is supposed to happen when a transform looks to reassociate the reduction with another operation?

More importantly, fast-math flags can be freely dropped. Relying on this to get a target-specific lowering makes it semantically load-bearing, which is not OK.

@RalfJung
Contributor Author

(Strangely Github does not seem to send email notifications for @phoebewang's messages... is anyone else having that same problem? Their earlier posts triggered notifications just fine, but the recent ones did not. I am still getting notifications for other people in this thread. Very strange.)

I don't think the LangRef is a good place to describe target-specific behavior.

Any target-specific behavior that optimizations must be aware of must be in the LangRef. I agree that it is not good when that happens, LangRef and LLVM IR in general should have target-independent behavior. That's why there shouldn't be target-specific behavior to begin with. But you are the one suggesting that reassoc should have a target-specific meaning. Are you suggesting that we should make it target-specific but then not document that? That's the worst of all possible outcomes, actively deciding to put a footgun into LLVM IR semantics. It means all LLVM IR passes need to be careful not to break undocumented target-specific behavior, which is a lot worse than having to describe target-specific behavior in the LangRef.

But of course the better fix, IMO, is to not have target-specific behavior in the first place and to not use reassoc for things entirely unrelated to fast-math.

@phoebewang
Contributor

This is an overly clever recycling of the reassoc bit. Other code relying on treating it as a normal FMF flag will break.

I'm not strongly opposed to turning the reassoc bit into a boolean parameter, but I still think the reassoc semantics perfectly fit the functionality of vector.reduce.fadd. It also works as a normal FMF flag; if it breaks some code, that is probably a problem with that code.

What is supposed to happen when a transform looks to reassociate the reduction with another operation?

We don't have a definition of a reassociated reduction; we use reassoc to decorate the fadd operations inside a reduction.
Even if we want reassoc-across-reduction semantics one day, we would still need to distinguish the two, so that is not a real concern.

More importantly, fast-math flags can be freely dropped. Relying on this to get a target-specific lowering makes it semantically load-bearing, which is not OK.

That's the opposite of my recollection. I recall we treated dropping fast-math flags as bugs and fixed them several years ago. Not sure if there are still unsolved issues.

But you are the one suggesting that reassoc should have a target-specific meaning.

That's not true. I explained that we use the natural semantics of reassoc here.
And reassoc on vector.reduce.f* means each target can select its own association order. There's no target-specific meaning here either.
The target-specific part only happens in target-specific code in the frontend and backend. vector.reduce.f* has constant, target-independent semantics throughout the process.

@arsenm
Contributor

arsenm commented Feb 29, 2024

That's the opposite of my recollection. I recall we treated dropping fast-math flags as bugs and fixed them several years ago. Not sure if there are still unsolved issues.

This is buggy by design, and these bugs will recur. This is special-casing that increases complexity. That FMF flags are droppable optimization hints is a deeper truth than this forgettable note on a specific intrinsic.

@RalfJung
Contributor Author

RalfJung commented Feb 29, 2024

We don't have a definition of a reassociated reduction; we use reassoc to decorate the fadd operations inside a reduction.

We have a definition, it's even in the LangRef: it means the addition can happen in any order.

But you are the one suggesting that reassoc should have a target-specific meaning.

That's not true. I explained that we use the natural semantics of reassoc here.

You claimed that but your claim doesn't match up with the docs.

And reassoc on vector.reduce.f* means each target can select its own association order.

No. It means that the program has the semantics "pick an arbitrary order". There's a big difference between

  • some specific order that is chosen by the backend (but no other part of the pipeline has any say about what the order is)
  • an absolutely arbitrary order, where every pass may change the order in any way it pleases

The LangRef doesn't say that the target can pick an order, the LangRef says that the order is not preserved. The meaning of this is unambiguous: any LLVM IR pass may change the order any way it pleases.

The target-specific part only happens in target-specific code in the frontend and backend. vector.reduce.f* has constant, target-independent semantics throughout the process.

It is not logically possible to say that there are target-specific semantics in frontend and backend but not in the middle. Semantics flow from frontend to IR and from IR to backend. Anything that's not captured in the IR is lost and cannot be recovered.

If I (the frontend) give a friend of mine (the middle-end) a bunch of numbers and say "please add them in any order", and then that friend shuffles the numbers and then goes to their friend (the backend) and says "please add them in any order", and then the backend adds them in tree order -- then my friend has perfectly satisfied the request that I made. I can't later say "oh but when I said 'in any order' I meant 'in the same order as what the backend would do'". If that's what I want I have to say it!

@RalfJung
Contributor Author

RalfJung commented Feb 29, 2024

Trying to make things more concrete:

Consider a snippet of the form _mm512_reduce_add_ps(_mm512_set_ps(...)) . I claim that:

  • Currently we could add a pass that arbitrarily reorders the arguments to set. It doesn't matter whether there's a practical reason why one would want to do this, it's an example -- think of it as a thought experiment. Imagine something similar to this comes up as part of a larger transformation or so, it's entirely irrelevant. Relevant is only the question: is that transformation allowed? And given the LLVM IR that clang generates, and the semantics described in the LangRef, I think the answer is unambiguous: yes this is allowed.
  • Meanwhile, the Intel docs say that the addition has to happen in a very specific order.

Do you agree to one of these statements? Both? If you agree to both you must agree that there is a bug. If you disagree with one then I would be curious on which basis.

@phoebewang
Contributor

Relevant is only the question: is that transformation allowed? And given the LLVM IR that clang generates, and the semantics described in the LangRef, I think the answer is unambiguous: yes this is allowed.

I hold the opposite point. It's unambiguously not allowed. Two reasons:

We have a definition, it's even in the LangRef: it means the addition can happen in any order.

But that is limited to a few operations and to each operation's own scope.
As a thought experiment, consider a reassociated reduction, i.e., arbitrarily moving elements across reduction instructions. That is definitely not defined.
And reducing the first 16 elements in arbitrary order and then reducing the result with the other 15 elements is different from arbitrarily reducing all 31 elements as a whole. We would need to invent a new flag to distinguish the two. So it is not a problem in the future either.

The meaning of this is unambiguous: any LLVM IR pass may change the order any way it pleases.
If I (the frontend) give a friend of mine (the middle-end) a bunch of numbers and say "please add them in any order", and then that friend shuffles the numbers and then goes to their friend (the backend) and says "please add them in any order", and then the backend adds them in tree order -- then my friend has perfectly satisfied the request that I made. I can't later say "oh but when I said 'in any order' I meant 'in the same order as what the backend would do'". If that's what I want I have to say it!

That's based on your assumption that the middle-end can shuffle it. I proved at the beginning that it is not allowed.

It is not logically possible to say that there are target-specific semantics in frontend and backend but not in the middle.

To be specific, I'm saying that packaging _mm512_reduce_add_ps into reassoc vector.reduce and unpackaging reassoc vector.reduce into the tree-like order are both under the control of target-specific code. The middle end is like a postman that passes reassoc vector.reduce from the frontend to the backend without any modification. I proved this point above.

Meanwhile, the Intel docs say that the addition has to happen in a very specific order.

I'm not the original author of the intrinsic, but I understand the goal is to provide a fast implementation of the reduction operations. The order was never an important factor; one proof is that we even described it in sequential order in the document for years.
The tree-like order is not interesting in itself (it is not even a strict binary tree order). It is just the most efficient order for X86; that's it. If the order is a concern, the user should not use the intrinsic in the first place.
vector.reduce.f* shares the same design philosophy: it only cares about two scenarios, sequential order and a fast implementation. The latter meets the exact requirement of _mm512_reduce_add_ps, so we switched to it.
Since the fast implementation consistently uses this order, the document faithfully reflects it.

This is buggy by design, and these bugs will recur. This is special-casing that increases complexity. That FMF flags are droppable optimization hints is a deeper truth than this forgettable note on a specific intrinsic.

Saying the whole FMF design is buggy is beyond the topic we are discussing here.
By that reasoning, we could extend it to say that any optimization is droppable.
Then, with _mm512_reduce_add_ps being an optimized reduction, the fast implementation would be droppable too.
So either way, we can still use the flag as is.

@arsenm
Contributor

arsenm commented Mar 6, 2024

That's based on your assumption that the middle-end can shuffle it. I proved at the beginning that it is not allowed.

I strongly disagree this is proven not allowed. My interpretation of the LangRef is that reassoc on the vector.reduce implies the lowered form of the intrinsic is a sequence of fadds, each with reassoc set. It is correct to apply this transformation at any point, and the IR is not simply a vehicle for delivering it to the backend.

The middle end is like a postman that passes reassoc vector.reduce from the frontend to the backend without any modification. I proved this point above.

This is absolutely not how the IR works. The IR has a meaning standalone, independent of any frontend usage or backend lowering decisions. This is a fundamental aspect of the IR in a modular compiler.

By that reasoning, we could extend it to say that any optimization is droppable.

Yes, precisely. Optimization hints are not semantically enforceable

Then, with _mm512_reduce_add_ps being an optimized reduction, the fast implementation would be droppable too.
So either way, we can still use the flag as is.

This is not a request for an optimized reduction, it's a request for the instruction by name

I insist on several points:

  1. reassoc is a droppable fast-math flag. The reduce intrinsics do not get special rules for this.
  2. The builtin does not imply the intent is to go fast, or that there is any flexibility in the interpretation.
  3. reassoc or any other fast-math flag should not be implied without any fast-math-enabling context.

@phoebewang
Contributor

phoebewang commented Mar 6, 2024

I strongly disagree this is proven not allowed.

I insist on my points.

My interpretation of the LangRef is that reassoc on the vector.reduce implies the lowered form of the intrinsic is a sequence of fadds, each with reassoc set.

Do you mean reassoc vector.reduce.fadd = reassoc fadd + reassoc fadd + ...?
I think that violates both definitions and conflicts with your previous point that the "flag is for that instruction only".

Let's consider two scenarios:

  1. The user is confident it's safe to reassociate among the vector elements, and the result is safe to reassociate with other fadds;
  2. The user is confident it's safe to reassociate among the vector elements and the outer fadds;

You cannot make reassoc vector.reduce.fadd a sequence of fadds if you agree there is a difference between these two scenarios.

We either need to introduce a mechanism to distinguish them, or not to support 2) at all.

Note that turning 2) into 1), or even turning 1) into a sequential vector.reduce.fadd, is allowed, because "optimization is not enforceable". But turning 1) into 2) is not allowed, because they have different semantics.

It is correct to apply this transformation at any point, and the IR is not simply a vehicle for delivering it to the backend.

This is a single vector value within a single intrinsic. How can a transformation change the element order without extra operations like a shuffle? And how can a reassoc on vector.reduce.fadd affect a shuffle instruction?

It turns out it is a safe vehicle, given that we proved transformations can do nothing with it.

I insist on several points:

  1. reassoc is a droppable fast-math flag. The reduce intrinsics do not get special rules for this.
  2. The builtin does not imply the intent is to go fast, or that there is any flexibility in the interpretation.
  3. reassoc or any other fast-math flag should not be implied without any fast-math-enabling context.
  1. Being ignored by optimizations is different from being droppable in the middle end. We have function attributes and module flags used for optimization too; can those be dropped by early passes just because they are only used for optimization? Similarly, dropping a fast-math flag is a bug in a middle-end pass.

  2. The LangRef doesn't phrase it in terms of "go fast". But what are fast-math flags designed for if they cannot make things fast?

  3. If fast-math flags are not allowed to be used per instruction, why don't we just keep them as a function attribute?

This is not a request for an optimized reduction, it's a request for the instruction by name

If we go back to _mm512_reduce_add_ps itself, we don't need such complicated explanations. The intrinsic is not mapped to a single instruction but to a sequence of instructions. I helped review the document change, and I'm clear that the intention is to remind users the elements are not added in sequential order. We can update it to note that the order is not always constant if necessary. It's not a hard requirement of the intrinsic.

@RalfJung
Contributor Author

RalfJung commented Mar 6, 2024

I hold the opposite point. It's unambiguously not allowed.

The docs say it is allowed. I don't know what else there is to say here. Why do you think you can just ignore the docs?

You are trying to infect neighboring instruction _mm512_set_ps(...) by a flag to _mm512_reduce_add_ps only.

No I am not. The set_ps is not involved at all. Everything I said happens within the reduction. The set_ps is only involved to create the vector that is being added; it entirely keeps its standard semantics. It's just about the order in which reduce does the addition. It can be any order. That's what the docs say. Therefore the elements can be arbitrarily reordered before adding them.

But what are fast-math flags designed for if they cannot make things fast?

They make things fast by allowing the middle-end to arbitrarily reorder arithmetic operations.

If fast-math flags are not allowed to be used per instruction, why don't we just keep them as a function attribute?

They affect only the operations that carry the flags. I can have other float ops in the same function that are guaranteed to still have the standard IEEE semantics.

@arsenm wrote

This is absolutely not how the IR works. The IR has a meaning standalone, independent of any frontend usage or backend lowering decisions. This is a fundamental aspect of the IR in a modular compiler.

Absolutely. That seems to be the fundamental misunderstanding here. @phoebewang does not accept the IR as an abstraction, a language in its own right. However, treating it as such is the only way to keep an ecosystem like LLVM together. That's why the LangRef exists in the first place. It is the only authority in a question like this. And it is unambiguous in this instance. @phoebewang, you keep making claims that are entirely unsubstantiated by the LangRef. I don't know why you think that would be a valid argument. It is not.

I don't think a change to the LangRef that changes reassoc to say "except on reduce it means something else" would be accepted. It should not, reassoc already has a purpose: it means the middle-end can change the order of additions arbitrarily.

But it also doesn't seem like this discussion is going anywhere, we're just repeating the same statements over and over. I hope an intrinsic/flag for "tree reduction" can be added, then Rust will use it -- whether clang picks that up is up to them.

@phoebewang
Contributor

(Strangely Github does not seem to send email notifications for @phoebewang's messages... is anyone else having that same problem? Their earlier posts triggered notifications just fine, but the recent ones did not. I am still getting notifications for other people in this thread. Very strange.)

It happens to me this time :)

That's why the LangRef exists in the first place. It is the only authority in a question like this. And it is unambiguous in this instance.

@phoebewang, you keep making claims that are entirely unsubstantiated by the LangRef. I don't know why you think that would be a valid argument. It is not.

I don't think a change to the LangRef that changes reassoc to say "except on reduce it means something else" would be accepted. It should not, reassoc already has a purpose: it means the middle-end can change the order of additions arbitrarily.

Alright, let's interpret the LangRef literally. In the Fast-Math Flags section, it says:

LLVM IR floating-point operations (fneg, fadd, fsub, fmul, fdiv, frem, fcmp), phi, select and call may use the following flags to enable otherwise unsafe floating-point transformations.

It's unambiguous that the middle end can do optimizations only among these 10 operations, including call. For the call operation, it says:

The optional fast-math flags marker indicates that the call has one or more fast-math flags, which are optimization hints to enable otherwise unsafe floating-point optimizations. Fast-math flags are only valid for calls that return a floating-point scalar or vector type, or an array (nested to any depth) of floating-point scalar or vector types.

It's not unambiguous whether the middle end can do optimizations on the call instruction itself only, or also on (identifying and optimizing) the functions/intrinsics it calls.

So it looks to me that the question is based on a claim unsubstantiated by the LangRef: that the middle end can do reassoc transformations on the reduction intrinsics.

I don't know if in practice there are such optimizations that do the latter, but that's not important. We just need to list these intrinsics together with those operations to make the doc clear.

That is to say, we don't have to declare that reduce is not supported by the middle end, but we should declare its support (and how) if we do support it.

They make things fast by allowing the middle-end to arbitrarily reorder arithmetic operations.

The middle end is not the only consumer of fast-math flags; they are consumed a lot by the backend too.

I hope an intrinsic/flag for "tree reduction" can be added, then Rust will use it -- whether clang picks that up is up to them.

That would be an independent thing, and I'm not opposed to it.
My only concern is whether it is worth the complexity. Note that for the X86 intrinsics, the order is not a typical binary tree reduction. That means we either make it one of several "tree reduction" options, or there's no benefit to X86. For the former, each option needs an emulation on targets that don't have an optimal lowering. For the latter, why don't we just do a shuffle instead? Shuffle instructions are generally well tuned by each target.
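As a rough sketch of that alternative (using only AVX512F/AVX intrinsics; the helper name is made up), the same tree-shaped order can be spelled out with explicit extracts and adds, so the association order lives in the source rather than in the intrinsic:

#include <immintrin.h>

/* Sketch: the tree-shaped order written out explicitly, halving the active
   width at each step (16 -> 8 -> 4 -> 2 -> 1 lanes). */
static float reduce_add_explicit(__m512 v) {
    __m256 lo256 = _mm512_castps512_ps256(v);
    __m256 hi256 = _mm256_castpd_ps(_mm512_extractf64x4_pd(_mm512_castps_pd(v), 1));
    __m256 s256  = _mm256_add_ps(lo256, hi256);
    __m128 lo128 = _mm256_castps256_ps128(s256);
    __m128 hi128 = _mm256_extractf128_ps(s256, 1);
    __m128 s128  = _mm_add_ps(lo128, hi128);
    __m128 s64   = _mm_add_ps(s128, _mm_movehl_ps(s128, s128));   /* lanes j + j+2 */
    __m128 s32   = _mm_add_ss(s64, _mm_shuffle_ps(s64, s64, 0x55)); /* lane 0 + lane 1 */
    return _mm_cvtss_f32(s32);
}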

@arsenm
Contributor

arsenm commented Mar 8, 2024

So it looks to me that the question is based on a claim unsubstantiated by the LangRef: that the middle end can do reassoc transformations on the reduction intrinsics.

reassoc means reassoc for any FPMathOperator, which is open to interpretation of what reassoc means for every operation. The LangRef does not spell out what it means for every single opcode. The fact that reassoc tries to call out a specific behavior in this one case is problematic. The LangRef can hand wave about this case, but that doesn't mean it's a good or workable design.

I think the reduce reassoc flag treatment just needs to be dropped. It's incompatible with the design of fast-math flags. It doesn't compose well with neighboring instructions after lowering, and cannot be semantics-bearing. These intrinsics would need to grow an additional operand for the reduction ordering or similar (which could then have different interpretations of how reassoc and the ordering type interact during lowering).

@phoebewang
Contributor

I think the reduce reassoc flag treatment just needs to be dropped. It's incompatible with the design of fast-math flags. It doesn't compose well with neighboring instructions after lowering, and cannot be semantics-bearing. These intrinsics would need to grow an additional operand for the reduction ordering or similar (which could then have different interpretations of how reassoc and the ordering type interact during lowering).

That just defers the ambiguous interpretation to the backends. There's no blacklist mechanism to stop emitting a reassoc on reduction intrinsics:

%reduce = call reassoc float @llvm.vector.reduce.fadd.v4f32(float -0.0, <4 x float> %input, bool %seq)

So backends now face the puzzle of whether to honor reassoc or %seq when lowering it to a sequence of fadd operations.

I think we should review the semantics of fast-math flags based on how they are being used. Given that both the middle end and the backend are consumers, I think there should be two basic rules for the middle end:

  • Pass these flags to the backend intact if a pass doesn't optimize the instruction;
  • Otherwise optimize accordingly, reassign the proper flags, and pass them to the backend intact (through the following passes);

We don't need to distinguish them for fadd, fmul, etc., but call is special. My interpretation is that the call instruction in the middle end is designed to consume only the "value related" flags, e.g., nnan, nsz, etc., which is similar to phi and select. One piece of side evidence is that the doc emphasizes "only valid for calls that return a floating-point type".

OTOH, some called functions/intrinsics do need "action related" flags, e.g., afn. Their usage is distinct from the above case and is somewhat similar to a calling-convention attribute like "fastcc".

In a word, I think we cannot assume the middle end can arbitrarily consume fast-math flags attached to a call instruction, especially the "action related" flags. We should clarify that flags are preserved together with the called functions/intrinsics by default in the middle end. And considering that functions/intrinsics are usually transparent to the middle end, we can give a list of those that might be modified by it. Fast-math flags associated with other functions/intrinsics are otherwise invariant to the middle end according to these semantics.

@arsenm
Contributor

arsenm commented Mar 14, 2024

I think we should review the semantics of fast-math flags based on how they are being used. Given that both the middle end and the backend are consumers, I think there should be two basic rules for the middle end:

They have to have one semantic meaning. Special casing the meaning is simply a bad design

  • Pass these flags to the backend intact if a pass doesn't optimize the instruction;
  • Otherwise optimize accordingly, reassign the proper flags, and pass them to the backend intact (through the following passes);

We don't need to distinguish them for fadd, fmul, etc., but call is special. My interpretation is that the call instruction in the middle end is designed to consume only the "value related" flags, e.g., nnan, nsz, etc., which is similar to phi and select. One piece of side evidence is that the doc emphasizes "only valid for calls that return a floating-point type".

No. There is nothing special about call. It's simply an alternative representation of an IR operation. It cannot have special rules for common flags

In a word, I think we cannot assume the middle end can arbitrarily consume fast-math flags attached to a call instruction, especially the "action related" flags. We should clarify that flags are preserved together with the called functions/intrinsics by default in the middle end. And considering that functions/intrinsics are usually transparent to the middle end,

They aren't transparent; the entire point of intrinsics is that they have compiler-known behavior. Trying to special-case call simply breaks fast-math flags. They must have a uniform interpretation.

@phoebewang
Contributor

phoebewang commented Mar 16, 2024

They have to have one semantic meaning. Special casing the meaning is simply a bad design

It's specific to the characteristics of fast-math flags and call instructions, and we can explain it in the LangRef. Speaking of bad design, it is already bad to mix "value related" flags with "action related" flags in the same group. I think it would be clearer if we had the two groups of flags defined separately.

No. There is nothing special about call. It's simply an alternative representation of an IR operation. It cannot have special rules for common flags

call is special when combined with fast-math flags. It's undocumented, but I'll reaffirm my previous interpretation:

  • Fast-math flags have an inherent scope; a flag is only used for the instruction it is associated with, with a few exceptions;
  • The only exceptions are reassoc and contract, which at present are only meaningful when used together with other instructions;

As I explained above, the exceptions result from LLVM lacking a mechanism to represent a scope over multiple fadd and fmul instructions, so the current usage effectively represents that scope by putting the same flag on the adjacent instructions.

If we interpret fast-math flags this way, it is natural that reassoc vector.reduce only means reassociation inside the vector.reduce.

Furthermore, it implies the middle end only has limited permission to optimize based on fast-math flags, considering that the general middle-end optimizations work across instructions.

This is not to say the middle end won't do optimizations like transforming an afn llvm.sqrt into a sequence of instructions. The key is that the combination of fast-math flags and an intrinsic call is invariant by the time it reaches the backend. That is:

  • The middle end cannot drop a single fast-math flag if it doesn't optimize the intrinsic;
  • The intrinsic must be transformed into a different one if the middle end does a transformation;
  • The intrinsic must be identical to the one created by the frontend (meaning elements won't be shuffled by the middle end in the case of reassoc vector.reduce) when passed to the backend;

The third bullet looks controversial, but I have proved above that it is guaranteed in theory.

They aren't transparent; the entire point of intrinsics is that they have compiler-known behavior. Trying to special-case call simply breaks fast-math flags. They must have a uniform interpretation.

I interpreted it from the backend's perspective. By transparent, I mean the middle end either bypasses an intrinsic or transforms it into a different one. So when the backend sees an intrinsic, it must have been transparent to the middle end.

Another reason is that call implies all intrinsics follow the same rule, but that is not true for fast-math flags. Most intrinsics don't support fast-math flags; besides, some intrinsics like reassoc vector.reduce forbid middle-end optimization (as explained previously). Apparently, it's clearer to just list the intrinsics that the middle end is allowed to transform.
