[mlir][Transforms] Dialect conversion: Add missing "else if" branch #101148
Conversation
This code got lost in #97213 and there was no test for it. Add it back with an MLIR test. When a pattern is run without a type converter, we can assume that the new block argument types of a signature conversion are legal.
I verified this fixes #97213 (comment).
@@ -1328,15 +1328,19 @@ Block *ConversionPatternRewriterImpl::applySignatureConversion(
mapping.map(origArg, argMat);
appendRewrite<ReplaceBlockArgRewrite>(block, origArg);
Another thing that I noticed while debugging around here: line 1329 adds to the mapping and appends a pending block argument rewrite, and so does line 1349 below. Also, the `builtin.unrealized_conversion_cast` generated on line 1325 is overridden by the one generated on line 1345 in some cases (with the mapping changed as well). I don't think that was intended. It might just be a harmless issue, but it might be hiding a bug.
There are two mapping steps: `origArg -> argMat -> targetMat`. Note that the mapping is not overwritten here. In the second step, we map `argMat`, not `origArg`.

And we also generate two `unrealized_conversion_cast` ops. Such casts cannot be folded unless `type(origArg) == type(targetMat)`.

It was actually on purpose that two casts in a row are generated here. The first one is an argument materialization and the second one is a target materialization. Depending on the configuration of the type converter, we may not generate `unrealized_conversion_cast` ops, but custom ops.
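To make the two-step chain concrete, here is a hedged IR sketch of the two chained casts after a block signature conversion. All types here are invented for illustration (the original argument type, the converted signature type, and the target's legal type), and the ops shown are the default `unrealized_conversion_cast` materializations; a type converter with custom hooks would emit different ops:

```mlir
// Hypothetical: the original block argument had type i16, the signature
// conversion rewrote it to i64, and the legalization target wants i32.
^bb0(%newArg: i64):
  // Argument materialization: produces argMat with the ORIGINAL type.
  // The mapping records origArg -> argMat.
  %argMat = builtin.unrealized_conversion_cast %newArg : i64 to i16
  // Target materialization: produces targetMat with the legal type.
  // The mapping records argMat -> targetMat (note: argMat, not origArg).
  %targetMat = builtin.unrealized_conversion_cast %argMat : i16 to i32
```

Because `type(origArg)` (i16 in this sketch) differs from `type(targetMat)` (i32), the back-to-back casts cannot be folded away.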
We got a failing test where it now fails to convert a type. It seems we are exactly hitting the case where no converter is specified. The test continues to work if I apply this patch:
Does this patch look reasonable? I tried to look at the earlier code that you tried to restore here, and it seems the mapping logic is still missing, so I tried adding it back.

That could work, but we should then probably build the argument materialization only when there is a type converter. When there is none and the number of replArgs is 1, directly take that argument and do not build any materializations. But I'd like to understand first why the current implementation does not work. I don't see any conceptual problem with it. Can you provide a reproducer that can run with upstream MLIR? We really have to improve our test coverage of the dialect conversion framework.
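A hedged pseudocode sketch of that alternative (simplified; identifiers like `replArgs`, `origArg`, and the helper name follow the discussion above but are not the verbatim implementation):

```
// Simplified pseudocode of the proposed control flow, not actual MLIR code.
if (converter) {
  // A type converter is available: build the argument materialization
  // (and later a target materialization) as before.
  buildArgumentMaterialization(origArg, replArgs);  // hypothetical helper
} else if (replArgs.size() == 1) {
  // No type converter: the caller specified the new signature directly,
  // so take the single replacement argument as-is and skip building
  // any materialization casts.
  mapping.map(origArg, replArgs.front());
  appendRewrite<ReplaceBlockArgRewrite>(block, origArg);
}
```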
Also, what kind of failure are you seeing? Is it a crash?

No, not a crash, it just does not convert where before it did. One of the failing tests is this one:

I do not have a TensorFlow setup right now. Can you post the previous IR and the new IR? Maybe I can tell from there what's going on. Also, this is the kind of change that I had in mind: #101318. Can you check if that fixes the test? Even if it does, we still have to understand where the problem is. And add a test to upstream MLIR that can run without TensorFlow.
Confirmed, your patch also makes sure that the test still passes. Here is the IR in the failing case:
Before:
As far as I can tell, it stays the same. But the test was written with the expectation that it changes the type of the while to `(tensor, tensor<2x3xf32>) -> (tensor, tensor<2x3xf32>)`.

Full expected output:

Which one of these listings is the one produced with top-of-tree MLIR? Are you saying that #101318 makes the test pass, but it passes by coincidence?
In the first two listings, it looks like the …
Agree, in the first two listings it looks like nothing was modified. This starts happening with the revision from the PR. The second listing is with your #101318 patched in, but this also matches the behavior from before this PR.
Also, we have other internal failing tests; it seems they would also be fixed by your #101318.
Sorry, I'm still confused. So the expected behavior is this: With #101148, the result type is … With #101318 (on top of #101148), the result type is also … Is that accurate?

With … we can see what patterns are being applied. I was hoping that there is a similar flag for …
So is this a solution that would fix everything? We still need some kind of reproducer to understand what's going on here (and to prevent this from breaking again in the future; assuming that it's actually broken). Can you copy together some patterns and IR (only the part that's needed to reproduce this) from your code base that triggers the issue? Maybe not the … I'd like to put a test in …
I only spot-checked some failures, and those would be fixed. I can double-check whether it fixes everything. I got the `-debug` flag working; I forgot to pass `--copt=-UNDEBUG`. Here is the relevant part of the output:
Confirmed, all tests would be fixed with #101318.

Passing over to @anlunx to provide a reproducer.

Yes, that's the log that I was looking for. This is with #101318? It would be interesting to see what's happening with top-of-tree MLIR.

This was without #101318, so the case where the test fails. I notice there are now two whiles, one of them being the one we actually want, but somehow in the final IR we will have only the other one.

It is possible that the pattern succeeds at first, but then it rolls back. So we should take a look at the entire output.

(Apologies, I am upgrading internal LLVM without knowing the MLIR stuff going on.) Would #101318 be merged soon? If not, and this PR introduced a regression, could this PR be reverted temporarily?

We don't know yet if this is a regression or incorrect API usage in your code base. So far, I can't see any problem with this commit. (1) If this is a regression: merge #101318; what's still missing is a test case. (2) If this is not a regression: fix the broken code. I know this is blocking your LLVM integrate, but this commit fixes a bug that blocked another project's (IREE) LLVM integrate, so rolling back is not ideal. Ideally, I'd like to merge #101318 only with a test case. Our test coverage of the dialect conversion framework in MLIR is not good, and that's the reason why we have breakages like this one. So I'd say the next step is to write an MLIR-only reproducer. Then I can merge this first thing tomorrow morning. Waiting for @akuegel or @anlunx...

The patch will fix the failures I observed. I hope that @akuegel or @anlunx will provide a test case later, but I am going to merge #101318 now...

@matthias-springer It turns out it was incorrect usage of the API. A workaround was added for the issue you were fixing in this PR. When removing the workaround, the tests pass. Sorry for wasting your time, and thanks for all the help.
This code got lost in #97213 and there was no test for it. Add it back with an MLIR test.
When a pattern is run without a type converter, we can assume that the new block argument types of a signature conversion are legal. That's because they were specified by the user. This won't work for 1->N conversions due to limitations in the dialect conversion infrastructure, so the original `FIXME` has to stay in place.
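A hedged pseudocode sketch of the restored branch described above (simplified; identifiers like `converter`, `argMat`, and `origArg` follow the diff context in this thread, but this is not the verbatim code from the commit):

```
// Simplified pseudocode around the argument materialization in
// applySignatureConversion.
mapping.map(origArg, argMat);
appendRewrite<ReplaceBlockArgRewrite>(block, origArg);

if (converter) {
  // A type converter is available: additionally build a target
  // materialization to a legal type (argMat -> targetMat).
  ...
} else if (/* 1->1 replacement */) {
  // Restored branch: no type converter. The new block argument types
  // were specified directly by the user in the signature conversion,
  // so they are assumed legal and no target materialization is built.
  // (1->N conversions are still excluded; see the FIXME.)
}
```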