-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Add case expr simplifiers for literal comparisons #17743
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @jackkleeman -- this looks very cool and a very nice feature 👌
I had some code suggestions, and some questions about the details of one of the rewrites, but otherwise I think it is very close.
A very nice contribution indeed
Literal pushdown
Interestingly, I would probably have called this "pull up" as it sort of inlines the equality into the THEN expressions. No need to change anything
datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs
Outdated
Show resolved
Hide resolved
&& when_then_expr.len() < 3 // The rewrite is O(n²) so limit to small number | ||
// The rewrite is O(n²) in general so limit to small number of when-thens that can be true | ||
&& (when_then_expr.len() < 3 // small number of input whens | ||
// or all thens are literal bools and a small number of them are true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like it would be valuable to do this rewrite regardless of the terms as long as all the arguments are boolean literals (why still limit to three true cases?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think the comment about it being O(n^2) is very legit, and this is also true in the literal bool case; the thens are literal bools but we dont know anything about the whens and whether they might be reducible. Indeed, in my target use case, the thens are not reducible at all and we definitely end up with On^2 boolean expressions. The intent of this check is to maintain the '<3 when then' limitation, but just be cleverer about how we count when thens, ignoring those where the then is a literal false
We could, if we wanted, split this out into a separate rewrite that removes when_thens that have false thens. However, this is not as trivial as the when=false trim PR, and I actually think it could lead to longer case statements - you would have to fold in the when of the trimmed branch into all the following whens, as they would not have matched if the trimmed when had matched. Else is also complex in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense -- let's keep the check then
// ELSE false | ||
// END) | ||
// | ||
// Note: the rationale for this rewrite is that the case can then be further |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why rewrite adding a NOT
make it simpler / easier to simplify?
In recent days we have done several other CASE simplifications related to constants such as
We have one more open (perhaps you can help nudge it along) which I think would result in simplifying the expressions
- Simplify
CASE WHEN false THEN ...
#17590. Optimize CASE expressions by removing WHEN false branches #17628
Once #17628 is complete, do you think this rewrite would still add value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets say we have:
CASE
WHEN col_1 THEN true
WHEN col_2 THEN false
...
WHEN col_99 THEN true
ELSE true
END
none of the WHEN can be trimmed, so I am not sure those open prs will help. the naive disaggregation, if we removed the <3 limit will look like:
col1 OR (not col_1 and col_2) or (not col_1 and not col_2 and col_3)...
which is a very long expression indeed (O(n^2) in the number of true branches). instead, if we invert, we can write an expression like this:
not(not col_1 and col_2) -> col_1 or not(col_2)
which is O(n^2) in the number of false branches (a much smaller number)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a great example, maybe we can add it to the comments (as a follow on PR)
// NOT(CASE WHEN c2 != false THEN true ELSE false) | ||
// --> | ||
// NOT(c2) | ||
assert_eq!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is very nice
Co-authored-by: Andrew Lamb <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @jackkleeman -- this is a really nice improvement.
&& when_then_expr.len() < 3 // The rewrite is O(n²) so limit to small number | ||
// The rewrite is O(n²) in general so limit to small number of when-thens that can be true | ||
&& (when_then_expr.len() < 3 // small number of input whens | ||
// or all thens are literal bools and a small number of them are true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense -- let's keep the check then
// ELSE false | ||
// END) | ||
// | ||
// Note: the rationale for this rewrite is that the case can then be further |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a great example, maybe we can add it to the comments (as a follow on PR)
🚀 |
I am interested in predicates like:
We already have a very nice 'case disaggregating' simplifier that can turn cases with boolean outputs into boolean logic statements. This PR adds some new functionality to make that simplifier more effective for literal comparisons:
Literal pushdown
We 'push down' literal comparisons into the case:
This leads to the output of the case statement to be a boolean, enabling the case disaggregation, but it also leads 'then' expressions being simplified into boolean literals, which enables extra optimisations below.
Disaggregate larger case expressions if only a small number of paths can be true
We update the case disaggregation logic to allow >=3 when_then expressions if we know for sure that <3 of the when_thens can ever output true. This is knowable if the thens are literal booleans ie null, false or true - we know that any literal null or false thens will be dropped from the final expression, avoiding the O(n^2) explosion that necessitates the limit in the first place.
Case inversion
While the case disaggregation is awesome for small cases or cases with a small number of literal 'true' outputs, its not good for cases with a small number of literal 'false' outputs but a large number of literal 'true' outputs. This is common for comparisons like (CASE ... != "a") as typically all but one when_then will output 'true'.
In those cases, we invert the case like so:
In doing so, we can turn it into a case that has a small number of literal 'true' outputs which will be picked up by the case disaggregation logic and turned into boolean logic statements. We rely on the fact that NOT(NOT(a)) === a for all literal bools.