Skip to content

Conversation

jackkleeman
Copy link
Contributor

I am interested in predicates like:

CASE WHEN X THEN "a" WHEN Y THEN "b" ... END = "a"
CASE WHEN X THEN "a" WHEN Y THEN "b" ... END != "a"

We already have a very nice 'case disaggregating' simplifier that can turn cases with boolean outputs into boolean logic statements. This PR adds some new functionality to make that simplifier more effective for literal comparisons:

Literal pushdown

We 'push down' literal comparisons into the case:

CASE WHEN X THEN "a" WHEN Y THEN "b" ... END = "a"
->
CASE WHEN X THEN "a"  = "a" WHEN Y THEN "b" = "a" ... END

This leads to the output of the case statement to be a boolean, enabling the case disaggregation, but it also leads 'then' expressions being simplified into boolean literals, which enables extra optimisations below.

Disaggregate larger case expressions if only a small number of paths can be true

We update the case disaggregation logic to allow >=3 when_then expressions if we know for sure that <3 of the when_thens can ever output true. This is knowable if the thens are literal booleans ie null, false or true - we know that any literal null or false thens will be dropped from the final expression, avoiding the O(n^2) explosion that necessitates the limit in the first place.

Case inversion

While the case disaggregation is awesome for small cases or cases with a small number of literal 'true' outputs, its not good for cases with a small number of literal 'false' outputs but a large number of literal 'true' outputs. This is common for comparisons like (CASE ... != "a") as typically all but one when_then will output 'true'.

In those cases, we invert the case like so:

CASE WHEN X THEN true WHEN Y THEN false ... END
->
NOT(CASE WHEN X THEN NOT(true) WHEN Y THEN NOT(false) ... END)

In doing so, we can turn it into a case that has a small number of literal 'true' outputs which will be picked up by the case disaggregation logic and turned into boolean logic statements. We rely on the fact that NOT(NOT(a)) === a for all literal bools.

@github-actions github-actions bot added the optimizer Optimizer rules label Sep 23, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jackkleeman -- this looks very cool and a very nice feature 👌

I had some code suggestions, and some questions about the details of one of the rewrites, but otherwise I think it is very close.

A very nice contribution indeed

Literal pushdown

Interestingly, I would probably have called this "pull up" as it sort of inlines the equality into the THEN expressions. No need to change anything

&& when_then_expr.len() < 3 // The rewrite is O(n²) so limit to small number
// The rewrite is O(n²) in general so limit to small number of when-thens that can be true
&& (when_then_expr.len() < 3 // small number of input whens
// or all thens are literal bools and a small number of them are true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like it would be valuable to do this rewrite regardless of the terms as long as all the arguments are boolean literals (why still limit to three true cases?)

Copy link
Contributor Author

@jackkleeman jackkleeman Sep 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the comment about it being O(n^2) is very legit, and this is also true in the literal bool case; the thens are literal bools but we dont know anything about the whens and whether they might be reducible. Indeed, in my target use case, the thens are not reducible at all and we definitely end up with On^2 boolean expressions. The intent of this check is to maintain the '<3 when then' limitation, but just be cleverer about how we count when thens, ignoring those where the then is a literal false

We could, if we wanted, split this out into a separate rewrite that removes when_thens that have false thens. However, this is not as trivial as the when=false trim PR, and I actually think it could lead to longer case statements - you would have to fold in the when of the trimmed branch into all the following whens, as they would not have matched if the trimmed when had matched. Else is also complex in this case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense -- let's keep the check then

// ELSE false
// END)
//
// Note: the rationale for this rewrite is that the case can then be further
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why rewrite adding a NOT make it simpler / easier to simplify?

In recent days we have done several other CASE simplifications related to constants such as

We have one more open (perhaps you can help nudge it along) which I think would result in simplifying the expressions

Once #17628 is complete, do you think this rewrite would still add value?

Copy link
Contributor Author

@jackkleeman jackkleeman Sep 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets say we have:

CASE
  WHEN col_1 THEN true
  WHEN col_2 THEN false
...
  WHEN col_99 THEN true
  ELSE true
END

none of the WHEN can be trimmed, so I am not sure those open prs will help. the naive disaggregation, if we removed the <3 limit will look like:

col1 OR (not col_1 and col_2) or (not col_1 and not col_2 and col_3)... 

which is a very long expression indeed (O(n^2) in the number of true branches). instead, if we invert, we can write an expression like this:

not(not col_1 and col_2) -> col_1 or not(col_2)

which is O(n^2) in the number of false branches (a much smaller number)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a great example, maybe we can add it to the comments (as a follow on PR)

// NOT(CASE WHEN c2 != false THEN true ELSE false)
// -->
// NOT(c2)
assert_eq!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is very nice

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jackkleeman -- this is a really nice improvement.

&& when_then_expr.len() < 3 // The rewrite is O(n²) so limit to small number
// The rewrite is O(n²) in general so limit to small number of when-thens that can be true
&& (when_then_expr.len() < 3 // small number of input whens
// or all thens are literal bools and a small number of them are true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense -- let's keep the check then

// ELSE false
// END)
//
// Note: the rationale for this rewrite is that the case can then be further
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a great example, maybe we can add it to the comments (as a follow on PR)

@alamb alamb added the performance Make DataFusion faster label Sep 25, 2025
@alamb alamb added this pull request to the merge queue Sep 26, 2025
@alamb
Copy link
Contributor

alamb commented Sep 26, 2025

🚀

Merged via the queue into apache:main with commit 783c45b Sep 26, 2025
28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
optimizer Optimizer rules performance Make DataFusion faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants