perf: boolean group values implementations #17726

ashdnazg · 2025-09-22T15:48:28Z

Which issue does this PR close?

N/A

Rationale for this change

Having more data types supported in this code path avoids converting the group by columns to rows during aggregation.

What changes are included in this PR?

Adding boolean support for single and multiple grouping values.
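
As a rough illustration of why a dedicated boolean path can skip the row conversion: a boolean grouping column has at most three distinct keys (false, true, null), so group ids can be kept in three slots with no hashing. The sketch below is a standalone toy, not the code from this PR. `BooleanGroups` and `intern` are hypothetical names, and it works on a plain `&[Option<bool>]` instead of Arrow arrays.

```rust
/// Hypothetical sketch: assigns dense group ids to a nullable boolean column.
#[derive(Default)]
struct BooleanGroups {
    false_group: Option<usize>,
    true_group: Option<usize>,
    null_group: Option<usize>,
    /// Distinct keys in order of first appearance; the index is the group id.
    values: Vec<Option<bool>>,
}

impl BooleanGroups {
    /// Returns one group id per input row, creating a group on first sight.
    fn intern(&mut self, column: &[Option<bool>]) -> Vec<usize> {
        let mut ids = Vec::with_capacity(column.len());
        for &v in column {
            let next_id = self.values.len();
            // Pick the slot that tracks this key (false, true, or null).
            let slot = match v {
                Some(false) => &mut self.false_group,
                Some(true) => &mut self.true_group,
                None => &mut self.null_group,
            };
            let existing = *slot;
            let id = match existing {
                Some(id) => id,
                None => {
                    *slot = Some(next_id);
                    self.values.push(v);
                    next_id
                }
            };
            ids.push(id);
        }
        ids
    }
}
```

Calling `intern` on `[Some(true), None, Some(false), Some(true), None]` assigns ids in order of first appearance, so repeated keys map back to the id they received the first time.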

Are these changes tested?

I added unit tests for the multi-value implementation, and both the single-value and multi-value implementations are covered by the SQL logic tests.

Are there any user-facing changes?

No

alamb

This looks great @ashdnazg -- thank you

I will try and review it carefully in the next day or two

alamb · 2025-09-23T17:39:57Z

datafusion/physical-plan/src/aggregates/group_values/mod.rs

            }
            DataType::Utf8 => {
-                return Ok(Box::new(GroupValuesByes::<i32>::new(OutputType::Utf8)));
+                return Ok(Box::new(GroupValuesBytes::<i32>::new(OutputType::Utf8)));


GroupValuesByes 🤦

alamb

Thank you @ashdnazg -- I reviewed the code and it all looks pretty good to me (follows the existing patterns)

However, I tested this locally with datafusion-cli and couldn't get it to show any difference. Do you have any benchmarks / queries that showed this would be faster?

Here is what I tried:

Table setup

> create or replace table foo as select random() as float_val, random() < 0.5 as bool_val, case (random() * 4)::integer when 0 THEN 'Foo' WHEN 1 then 'bar' when 2 then 'baz' else 'grogo' end as str_val  from generate_series(1, 1000000000);
0 row(s) fetched.
Elapsed 5.596 seconds.

> select * from foo limit 10;
+----------------------+----------+---------+
| float_val            | bool_val | str_val |
+----------------------+----------+---------+
| 0.003644060055853049 | true     | grogo   |
| 0.5430315943491387   | false    | Foo     |
| 0.0635246361601266   | false    | baz     |
| 0.09122350127376644  | true     | Foo     |
| 0.4325015383008821   | true     | baz     |
| 0.10141972676501176  | true     | Foo     |
| 0.6749389965920886   | false    | grogo   |
| 0.6319317066473308   | false    | grogo   |
| 0.07946106391385499  | true     | bar     |
| 0.5026330571571326   | true     | Foo     |
+----------------------+----------+---------+
10 row(s) fetched.
Elapsed 0.071 seconds.

Test query

And then ran

> select avg(float_val), bool_val, str_val from foo GROUP BY bool_val, str_val;

results

main: 1.645 seconds
This branch: 1.695 seconds

(basically no change)

🤔

alamb · 2025-09-25T18:08:54Z

datafusion/physical-plan/src/aggregates/group_values/multi_group_by/boolean.rs

+    nulls: MaybeNullBufferBuilder,
+}
+
+impl<const NULLABLE: bool> BooleanGroupValueBuilder<NULLABLE> {


This code is pretty similar to PrimitiveGroupValueBuilder, but I don't see any obvious way to improve that.

Agreed. It's a recurring theme, I feel, that the boolean implementations are very similar to the primitive ones.
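
For readers unfamiliar with the `const NULLABLE: bool` parameter in the hunk above, here is a minimal sketch of the general pattern: the flag is known at compile time, so the non-nullable variant can skip null bookkeeping entirely. This is an assumption-laden toy, not the PR's `BooleanGroupValueBuilder`. `BoolColumnBuilder`, its fields, and its methods are hypothetical, and it stores values in `Vec<bool>` instead of Arrow buffers.

```rust
/// Hypothetical column builder; `NULLABLE` is a compile-time flag.
struct BoolColumnBuilder<const NULLABLE: bool> {
    values: Vec<bool>,
    /// Validity flags; stays empty when `NULLABLE` is false.
    validity: Vec<bool>,
}

impl<const NULLABLE: bool> BoolColumnBuilder<NULLABLE> {
    fn new() -> Self {
        Self { values: Vec::new(), validity: Vec::new() }
    }

    /// Appends one row; nulls are only tracked in the nullable variant.
    fn append(&mut self, value: Option<bool>) {
        if NULLABLE {
            self.validity.push(value.is_some());
            self.values.push(value.unwrap_or(false));
        } else {
            // Assumption: the caller guarantees a non-nullable column has no nulls.
            self.values.push(value.expect("non-nullable column received a null"));
        }
    }

    /// Compares stored row `row` with an incoming value.
    fn equal_to(&self, row: usize, value: Option<bool>) -> bool {
        if NULLABLE {
            let stored = self.validity[row].then(|| self.values[row]);
            stored == value
        } else {
            Some(self.values[row]) == value
        }
    }
}
```

Because `NULLABLE` is a constant, the compiler can drop the unused branch in each instantiation, which is also why the boolean and primitive builders end up with such similar shapes.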

ashdnazg · 2025-09-25T19:49:15Z

Thank you for benchmarking! I thought I had some promising results, but maybe I was mistaken.
I'll look into it.

kosiew

Left a question.

kosiew · 2025-09-27T03:40:16Z

datafusion/physical-plan/src/aggregates/group_values/multi_group_by/boolean.rs

+        new_builder.finish();
+
+        Arc::new(BooleanArray::new(new_builder.finish(), first_n_nulls))


Why call new_builder.finish() twice?

Thanks for catching that! I'll add a test to verify that take_n works correctly.
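
To show why a duplicated `finish()` is worth flagging, here is a toy model that assumes `finish` hands over the accumulated contents and resets the builder. `ToyBoolBuilder` and both helper functions are hypothetical and only model that assumption. They are not the Arrow builder used in the diff, and the thread does not say what effect the extra call had in the PR.

```rust
/// Toy stand-in for a buffer builder whose `finish` resets the builder.
#[derive(Default)]
struct ToyBoolBuilder {
    bits: Vec<bool>,
}

impl ToyBoolBuilder {
    fn append(&mut self, v: bool) {
        self.bits.push(v);
    }

    /// Returns the accumulated values and leaves the builder empty.
    fn finish(&mut self) -> Vec<bool> {
        std::mem::take(&mut self.bits)
    }
}

/// Mirrors the shape of the reviewed snippet: the first result is discarded.
fn take_with_extra_finish(input: &[bool]) -> Vec<bool> {
    let mut builder = ToyBoolBuilder::default();
    for &v in input {
        builder.append(v);
    }
    builder.finish(); // result dropped, builder is now empty
    builder.finish() // under the reset assumption this is an empty buffer
}

/// Same logic with a single `finish` call.
fn take_with_single_finish(input: &[bool]) -> Vec<bool> {
    let mut builder = ToyBoolBuilder::default();
    for &v in input {
        builder.append(v);
    }
    builder.finish()
}
```

Under this model the two-call version silently returns no values, which is the kind of regression a dedicated `take_n` test would catch.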

rluvaton · 2025-09-29T11:55:14Z

datafusion/physical-plan/src/aggregates/group_values/multi_group_by/boolean.rs

+        lhs_rows: &[usize],
+        array: &ArrayRef,
+        rhs_rows: &[usize],
+        equal_to_results: &mut [bool],


Because equal_to_results is a slice and not a buffer, it limits the optimizations I can apply in my work on an optimized version for the all-unique case. For example, for non-nullable columns, checking whether two arrays are equal is a simple NOT XOR.

That's beyond the scope of this PR

rluvaton · 2025-09-29T12:00:13Z

datafusion/physical-plan/src/aggregates/group_values/multi_group_by/boolean.rs

+        lhs_rows: &[usize],
+        array: &ArrayRef,
+        rhs_rows: &[usize],
+        equal_to_results: &mut [bool],


I will try to change it to MutableBooleanBuffer or something in the future to allow for more optimizations
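
The NOT XOR remark can be sketched as follows. With one `bool` per row, equality is checked row by row. With bit-packed words, `!(lhs ^ rhs)` yields an equality mask for 64 rows at once. Both functions are hypothetical illustrations: they ignore the `lhs_rows` / `rhs_rows` indirection in the real signature and assume the two inputs are already row-aligned and non-nullable.

```rust
/// Row-at-a-time comparison, accumulating into a `&mut [bool]` result slice.
fn equal_rows(lhs: &[bool], rhs: &[bool], equal_to_results: &mut [bool]) {
    for ((l, r), out) in lhs.iter().zip(rhs).zip(equal_to_results.iter_mut()) {
        *out = *out && (l == r);
    }
}

/// Word-at-a-time comparison over bit-packed values.
/// Bit i of each output word stays 1 only if bit i of `lhs` equals bit i of `rhs`.
fn equal_words(lhs: &[u64], rhs: &[u64], equal_to_results: &mut [u64]) {
    for ((l, r), out) in lhs.iter().zip(rhs).zip(equal_to_results.iter_mut()) {
        *out &= !(l ^ r);
    }
}
```

The word-wise form only applies when the result itself is bit-packed, which is the motivation given above for moving from a slice to a boolean buffer.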

ashdnazg · 2025-09-29T14:20:25Z

@alamb
I added a boolean column to Query 10 in h2o_medium:

SELECT id1, id2, id3, id4, id5, id6, SUM(v3) AS v3, COUNT(*) AS count, (v1 % 2)=0 AS b1 FROM x GROUP BY id1, id2, id3, id4, id5, id6, b1;

And I get the following results locally

Without the boolean impl:

Query 1 iteration 1 took 8836.5 ms and returned 100000000 rows
Query 1 iteration 2 took 8293.7 ms and returned 100000000 rows
Query 1 iteration 3 took 7957.9 ms and returned 100000000 rows
Query 1 avg time: 8362.69 ms

With the boolean impl:

Query 1 iteration 1 took 5202.1 ms and returned 100000000 rows
Query 1 iteration 2 took 4666.2 ms and returned 100000000 rows
Query 1 iteration 3 took 4577.2 ms and returned 100000000 rows
Query 1 avg time: 4815.19 ms

In general the multi group by option seems to make certain scenarios worse.
I added a flag for easy control of whether it's used in this commit: ashdnazg@6a0b13b
and then checked:

create or replace table foo as select (random() * 4)::integer as int_val, (random() * 4)::integer as int_val2, (random() * 4)::integer as int_val3  from generate_series(1, 1000000000);
set datafusion.execution.enable_multi_group_by to false;
select int_val, int_val2, int_val3 from foo GROUP BY int_val, int_val2, int_val3;
set datafusion.execution.enable_multi_group_by to true;
select int_val, int_val2, int_val3 from foo GROUP BY int_val, int_val2, int_val3;

I get "Elapsed 1.037 seconds." with the flag set to false and "Elapsed 1.382 seconds." with the flag set to true.

Perhaps the logic for when to use it should be more careful than just "whenever it's supported". But I suspect this PR is not the right spot for that discussion.
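
For reference, the timings quoted in this comment can be turned into ratios with a small helper. The function names are hypothetical; the inputs are the averages and elapsed times reported above.

```rust
/// How many times faster the candidate is than the baseline.
fn speedup(baseline: f64, candidate: f64) -> f64 {
    baseline / candidate
}

/// Signed percentage change from baseline to candidate (negative means faster).
fn percent_change(baseline: f64, candidate: f64) -> f64 {
    (candidate - baseline) / baseline * 100.0
}
```

With the reported averages, `speedup(8362.69, 4815.19)` is roughly 1.74 for the modified h2o query, while `percent_change(1.037, 1.382)` is roughly +33% for the three-integer-column query when the multi group by path is enabled.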

ashdnazg added 2 commits September 22, 2025 14:58

chore: fix typos in group_values.

24c1aee

perf: add support for boolean columns in single and multi group by va…

126817c

…lues.

alamb reviewed Sep 23, 2025

alamb added the performance Make DataFusion faster label Sep 25, 2025

Merge branch 'main' into multi-group-builders

04ee663

github-actions bot added the physical-plan Changes to the physical-plan crate label Sep 25, 2025

alamb reviewed Sep 25, 2025

kosiew reviewed Sep 27, 2025

rluvaton reviewed Sep 29, 2025

remove extra finish

c98922f
