@ashdnazg ashdnazg commented Sep 22, 2025

Which issue does this PR close?

N/A

Rationale for this change

Having more data types supported in this code path avoids converting the group by columns to rows during aggregation.

What changes are included in this PR?

Adding boolean support for single and multiple grouping values.

Are these changes tested?

I added unit tests for the multi-value implementation, and both the single-value and multi-value implementations are tested in the SQL logic tests.

Are there any user-facing changes?

No


@alamb alamb left a comment

This looks great @ashdnazg -- thank you

I will try and review it carefully in the next day or two

}
DataType::Utf8 => {
- return Ok(Box::new(GroupValuesByes::<i32>::new(OutputType::Utf8)));
+ return Ok(Box::new(GroupValuesBytes::<i32>::new(OutputType::Utf8)));
Contributor

GroupValuesByes 🤦

@alamb alamb added the performance Make DataFusion faster label Sep 25, 2025
@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Sep 25, 2025

@alamb alamb left a comment

Thank you @ashdnazg -- I reviewed the code and it all looks pretty good to me (follows the existing patterns)

However, I tested this locally with datafusion-cli and couldn't get it to show any difference. Do you have any benchmarks / queries that showed this would be faster?

Here is what I tried:

Table setup

> create or replace table foo as select random() as float_val, random() < 0.5 as bool_val, case (random() * 4)::integer when 0 THEN 'Foo' WHEN 1 then 'bar' when 2 then 'baz' else 'grogo' end as str_val  from generate_series(1, 1000000000);
0 row(s) fetched.
Elapsed 5.596 seconds.

> select * from foo limit 10;
+----------------------+----------+---------+
| float_val            | bool_val | str_val |
+----------------------+----------+---------+
| 0.003644060055853049 | true     | grogo   |
| 0.5430315943491387   | false    | Foo     |
| 0.0635246361601266   | false    | baz     |
| 0.09122350127376644  | true     | Foo     |
| 0.4325015383008821   | true     | baz     |
| 0.10141972676501176  | true     | Foo     |
| 0.6749389965920886   | false    | grogo   |
| 0.6319317066473308   | false    | grogo   |
| 0.07946106391385499  | true     | bar     |
| 0.5026330571571326   | true     | Foo     |
+----------------------+----------+---------+
10 row(s) fetched.
Elapsed 0.071 seconds.

Test query

And then ran

> select avg(float_val), bool_val, str_val from foo GROUP BY bool_val, str_val;

Results

  • main: 1.645 seconds
  • This branch: 1.695 seconds

(basically no change)

🤔

nulls: MaybeNullBufferBuilder,
}

impl<const NULLABLE: bool> BooleanGroupValueBuilder<NULLABLE> {
Contributor

this code is pretty similar to PrimitiveGroupValueBuilder but I don't see any obvious way to improve that

Contributor Author

Agreed. It's a recurring theme, I feel, that the boolean implementations are very similar to the primitive ones.

@ashdnazg

Thank you for benchmarking! I thought I had some promising results, but maybe I was mistaken.
I'll look into it.


@kosiew kosiew left a comment

Left a question.

Comment on lines 190 to 192
new_builder.finish();

Arc::new(BooleanArray::new(new_builder.finish(), first_n_nulls))
Contributor

Why call new_builder.finish() twice?

Contributor Author

Thanks for catching that! I'll add a test to verify that take_n works correctly.
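
For context on why the double call matters, here is a minimal sketch. It assumes that finish() hands out the accumulated values and leaves the builder empty, which is my understanding of how arrow's boolean buffer builder behaves, but treat that as an assumption. ToyBoolBuilder and finish_twice are hypothetical stand-ins, not DataFusion or arrow code.

```rust
/// Hypothetical stand-in for a builder whose `finish` drains its contents.
#[derive(Default)]
struct ToyBoolBuilder {
    values: Vec<bool>,
}

impl ToyBoolBuilder {
    /// Appends one value to the pending buffer.
    fn append(&mut self, v: bool) {
        self.values.push(v);
    }

    /// Returns everything appended so far and resets the builder to empty.
    fn finish(&mut self) -> Vec<bool> {
        std::mem::take(&mut self.values)
    }
}

/// Mirrors the shape of the snippet under review: `finish` is called once
/// with the result discarded, then again to build the output.
fn finish_twice(input: &[bool]) -> (Vec<bool>, Vec<bool>) {
    let mut builder = ToyBoolBuilder::default();
    for &v in input {
        builder.append(v);
    }
    let first = builder.finish(); // takes the values (discarded in the PR snippet)
    let second = builder.finish(); // builder is already empty at this point
    (first, second)
}
```

Under that assumption the second call returns an empty buffer, so an array built from it would not contain the first n values. A take_n test that checks the returned values would catch this.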

lhs_rows: &[usize],
array: &ArrayRef,
rhs_rows: &[usize],
equal_to_results: &mut [bool],
@rluvaton rluvaton Sep 29, 2025

Because this is a slice and not a buffer, it limits optimizations such as the one I am working on, which creates an optimized version for the all-unique case. For example, for non-nullable arrays, checking whether two arrays are the same is a simple NOT XOR.
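
To illustrate the NOT XOR idea, here is a rough sketch on plain u64 words rather than on any arrow or DataFusion type. The function names (packed_equal, rowwise_equal, pack) are made up for this example, and the bit layout (least significant bit first) is an assumption.

```rust
/// Bit i of the result is 1 exactly when bit i of `lhs` equals bit i of `rhs`.
/// This compares 64 non-nullable booleans in one operation.
fn packed_equal(lhs: u64, rhs: u64) -> u64 {
    !(lhs ^ rhs)
}

/// Row-by-row comparison that writes into a `&mut [bool]`, similar in spirit
/// to filling a results slice one element at a time.
fn rowwise_equal(lhs: &[bool], rhs: &[bool], out: &mut [bool]) {
    for ((l, r), o) in lhs.iter().zip(rhs).zip(out.iter_mut()) {
        *o = l == r;
    }
}

/// Packs up to 64 booleans into one word, least significant bit first.
fn pack(bits: &[bool]) -> u64 {
    assert!(bits.len() <= 64, "pack handles at most 64 values");
    bits.iter()
        .enumerate()
        .fold(0u64, |acc, (i, &b)| acc | ((b as u64) << i))
}
```

Both paths give the same answer for each row. The packed form is only available when the inputs and the results are stored as bits, which is why a slice of bool in the signature rules it out.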

Contributor Author

That's beyond the scope of this PR

lhs_rows: &[usize],
array: &ArrayRef,
rhs_rows: &[usize],
equal_to_results: &mut [bool],
Contributor

I will try to change it to MutableBooleanBuffer or something in the future to allow for more optimizations

Contributor Author

Same

@ashdnazg

@alamb
I added a boolean column to Query 10 in h2o_medium:

SELECT id1, id2, id3, id4, id5, id6, SUM(v3) AS v3, COUNT(*) AS count, (v1 % 2)=0 AS b1 FROM x GROUP BY id1, id2, id3, id4, id5, id6, b1;

And I get the following results locally

Without the boolean impl:

Query 1 iteration 1 took 8836.5 ms and returned 100000000 rows
Query 1 iteration 2 took 8293.7 ms and returned 100000000 rows
Query 1 iteration 3 took 7957.9 ms and returned 100000000 rows
Query 1 avg time: 8362.69 ms

With the boolean impl

Query 1 iteration 1 took 5202.1 ms and returned 100000000 rows
Query 1 iteration 2 took 4666.2 ms and returned 100000000 rows
Query 1 iteration 3 took 4577.2 ms and returned 100000000 rows
Query 1 avg time: 4815.19 ms

In general the multi group by option seems to make certain scenarios worse.
I added a flag for easy control of whether it's used in this commit: ashdnazg@6a0b13b
and then checked:

create or replace table foo as select (random() * 4)::integer as int_val, (random() * 4)::integer as int_val2, (random() * 4)::integer as int_val3  from generate_series(1, 1000000000);
set datafusion.execution.enable_multi_group_by to false;
select int_val, int_val2, int_val3 from foo GROUP BY int_val, int_val2, int_val3;
set datafusion.execution.enable_multi_group_by to true;
select int_val, int_val2, int_val3 from foo GROUP BY int_val, int_val2, int_val3;

I get 1.037 seconds elapsed with the flag set to false and 1.382 seconds elapsed with the flag set to true.

Perhaps the logic for when to use it should be more careful than just "whenever it's supported". But I suspect this PR is not the right spot for that discussion.
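
For anyone who wants to redo the arithmetic on the timings above, here is a small sketch. The helper names (mean, ratio) are made up for this example, and the numbers are copied from the runs reported in this comment.

```rust
/// Arithmetic mean of a set of timings.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Ratio of mean baseline time to mean candidate time.
/// Values above 1.0 mean the candidate is faster, below 1.0 slower.
fn ratio(baseline: &[f64], candidate: &[f64]) -> f64 {
    mean(baseline) / mean(candidate)
}

/// h2o-style query with the extra boolean column, times in ms.
fn h2o_ratio() -> f64 {
    ratio(&[8836.5, 8293.7, 7957.9], &[5202.1, 4666.2, 4577.2])
}

/// Three-integer GROUP BY, flag off (baseline) vs flag on, times in seconds.
fn int_group_by_ratio() -> f64 {
    ratio(&[1.037], &[1.382])
}
```

With these inputs, h2o_ratio() comes out at roughly 1.74 (the boolean implementation is faster), and int_group_by_ratio() at roughly 0.75 (the multi group by path is about 33% slower in that scenario).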


4 participants