Description
Is your feature request related to a problem or challenge?
As always, I would like faster aggregation performance.
Describe the solution you'd like
ClickBench Q17 and Q18 include:

```sql
SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;
SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" LIMIT 10;
SELECT "UserID", extract(minute FROM to_timestamp_seconds("EventTime")) AS m, "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", m, "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;
```
Here `UserID` is an `Int64` and `SearchPhrase` is a string:

```text
DataFusion CLI v36.0.0
❯ describe 'hits.parquet';
+-----------------------+-----------+-------------+
| column_name           | data_type | is_nullable |
+-----------------------+-----------+-------------+
...
| UserID                | Int64     | NO          |
...
| SearchPhrase          | Utf8      | NO          |
...
+-----------------------+-----------+-------------+
105 rows in set. Query took 0.035 seconds.
```
In some profiling of Q19, `SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;`, I found that 20-30% of the time is spent converting Array --> Row or Row --> Array.

Thus I think adding some special handling for variable-length vs. fixed-length data in the group management may help.
Background
`GroupValuesRows`, used for the queries above, is here: https://github.com/apache/arrow-datafusion/blob/edec4189242ab07ac65967490537d77e776aad5c/datafusion/physical-plan/src/aggregates/group_values/row.rs#L32
Given a query like `SELECT ... GROUP BY i1, i2, s1`, where `i1` and `i2` are integer columns and `s1` is a string column, the input looks like this:
```text
                  ┌─────┐   ┌─────────┐
                  │  0  │   │TheQuickB│
┌─────┐  ┌─────┐  ├─────┤   │rownFox..│
│  1  │  │ 10  │  │ 100 │   │.FooSomeO│   In the input Arrow Arrays, variable
├─────┤  ├─────┤  ├─────┤   │therVeryL│   length columns have offsets into other
│  2  │  │ 20  │  │ 103 │   │argeStrin│   buffers
├─────┤  ├─────┤  ├─────┤   │gs       │
│  5  │  │ 50  │  │ 300 │   │         │
├─────┤  ├─────┤  ├─────┤   └─────────┘
│ ... │  │ ... │  │ ... │    data (s1)
├─────┤  ├─────┤  ├─────┤
│  6  │  │ 60  │  │ 600 │
└─────┘  └─────┘  └─────┘
              offsets (s1)
  i1       i2       s1
```
`GroupValuesRows` will do:

```text
┌────────────────────────────┐
│1|10|TheQuickBrownFox....   │
└────────────────────────────┘   In GroupValuesRows, each input row is
                                 copied into Row format (simplified
┌───────────┐                    version shown here), including the
│2|20|Foo   │                    entire string content
└───────────┘

┌────────────────────────────────────┐
│3|30|SomeOtherVeryLargeString...    │
└────────────────────────────────────┘
```
One downside of this approach is that for "large" strings, a substantial amount of copying is required simply to check whether the group is already present.
Describe alternatives you've considered
The idea is to use a modified version of the group keys where the fixed-length part still uses the row format, but the variable-length columns use an approach like the one in `GroupValuesByes`.
Something like:

```text
┌────┐    ┌────────────────┐
│1|10│    │ offset 0       │    ┌─────────┐
└────┘    │ len 3          │    │FooTheQui│
          └────────────────┘    │ckBrownFo│
┌────┐    ┌────────────────┐    │x...SomeO│
│2|20│    │ offset 3       │    │therVeryL│
└────┘    │ len 100        │    │argeStrin│
          └────────────────┘    │gs       │
┌────┐                          │         │
│3|30│    ┌────────────────┐    └─────────┘
└────┘    │ offset 103     │      data
          │ len 200        │
Use Rows  └────────────────┘    out of
for                             line
fixed     offsets +             buffer
part      lengths for each
          variable part
```
Additional context
No response