
Improve performance for grouping by variable length columns (strings) #9403

Closed
@alamb

Is your feature request related to a problem or challenge?

As always, I would like faster aggregation performance.

Describe the solution you'd like

ClickBench queries such as Q17 and Q18 include:

SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;
SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" LIMIT 10;
SELECT "UserID", extract(minute FROM to_timestamp_seconds("EventTime")) AS m, "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", m, "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;

These queries group by an Int64 column and a string (Utf8) column:

DataFusion CLI v36.0.0
❯ describe 'hits.parquet';
+-----------------------+-----------+-------------+
| column_name           | data_type | is_nullable |
+-----------------------+-----------+-------------+
...
| UserID                | Int64     | NO          |
...
| SearchPhrase          | Utf8      | NO          |
...
+-----------------------+-----------+-------------+
105 rows in set. Query took 0.035 seconds.

In some profiling of Q19 (SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;), I found that 20-30% of the time is spent converting from Array --> Row or Row --> Array.

Thus I think adding special handling for variable length data vs fixed length data in the group management may help.

Background

GroupValuesRows, used for the queries above, is here:
https://github.com/apache/arrow-datafusion/blob/edec4189242ab07ac65967490537d77e776aad5c/datafusion/physical-plan/src/aggregates/group_values/row.rs#L32

Given a query like SELECT ... GROUP BY i1, i2, s1, where i1 and i2 are integer columns and s1 is a string column, the input looks like this:

                       ┌─────┐ ┌─────────┐                                                       
                       │  0  │ │TheQuickB│                                                       
┌─────┐   ┌─────┐      ├─────┤ │rownFox..│                                                       
│  1  │   │ 10  │      │ 100 │ │.FooSomeO│               In the input Arrow Arrays, variable     
├─────┤   ├─────┤      ├─────┤ │therVeryL│               length columns have offsets into other  
│  2  │   │ 20  │      │ 103 │ │argeStrin│               buffers                                 
├─────┤   ├─────┤      ├─────┤ │gs       │                                                       
│  5  │   │ 50  │      │ 300 │ │         │                                                       
├─────┤   ├─────┤      ├─────┤ └─────────┘                                                       
│ ... │   │ ... │      │ ... │  data (s1)                                                        
├─────┤   ├─────┤      ├─────┤                                                                   
│  6  │   │ 60  │      │ 600 │                                                                   
└─────┘   └─────┘      └─────┘                                                                   
                       offsets (s1)                                                              
                                                                                                 
   i1        i2                s1                                                                
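
To make the offsets/data layout in the diagram concrete, here is a minimal sketch in plain Rust. `StringColumn` is a hypothetical stand-in written for illustration, not the arrow-rs `StringArray` API; it only assumes the layout shown above, where value `i` occupies the bytes between two adjacent offsets.

```rust
/// Hypothetical, simplified stand-in for an Arrow Utf8 column
/// (illustration only, not the arrow-rs API).
pub struct StringColumn {
    /// Value `i` occupies `data[offsets[i]..offsets[i + 1]]`.
    pub offsets: Vec<usize>,
    /// All string bytes, stored back to back.
    pub data: Vec<u8>,
}

impl StringColumn {
    /// Build the column by appending each value to one shared buffer.
    pub fn from_values(values: &[&str]) -> Self {
        let mut offsets = vec![0];
        let mut data = Vec::new();
        for v in values {
            data.extend_from_slice(v.as_bytes());
            offsets.push(data.len());
        }
        Self { offsets, data }
    }

    /// Number of values in the column.
    pub fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Borrow value `i` from the shared buffer without copying it.
    pub fn value(&self, i: usize) -> &[u8] {
        &self.data[self.offsets[i]..self.offsets[i + 1]]
    }
}
```

Reading a value is just a slice of the shared buffer, so no per-row allocation or copy is needed until the data is converted to another format.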

GroupValuesRows will do the following:

┌────────────────────────────┐                                                                   
│1|10|TheQuickBrownFox....   │                                                                   
└────────────────────────────┘                           In GroupValuesRows, each input row is   
                                                         copied into Row format (simplified      
┌───────────┐                                            version shown here), including the      
│2|20|Foo   │                                            entire string content                   
└───────────┘                                                                                    
                                                                                                 
┌────────────────────────────────────┐                                                           
│3|30|SomeOtherVeryLargeString...    │                                                           
└────────────────────────────────────┘                                                           

One downside of this approach is that for "large" strings, a substantial amount of copying is required simply to check if the group is already present.
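
The copying cost can be illustrated with a rough sketch. The encoding below is a deliberately simplified assumption (two big-endian integers followed by the raw string bytes), not the real Arrow Row format, and `encode_row` and `group_by_rows` are hypothetical names. The point is only that every input row is fully materialized, string included, before the lookup can happen.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified row encoding: fixed-width integers followed by
/// the full string bytes. The real Row format is more involved.
pub fn encode_row(i1: i64, i2: i64, s1: &str) -> Vec<u8> {
    let mut row = Vec::with_capacity(16 + s1.len());
    row.extend_from_slice(&i1.to_be_bytes());
    row.extend_from_slice(&i2.to_be_bytes());
    // The entire string is copied for every input row, new group or not.
    row.extend_from_slice(s1.as_bytes());
    row
}

/// Assign a group index to each input row.
/// Returns the per-row group indices and the total number of bytes copied
/// into row buffers while doing so.
pub fn group_by_rows(input: &[(i64, i64, &str)]) -> (Vec<usize>, usize) {
    let mut groups: HashMap<Vec<u8>, usize> = HashMap::new();
    let mut bytes_copied = 0;
    let mut group_ids = Vec::with_capacity(input.len());
    for &(i1, i2, s1) in input {
        let row = encode_row(i1, i2, s1);
        bytes_copied += row.len();
        // New keys get the next free index; existing keys reuse theirs.
        let next_id = groups.len();
        let id = *groups.entry(row).or_insert(next_id);
        group_ids.push(id);
    }
    (group_ids, bytes_copied)
}
```

In this sketch a repeated key still pays for a full copy of its string, even though the copy is discarded as soon as the existing group is found.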

Describe alternatives you've considered

The idea is to use a modified version of the group keys where the fixed length part still uses row format, but the variable length columns use an approach like the one in GroupValuesByes.

Something like:

 ┌────┐   ┌────────────────┐                        
 │1|10│   │    offset 0    │             ┌─────────┐
 └────┘   │     len 3      │             │FooTheQui│
          └────────────────┘             │ckBrownFo│
 ┌────┐   ┌────────────────┐             │x...SomeO│
 │2|20│   │    offset 3    │             │therVeryL│
 └────┘   │    len 100     │             │argeStrin│
          └────────────────┘             │gs       │
 ┌────┐                                  │         │
 │3|30│   ┌────────────────┐             └─────────┘
 └────┘   │   offset 103   │                data    
          │    len 200     │                        
Use Rows  └────────────────┘               out of   
  for                                       line    
 fixed        offsets +                    buffer   
  part     lengths for each                 for     
            variable part                           
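
One way the layout in the diagram might be sketched in plain Rust is shown below. Everything here is an assumption made for illustration: `SplitGroupValues`, its fields, and `intern` are hypothetical names, a tuple stands in for the Row-encoded fixed part, and a `HashMap` from hash to candidate group indices stands in for whatever hash table an actual implementation would use.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical group store (not DataFusion's actual types).
#[derive(Default)]
pub struct SplitGroupValues {
    /// Fixed-length part of each group key (stand-in for the Row-encoded part).
    pub fixed: Vec<(i64, i64)>,
    /// (offset, len) of each group's string inside `data`.
    pub var: Vec<(usize, usize)>,
    /// Out-of-line buffer holding each distinct group's string exactly once.
    pub data: Vec<u8>,
    /// Hash of the full key -> indices of the groups that share that hash.
    map: HashMap<u64, Vec<usize>>,
}

impl SplitGroupValues {
    /// Borrow the string bytes stored for an existing group.
    pub fn string_of(&self, group: usize) -> &[u8] {
        let (offset, len) = self.var[group];
        &self.data[offset..offset + len]
    }

    /// Return the group index for the key, creating a new group if needed.
    pub fn intern(&mut self, i1: i64, i2: i64, s1: &[u8]) -> usize {
        // Hash the key directly from the borrowed input; nothing is copied.
        let mut hasher = DefaultHasher::new();
        (i1, i2, s1).hash(&mut hasher);
        let hash = hasher.finish();

        // Compare against existing groups in place.
        if let Some(candidates) = self.map.get(&hash) {
            for &g in candidates {
                if self.fixed[g] == (i1, i2) && self.string_of(g) == s1 {
                    return g;
                }
            }
        }

        // Only a previously unseen group copies its string into the buffer.
        let g = self.fixed.len();
        self.fixed.push((i1, i2));
        self.var.push((self.data.len(), s1.len()));
        self.data.extend_from_slice(s1);
        self.map.entry(hash).or_default().push(g);
        g
    }
}
```

Under these assumptions, checking whether a group already exists hashes and compares the borrowed input bytes, and string data is copied only once per distinct group rather than once per input row.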
                                                    

Additional context

No response

Labels: enhancement (New feature or request)
