Closed
Description
This is faster (up to 4x in my earlier tests) than the current implementation, which collects all parts into one "full" partition, for cases with very high cardinality in the aggregate keys (think deduplication workloads). However, for very "simple" aggregates with low cardinality, skipping hash partitioning is faster, as less work needs to be done.
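To illustrate the idea (a minimal sketch, not the actual DataFusion implementation; the function names are hypothetical): rows are routed to partitions by the hash of their aggregate key, so each partition can be aggregated independently and the per-partition results can simply be concatenated, since equal keys always land in the same partition.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Route (key, value) rows to `partitions` buckets by key hash, so each
/// bucket can be aggregated independently (e.g. on separate workers).
fn hash_partition(rows: Vec<(String, i64)>, partitions: usize) -> Vec<Vec<(String, i64)>> {
    let mut out: Vec<Vec<(String, i64)>> = vec![Vec::new(); partitions];
    for (key, val) in rows {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        let p = (h.finish() as usize) % partitions;
        out[p].push((key, val));
    }
    out
}

/// Sum values per key within one partition. The per-partition results can
/// be concatenated without a final merge step, because rows with equal
/// keys never end up in different partitions.
fn aggregate(part: &[(String, i64)]) -> HashMap<String, i64> {
    let mut acc = HashMap::new();
    for (key, val) in part {
        *acc.entry(key.clone()).or_insert(0) += val;
    }
    acc
}

fn main() {
    let rows = vec![
        ("a".to_string(), 1),
        ("b".to_string(), 2),
        ("a".to_string(), 3),
    ];
    let parts = hash_partition(rows, 2);
    for (i, part) in parts.iter().enumerate() {
        println!("partition {}: {:?}", i, aggregate(part));
    }
}
```

With high key cardinality this avoids building one giant hash table on a single node; with few distinct keys, the extra hashing and routing is pure overhead, which is why the simple path wins there.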
We probably need some fast way to get a rough estimate of the number of distinct values in the aggregate keys, perhaps determined dynamically from the first batch(es).
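One possible heuristic along those lines (a sketch only; the function name and threshold are assumptions, not anything from the codebase): sample the keys of the first batch, and switch to hash-partitioned aggregation only when the observed distinct-key ratio is high.

```rust
use std::collections::HashSet;

/// Hypothetical heuristic: look at the keys of the first batch and prefer
/// hash-partitioned aggregation only when the fraction of distinct keys
/// exceeds `threshold` (deduplication-like workloads).
fn prefer_hash_partitioning(first_batch_keys: &[&str], threshold: f64) -> bool {
    if first_batch_keys.is_empty() {
        return false;
    }
    let distinct: HashSet<_> = first_batch_keys.iter().collect();
    (distinct.len() as f64) / (first_batch_keys.len() as f64) >= threshold
}

fn main() {
    // Mostly-distinct keys: hash partitioning should pay off.
    println!("{}", prefer_hash_partitioning(&["a", "b", "c", "a"], 0.5));
    // One repeated key: the simple single-partition path is cheaper.
    println!("{}", prefer_hash_partitioning(&["a", "a", "a", "a"], 0.5));
}
```

A sampled estimate like this can be wrong if the first batch is unrepresentative, so a real implementation might re-check after a few batches or use a cardinality sketch (e.g. HyperLogLog) instead of an exact set.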
Also, this work creates a building block for Ballista to distribute data across workers, parallelizing the aggregation, avoiding collecting everything onto one worker, and making it scale to bigger datasets.
Metadata