Closed
Description
This is faster (up to 4x in my earlier tests) than the current implementation, which collects all parts into one "full" partition, for cases with very high cardinality in the aggregate keys (think deduplication workloads). However, for very "simple" aggregates with low cardinality, skipping hash partitioning is faster, as less work needs to be done.
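To illustrate the idea (a minimal sketch, not the actual DataFusion implementation; the function names are hypothetical): rows are routed to partitions by the hash of their aggregate key, so each partition can be aggregated independently and the per-partition results can simply be concatenated, since equal keys always land in the same partition.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Route (key, value) rows to `partitions` buckets by key hash, so each
/// bucket can be aggregated independently (e.g. on separate workers).
fn hash_partition(rows: Vec<(String, i64)>, partitions: usize) -> Vec<Vec<(String, i64)>> {
    let mut out: Vec<Vec<(String, i64)>> = vec![Vec::new(); partitions];
    for (key, val) in rows {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        let p = (h.finish() as usize) % partitions;
        out[p].push((key, val));
    }
    out
}

/// Sum values per key within one partition. The per-partition results can
/// be concatenated without a final merge step, because rows with equal
/// keys never end up in different partitions.
fn aggregate(part: &[(String, i64)]) -> HashMap<String, i64> {
    let mut acc = HashMap::new();
    for (key, val) in part {
        *acc.entry(key.clone()).or_insert(0) += val;
    }
    acc
}

fn main() {
    let rows = vec![
        ("a".to_string(), 1),
        ("b".to_string(), 2),
        ("a".to_string(), 3),
    ];
    let parts = hash_partition(rows, 2);
    for (i, part) in parts.iter().enumerate() {
        println!("partition {}: {:?}", i, aggregate(part));
    }
}
```

With high key cardinality this avoids building one giant hash table on a single node; with few distinct keys, the extra hashing and routing is pure overhead, which is why the simple path wins there.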
We probably need some fast way to get a rough estimate of the number of distinct values in the aggregate keys, perhaps determined dynamically from the first batch(es).
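One possible heuristic along those lines (a sketch only; the function name and threshold are assumptions, not anything from the codebase): sample the keys of the first batch, and switch to hash-partitioned aggregation only when the observed distinct-key ratio is high.

```rust
use std::collections::HashSet;

/// Hypothetical heuristic: look at the keys of the first batch and prefer
/// hash-partitioned aggregation only when the fraction of distinct keys
/// exceeds `threshold` (deduplication-like workloads).
fn prefer_hash_partitioning(first_batch_keys: &[&str], threshold: f64) -> bool {
    if first_batch_keys.is_empty() {
        return false;
    }
    let distinct: HashSet<_> = first_batch_keys.iter().collect();
    (distinct.len() as f64) / (first_batch_keys.len() as f64) >= threshold
}

fn main() {
    // Mostly-distinct keys: hash partitioning should pay off.
    println!("{}", prefer_hash_partitioning(&["a", "b", "c", "a"], 0.5));
    // One repeated key: the simple single-partition path is cheaper.
    println!("{}", prefer_hash_partitioning(&["a", "a", "a", "a"], 0.5));
}
```

A sampled estimate like this can be wrong if the first batch is unrepresentative, so a real implementation might re-check after a few batches or use a cardinality sketch (e.g. HyperLogLog) instead of an exact set.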
Also, this work creates a building block for Ballista to distribute data across workers, parallelizing the aggregation, avoiding collecting everything onto one worker, and making it scale to bigger datasets.
Metadata