Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The performance of DataFusion is a very important feature, and aggregation is one of the core algorithms.
For example, it features prominently in the ClickBench queries (#5404).
This ticket covers making it even faster.
Background
Prior Art
After #4924 we have a single GroupByHash implementation that uses the Arrow row format for group keys (good!) -- thank you @ozankabak and @mustafasrepo
However, the aggregator (e.g. the thing storing partial sums, counts, etc.) is not ideal. Depending on the type of aggregate, it is either:

- a Box<dyn RowAccumulator>
- a per-group Box<dyn Accumulator>

The GroupByHash operator manages the mapping from group key to state and passes a RowAccessor to the RowAccumulator if needed. Groups are calculated once per batch, then a take kernel reshuffles (aka copies) everything and slices are passed to the RowAccumulator / Accumulator.
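To make that flow concrete, below is a minimal, self-contained sketch of the pattern described above, not DataFusion's actual code: the Accumulator trait, SumAccumulator, and all helper names are hypothetical simplifications, and plain Vecs stand in for Arrow arrays. It shows group indices being computed once per batch, values being reshuffled (copied) so each group's rows become contiguous, which is the role the take kernel plays, and a contiguous slice being handed to each group's accumulator.

```rust
use std::collections::HashMap;

trait Accumulator {
    fn update(&mut self, values: &[i64]);
    fn value(&self) -> i64;
}

#[derive(Default)]
struct SumAccumulator {
    sum: i64,
}

impl Accumulator for SumAccumulator {
    fn update(&mut self, values: &[i64]) {
        self.sum += values.iter().sum::<i64>();
    }
    fn value(&self) -> i64 {
        self.sum
    }
}

fn main() {
    // One incoming batch: a group key and a value to aggregate per row.
    let keys = vec!["a", "b", "a", "c", "b", "a"];
    let values = vec![1_i64, 10, 2, 100, 20, 3];

    // 1. Compute a group index for every row (done once per batch).
    let mut group_ids: HashMap<&str, usize> = HashMap::new();
    let mut accumulators: Vec<Box<dyn Accumulator>> = Vec::new(); // one per group
    let mut row_groups = Vec::with_capacity(keys.len());
    for &key in &keys {
        let next = accumulators.len();
        let gid = *group_ids.entry(key).or_insert_with(|| {
            accumulators.push(Box::new(SumAccumulator::default()));
            next
        });
        row_groups.push(gid);
    }

    // 2. "take": copy the values so each group's rows become contiguous.
    let mut taken: Vec<Vec<i64>> = vec![Vec::new(); accumulators.len()];
    for (&gid, &value) in row_groups.iter().zip(&values) {
        taken[gid].push(value);
    }

    // 3. Hand one contiguous slice per group to that group's accumulator
    //    (a dyn dispatch per group, per batch).
    for (gid, accumulator) in accumulators.iter_mut().enumerate() {
        accumulator.update(&taken[gid]);
    }

    for (key, &gid) in &group_ids {
        println!("{key}: {}", accumulators[gid].value());
    }
}
```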
Proposal
Ditch the word-aligned row format for the state management and change the aggregator to:
trait Aggregator: MemoryConsumer {
    /// Update aggregator state for given groups.
    fn update_batch(&mut self, keys: Rows, batch: &RecordBatch) -> Result<()>;
    ...
}
Hence the aggregator will be dyn-dispatched ONCE per record batch and will keep its own internal state. This moves the key->state map from [row_]hash.rs into the aggregators. We will provide tooling (macros or generics) to simplify the implementation and to avoid boilerplate code as much as possible.
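As an illustration of the proposed shape, here is a small self-contained sketch of what such an aggregator could look like. The real trait would take the Arrow Rows group keys and a RecordBatch and would extend MemoryConsumer; this sketch substitutes plain Rust types, and SumAggregator and its methods are hypothetical names used only for illustration.

```rust
use std::collections::HashMap;

// Simplified stand-in for the Arrow row-format group key.
type GroupKey = Vec<u8>;

trait Aggregator {
    /// Called once per record batch (a single dyn dispatch per batch); the
    /// aggregator updates its own per-group state row by row.
    fn update_batch(&mut self, keys: &[GroupKey], values: &[i64]);
    /// Drain the accumulated per-group results.
    fn finish(self: Box<Self>) -> Vec<(GroupKey, i64)>;
}

#[derive(Default)]
struct SumAggregator {
    // The group-key -> state map lives inside the aggregator,
    // not in [row_]hash.rs.
    sums: HashMap<GroupKey, i64>,
}

impl Aggregator for SumAggregator {
    fn update_batch(&mut self, keys: &[GroupKey], values: &[i64]) {
        // No take/reshuffle step: iterate the batch row by row and update
        // the state of whichever group the row belongs to.
        for (key, value) in keys.iter().zip(values) {
            *self.sums.entry(key.clone()).or_insert(0) += *value;
        }
    }

    fn finish(self: Box<Self>) -> Vec<(GroupKey, i64)> {
        self.sums.into_iter().collect()
    }
}

fn main() {
    let mut aggregator: Box<dyn Aggregator> = Box::new(SumAggregator::default());
    let keys: Vec<GroupKey> = vec![b"a".to_vec(), b"b".to_vec(), b"a".to_vec()];
    aggregator.update_batch(&keys, &[1, 10, 2]);
    aggregator.update_batch(&keys, &[3, 30, 4]);
    for (key, sum) in aggregator.finish() {
        println!("{}: {sum}", String::from_utf8_lossy(&key));
    }
}
```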
Note that this also removes the take kernel, since we think it doesn't provide any performance improvement over iterating over the hashable rows and performing the aggregation row by row. We may bring the take handling back (as a "perform take if at least one aggregator wants it" option) if we can prove (via benchmarks) that this is desirable for certain aggregators, but we leave this out for now.
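If that optional take handling were ever brought back, one hypothetical way to express "perform take only if at least one aggregator wants it" is an opt-in hook with a default implementation; the method name below is invented for illustration and is not part of the proposal.

```rust
// Hypothetical sketch of an opt-in take hook; the method name and default
// implementation are illustrative only.
trait Aggregator {
    /// If any aggregator returns true, the operator performs the take
    /// reshuffle once per batch and hands contiguous per-group slices to
    /// those aggregators; otherwise every aggregator receives the batch
    /// as-is and iterates it row by row.
    fn prefers_contiguous_groups(&self) -> bool {
        false
    }
    // ... update_batch etc. as in the proposal ...
}
```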
This also moves the memory consumer handling into the aggregator since the aggregator knows best how to split states and spill data.
Implementation Plan
TBD
Additional Context
Note this was broken out of #2723 and is based on the wonderful writeup from @crepererum and @tustvold on #2723 (comment)