
Improve the performance of Aggregator, grouping, aggregation #4973

@alamb

Description

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The performance of DataFusion is a very important feature, and aggregation is one of the core algorithms.

For example, it features prominently in the ClickBench queries (#5404).

This ticket covers making it even faster.

Background

Prior Art

After #4924 we have a single GroupByHash implementation that uses the Arrow row format for group keys (good!) -- thank you @ozankabak and @mustafasrepo

However, the aggregator (i.e. the thing storing the partial sums, counts, etc.) is not ideal. Depending on the type of aggregate, it is either:

  1. Box<dyn RowAccumulator> per group
  2. Box<dyn Accumulator>

The GroupByHash operator manages the mapping from group key to state and passes a RowAccessor to the RowAccumulator if needed.

Groups are calculated once per batch, then a take kernel reshuffles (aka copies) everything and slices are passed to the RowAccumulator / Accumulator.
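
For orientation, here is a rough sketch of that current shape with simplified stand-in types (GroupKey, Accumulator, and GroupByHashState below are placeholders for illustration, not the actual DataFusion types):

use std::collections::HashMap;

// Simplified stand-ins for the real types, for illustration only.
type GroupKey = Vec<u8>; // stand-in for an Arrow row-format group key

trait Accumulator {
    // stand-in for the real Accumulator::update_batch, which takes ArrayRefs
    fn update_batch(&mut self, values: &[f64]);
}

// Today (roughly): the operator owns the key -> state mapping and, per batch,
// computes group indices, `take`s the input into per-group slices, and calls
// each group's boxed accumulators with its slice.
struct GroupByHashState {
    group_states: HashMap<GroupKey, Vec<Box<dyn Accumulator>>>,
}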

Proposal

Ditch the word-aligned row format for the state management and change the aggregator to:

trait Aggregator: MemoryConsumer {
    /// Update aggregator state for given groups.
    fn update_batch(&mut self, keys: Rows, batch: &RecordBatch) -> Result<()>;
    ...
}

Hence the aggregator will be dyn-dispatched ONCE per record batch and will keep its own internal state. This moves the key->state map from [row_]hash.rs into the aggregators. We will provide tooling (macros or generics) to simplify the implementation and to avoid boilerplate code as much as possible.
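
As a minimal sketch (again with simplified stand-in types; not the final API), a SUM aggregator under this proposal could own its own group -> state map and do the per-row work inside a single update_batch call:

use std::collections::HashMap;

type GroupKey = Vec<u8>; // stand-in for an arrow-row key taken from `Rows`

// Hypothetical SUM aggregator: one dynamic dispatch per batch, internal state.
struct SumAggregator {
    sums: HashMap<GroupKey, f64>,
}

impl SumAggregator {
    // In the real trait this would take `keys: Rows` and `batch: &RecordBatch`;
    // plain slices are used here to keep the sketch self-contained.
    fn update_batch(&mut self, keys: &[GroupKey], values: &[f64]) {
        // iterate row-by-row; no `take` reshuffle of the input batch
        for (key, value) in keys.iter().zip(values) {
            *self.sums.entry(key.clone()).or_insert(0.0) += *value;
        }
    }
}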

Note that this also removes the take kernel, since we think it doesn't provide any performance improvement over iterating over the hashable rows and performing the aggregation row-by-row. We may bring the take handling back (as a "perform take if at least one aggregator wants it" step) if we can prove (via benchmarks) that this is desirable for certain aggregators, but we leave this out for now.

This also moves the memory consumer handling into the aggregator since the aggregator knows best how to split states and spill data.
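
For example (purely hypothetical; the real integration would go through DataFusion's memory manager rather than a bare method like this), an aggregator that owns its states can account for and spill them itself:

impl SumAggregator {
    // Hypothetical helper: since the aggregator owns `sums`, it can report
    // its own memory footprint and decide when/how to spill.
    fn mem_used(&self) -> usize {
        self.sums
            .iter()
            .map(|(key, _sum)| key.len() + std::mem::size_of::<f64>())
            .sum()
    }
}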

Implementation Plan

TBD

Additional Context

Note this was broken out of #2723 and is based on the wonderful writeup from @crepererum and @tustvold on #2723 (comment)
