Optimize hash_aggregate when there are no null group keys #850

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The code in hash_aggregate.rs is general and works for data with and without nulls. However, there are optimizations that can be done. One such optimization, suggested by @andygrove and @Dandandan on #844 (comment), is to add an optimized code path for when there are no NULL values in the input groups, which avoids the cost of checking for null on each group.

While this might sound trivial, the null check is on the hot path (done for every single row that is grouped), so removing it may improve performance by a measurable amount.

Describe the solution you'd like

  1. A new function or parameter in ScalarValue::eq_array (e.g. ScalarValue::eq_array_non_null) that assumes the input has no nulls and does not check Array::is_valid
  2. A check in hash_aggregate that tests whether the null count in all group columns is 0 and, if so, invokes the specialized ScalarValue::eq_array_non_null
  3. Some sort of performance benchmark results showing that it improves grouping performance (there is a list of benchmarks on "Rework GroupByHash to for faster performance and support grouping by nulls" #808 that might be able to inspire you)
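
To make steps 1 and 2 concrete, here is a minimal sketch of the idea. It does not use the real DataFusion or Arrow types: `Int64Column`, `from_options`, `count_matches`, and the free-function signatures of `eq_array` / `eq_array_non_null` are all hypothetical stand-ins for illustration, and an actual change would need to follow the existing `ScalarValue::eq_array` method and cover every group column, not just one.

```rust
/// Hypothetical stand-in for an Arrow Int64 array: values plus a validity mask.
/// (Assumption for illustration only; not the real Arrow type.)
struct Int64Column {
    values: Vec<i64>,
    /// validity[i] == false means row i is NULL.
    validity: Vec<bool>,
    null_count: usize,
}

impl Int64Column {
    /// Builds a column from optional values; NULL slots store a placeholder 0.
    fn from_options(rows: &[Option<i64>]) -> Self {
        let values: Vec<i64> = rows.iter().map(|v| v.unwrap_or(0)).collect();
        let validity: Vec<bool> = rows.iter().map(|v| v.is_some()).collect();
        let null_count = validity.iter().filter(|valid| !**valid).count();
        Self { values, validity, null_count }
    }
}

/// General path: compares a group key (which may itself be NULL) with row
/// `index`, consulting the validity mask on every call.
fn eq_array(key: Option<i64>, col: &Int64Column, index: usize) -> bool {
    match key {
        Some(k) => col.validity[index] && col.values[index] == k,
        None => !col.validity[index],
    }
}

/// Specialized path: the caller guarantees `col.null_count == 0`, so the
/// validity mask is never read. A NULL key can never match a non-null row.
fn eq_array_non_null(key: Option<i64>, col: &Int64Column, index: usize) -> bool {
    key == Some(col.values[index])
}

/// Counts the rows equal to `key`, choosing the comparison once per batch
/// instead of checking for nulls once per row.
fn count_matches(key: Option<i64>, col: &Int64Column) -> usize {
    let len = col.values.len();
    if col.null_count == 0 {
        (0..len).filter(|&i| eq_array_non_null(key, col, i)).count()
    } else {
        (0..len).filter(|&i| eq_array(key, col, i)).count()
    }
}
```

The point of the design is that the `null_count == 0` test runs once per batch, while the comparison it selects runs once per row. Whether that saving is measurable is exactly what the benchmark in step 3 would need to show.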

Describe alternatives you've considered
The performance benefit may not be worth the additional code complexity, but we won't know until we try.


Labels: enhancement (New feature or request)
