External aggregation reserves more memory than actual usage

### Describe the bug

The query below requires 65M of memory to run. With the memory limit set to 50M, it fails.
Run in datafusion-cli:
```
cargo run -- --mem-pool-type fair -m 50M -c "
select t1.v1,  sum(t2.v1)
from
unnest(generate_series(1,1000)) as t1(v1)
, unnest(generate_series(1,1000)) as t2(v1)
group by t1.v1, t2.v1"

Error: External error: Resources exhausted: Failed to allocate additional 47616 bytes for GroupedHashAggregateStream[0] with 3995896 bytes already allocated for this reservation - 4031073 bytes remain available for the total pool
```

The issue is that memory usage is over-estimated during the sort-merge step:
https://github.com/apache/datafusion/blob/f2da32b3bde851c34e9df0a2f4c174a5392f8897/datafusion/physical-plan/src/sorts/builder.rs#L72
For example, if a `RecordBatch` has 3 arrays that all share the same buffers, `record_batch.get_array_memory_size()` estimates 3X the actual memory consumption.
(The original `RecordBatch`es passing through datafusion operators don't share `Buffer`s between columns. In spilling queries, however, `RecordBatch`es are first written to disk and then read back, and the batches read back reuse `Buffer`s across different column arrays.)
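To make the 3X over-count concrete, here is a minimal std-only sketch of the accounting problem. It does not use the arrow API: `ColumnView`, `naive_size`, `deduplicated_size`, and `shared_batch` are hypothetical names, and an `Rc<Vec<u8>>` stands in for a shared `Buffer`. The naive estimate charges every column for its backing buffer, while the deduplicated one counts each distinct buffer once.

```rust
use std::collections::HashSet;
use std::rc::Rc;

/// Hypothetical stand-in for an array whose data lives in a
/// reference-counted buffer that other columns may also point to.
struct ColumnView {
    buffer: Rc<Vec<u8>>,
}

/// Naive estimate: charge every column for its whole backing buffer,
/// even when several columns point at the same allocation.
fn naive_size(columns: &[ColumnView]) -> usize {
    columns.iter().map(|c| c.buffer.len()).sum()
}

/// Deduplicated estimate: count each distinct allocation once,
/// identified by the address of the shared buffer.
fn deduplicated_size(columns: &[ColumnView]) -> usize {
    let mut seen = HashSet::new();
    columns
        .iter()
        .filter(|c| seen.insert(Rc::as_ptr(&c.buffer)))
        .map(|c| c.buffer.len())
        .sum()
}

/// Builds three columns backed by one 3000-byte buffer, mimicking a
/// batch read back from a spill file.
fn shared_batch() -> Vec<ColumnView> {
    let shared = Rc::new(vec![0u8; 3000]);
    (0..3)
        .map(|_| ColumnView { buffer: Rc::clone(&shared) })
        .collect()
}

fn main() {
    let columns = shared_batch();
    println!("naive estimate:        {} bytes", naive_size(&columns));
    println!("deduplicated estimate: {} bytes", deduplicated_size(&columns));
}
```

In this model the naive estimate is 9000 bytes against 3000 bytes actually allocated, which is the same 3X ratio described above.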

The root cause has already been reported in `arrow-rs`: https://github.com/apache/arrow-rs/issues/6363
Once it is fixed in arrow, we should check whether this aggregation query runs successfully, and also add tests.

### To Reproduce

_No response_

### Expected behavior

_No response_

### Additional context

_No response_

External aggregation reserves more memory than actual usage #13089
