**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
Upstream in DataFusion, there is a common pattern where we have multiple input `RecordBatch`es and want to produce an output `RecordBatch` with some subset of the rows from the input batches. This happens in:

- `FilterExec` --> `CoalesceBatchesExec` when filtering
- `RepartitionExec` --> `CoalesceBatchesExec`
The kernels used here are:

- `FilterExec` uses `filter`, which takes a single input `Array` and produces a single output `Array`
- `RepartitionExec` uses `take`, which also takes a single input `Array` and produces a single output `Array`
- `CoalesceBatchesExec` calls `concat`, which takes multiple `Array`s and produces a single `Array` as output
The use of these kernels and patterns has two downsides:

- **Performance overhead due to a second copy**: calling `filter`/`take` immediately copies the data, which is then copied *again* in `CoalesceBatchesExec` (see illustration below)
- **Memory / performance overhead from garbage collecting StringView**: buffering up several `RecordBatch`es with StringView may consume significant amounts of memory for mostly filtered rows, which requires us to run gc periodically, and that actually slows some things down (see Reduce copying in `CoalesceBatchesExec` for StringViews, datafusion#11628)
Here is an ascii art picture (from apache/datafusion#7957) that shows the extra copy in action
```text
┌────────────────────┐   Filter
│                    │             ┌────────────────────┐   Coalesce
│    RecordBatch     │─ ─ ─ ─ ─ ─ ▶│    RecordBatch     │    Batches
│  num_rows = 8000   │             │   num_rows = 234   │─ ─ ─ ─ ─ ┐
│                    │             └────────────────────┘          │
│                    │                                             │
└────────────────────┘                                             │     ┌────────────────────┐
┌────────────────────┐   Filter                                    │     │                    │
│                    │             ┌────────────────────┐          │     │                    │
│    RecordBatch     │─ ─ ─ ─ ─ ─ ▶│    RecordBatch     │          │     │                    │
│  num_rows = 8000   │             │   num_rows = 500   │─ ─ ─ ─ ─ ┤     │    RecordBatch     │
│                    │             └────────────────────┘          ├ ─ ─▶│  num_rows = 8000   │
└────────────────────┘                                             │     │                    │
         ...                            ...                        │     │                    │
┌────────────────────┐   Filter                                    │     │                    │
│                    │             ┌────────────────────┐          │     └────────────────────┘
│    RecordBatch     │─ ─ ─ ─ ─ ─ ▶│    RecordBatch     │          │
│  num_rows = 8000   │             │   num_rows = 333   │─ ─ ─ ─ ─ ┘
│                    │             └────────────────────┘
└────────────────────┘

FilterExec / RepartitionExec creates      CoalesceBatchesExec copies the data
output batches with copies of the         *again* to form the final large
matching rows (calls take() to make       RecordBatches
a copy)
```
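The two copies can be sketched with stand-in types (plain `Vec<i32>` instead of Arrow arrays; the `filter` and `concat` functions below are simplified stand-ins for the real kernels, not the arrow-rs API):

```rust
/// Stand-in for the `filter` kernel: copies the selected rows (copy #1).
fn filter(input: &[i32], mask: &[bool]) -> Vec<i32> {
    input
        .iter()
        .zip(mask.iter())
        .filter_map(|(v, keep)| if *keep { Some(*v) } else { None })
        .collect()
}

/// Stand-in for the `concat` kernel: copies all rows again (copy #2).
fn concat(batches: &[Vec<i32>]) -> Vec<i32> {
    batches.iter().flatten().copied().collect()
}

fn main() {
    let inputs = vec![vec![1, 2, 3, 4], vec![5, 6, 7, 8]];
    let mask = [true, false, true, false];
    // FilterExec: one small output batch per input batch (first copy)
    let filtered: Vec<Vec<i32>> = inputs.iter().map(|b| filter(b, &mask)).collect();
    // CoalesceBatchesExec: concat the small batches (second copy)
    let coalesced = concat(&filtered);
    println!("{:?}", coalesced); // [1, 3, 5, 7]
}
```

Every surviving row is touched twice: once when `filter` materializes the small batch, and once more when `concat` builds the final batch.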
**Describe the solution you'd like**
I would like to apply `filter`/`take` to each incoming `RecordBatch` as it arrives, copying the data into an in-progress output array, in a way that is as fast as the `filter` and `take` operations. This would avoid the extra copy that is currently required.

Note this is somewhat like the `interleave` kernel, except that:

- We only need the output rows to be in the same order as the input batches (so the second `usize` batch index is not needed)
- We don't want to have to buffer all the input
**Describe alternatives you've considered**
One thing I have thought about is extending the builders so they can append more than one row at a time. For example:

- `Builder::append_filtered`
- `Builder::append_take`
So for example, to filter a stream of StringViewArrays I might do something like:

```rust
let mut builder = StringViewBuilder::new();
while let Some(input) = stream.next() {
    // compute some subset of input rows that make it to the output
    let filter: BooleanArray = compute_filter(&input, ...);
    // append all rows from input where filter[i] is true
    builder.append_filtered(&input, &filter);
}
```
And also add an equivalent `append_take`.
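The `append_take` counterpart could look like this (same hypothetical stand-in builder as above, mirroring the `take` kernel without materializing an intermediate array):

```rust
/// Hypothetical stand-in builder illustrating the proposed `append_take`.
struct Int32Builder {
    values: Vec<i32>,
}

impl Int32Builder {
    fn new() -> Self {
        Self { values: Vec::new() }
    }

    /// Append `input[i]` for each index `i`, in index order,
    /// copying directly into the in-progress output.
    fn append_take(&mut self, input: &[i32], indices: &[usize]) {
        for &i in indices {
            self.values.push(input[i]);
        }
    }
}

fn main() {
    let mut builder = Int32Builder::new();
    builder.append_take(&[10, 20, 30, 40], &[3, 0, 2]);
    println!("{:?}", builder.values); // [40, 10, 30]
}
```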
I think if we did this right, it wouldn't be a lot of new code; we could just refactor the existing filter/take implementations. For example, I would expect that the `filter` kernel would then devolve into something like:

```rust
fn filter(..) {
    match data_type {
        DataType::Int8 => Int8Builder::with_capacity(...)
            .append_filtered(input, filter)
            .build(),
        ...
    }
}
```
**Additional context**