Optimize take/filter/concat from multiple input arrays to a single large output array #6692

@alamb

Description

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Upstream in DataFusion, there is a common pattern where we have multiple input RecordBatches and want to produce an output RecordBatch with some subset of the rows from the input batches. This happens in

  • FilterExec --> CoalesceBatchesExec when filtering
  • RepartitionExec --> CoalesceBatchesExec

The kernels used here are:

  • FilterExec uses filter, takes a single input Array and produces a single output Array
  • RepartitionExec uses take, which also takes a single input Array and produces a single output Array
  • CoalesceBatchesExec calls concat, which takes multiple Arrays and produces a single Array as output

The use of these kernels and patterns has two downsides:

  1. Performance overhead due to a second copy: Calling filter/take immediately copies the data, which is copied again in CoalesceBatches (see illustration below)
  2. Memory Overhead / Performance Overhead for Garbage Collecting StringView: Buffering up several RecordBatches with StringView may consume significant amounts of memory for mostly-filtered rows, which requires running gc periodically, which itself slows some things down (see Reduce copying in CoalesceBatchesExec for StringViews datafusion#11628)

Here is an ascii art picture (from apache/datafusion#7957) that shows the extra copy in action


┌────────────────────┐        Filter                                                                          
│                    │                    ┌────────────────────┐            Coalesce                          
│                    │    ─ ─ ─ ─ ─ ─ ▶   │    RecordBatch     │             Batches                          
│    RecordBatch     │                    │   num_rows = 234   │─ ─ ─ ─ ─ ┐                                   
│  num_rows = 8000   │                    └────────────────────┘                                              
│                    │                                                    │                                   
│                    │                                                                ┌────────────────────┐  
└────────────────────┘                                                    │           │                    │  
┌────────────────────┐                    ┌────────────────────┐                      │                    │  
│                    │        Filter      │                    │          │           │                    │  
│                    │                    │    RecordBatch     │           ─ ─ ─ ─ ─ ▶│                    │  
│    RecordBatch     │    ─ ─ ─ ─ ─ ─ ▶   │   num_rows = 500   │─ ─ ─ ─ ─ ┐           │                    │  
│  num_rows = 8000   │                    │                    │                      │    RecordBatch     │  
│                    │                    │                    │          └ ─ ─ ─ ─ ─▶│  num_rows = 8000   │  
│                    │                    └────────────────────┘                      │                    │  
└────────────────────┘                                                                │                    │  
                                                    ...                    ─ ─ ─ ─ ─ ▶│                    │  
          ...                   ...                                       │           │                    │  
                                                                                      │                    │  
┌────────────────────┐                                                    │           └────────────────────┘  
│                    │                    ┌────────────────────┐                                              
│                    │       Filter       │                    │          │                                   
│    RecordBatch     │                    │    RecordBatch     │                                              
│  num_rows = 8000   │   ─ ─ ─ ─ ─ ─ ▶    │   num_rows = 333   │─ ─ ─ ─ ─ ┘                                   
│                    │                    │                    │                                              
│                    │                    └────────────────────┘                                              
└────────────────────┘                                                                                        
                                                                                                              
                      FilterExec                                          CoalesceBatchesExec copies the     
                      creates output batches with copies                  data *again* to form final large   
                      of the matching rows (calls take()                  RecordBatches                      
                      to make a copy)                                                                        
                                                                                                              

Describe the solution you'd like

I would like to apply filter/take to each incoming RecordBatch as it arrives, copying the data to an in-progress output array, in a way that is as fast as the filter and take operations. This would avoid the extra copy that is currently required.

Note this is somewhat like the interleave kernel, except that

  1. We only need the output rows to be in the same order as the input batches (so the second usize batch index is not needed)
  2. We don't want to have to buffer all the input

Describe alternatives you've considered

One thing I have thought about is extending the builders so they can append more than one row at a time. For example:

  • Builder::append_filtered
  • Builder::append_take

So for example, to filter a stream of StringViewArrays I might do something like:

let mut builder = StringViewBuilder::new();
while let Some(input) = stream.next() {
  // compute some subset of input rows that make it to the output
  let filter: BooleanArray = compute_filter(&input, ....); 
  // append all rows from input where filter[i] is true
  builder.append_filtered(&input, &filter);
}

And also add an equivalent for append_take
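The take-based loop might look like the following sketch. Note that `append_take` is the proposed (not yet existing) method, and `compute_partition_indices` is a hypothetical helper standing in for whatever produces the row indices:

```rust
// Sketch only: `append_take` is the proposed API, not part of arrow-rs today
let mut builder = StringViewBuilder::new();
while let Some(input) = stream.next() {
  // compute the indices of the input rows that make it to the output
  // (compute_partition_indices is a hypothetical helper)
  let indices: UInt32Array = compute_partition_indices(&input);
  // append input[indices[0]], input[indices[1]], ... to the in-progress output
  builder.append_take(&input, &indices);
}
let output = builder.finish();
```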

I think if we did this right, it wouldn't be a lot of new code, we could just refactor the existing filter/take implementations. For example, I would expect that the filter kernel would then devolve into something like

fn filter(..) {
  match data_type {
    DataType::Int8 => Int8Builder::with_capacity(...)
      .append_filtered(input, filter)
      .finish(),
    ...
  }
}

Additional context
