**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
Upstream in DataFusion, there is a common pattern where we have multiple input `RecordBatch`es and want to produce an output `RecordBatch` with some subset of the rows from the input batches. This happens in:

- `FilterExec` --> `CoalesceBatchesExec` when filtering
- `RepartitionExec` --> `CoalesceBatchesExec`
The kernels used here are:

- `FilterExec` uses `filter`, which takes a single input `Array` and produces a single output `Array`
- `RepartitionExec` uses `take`, which also takes a single input `Array` and produces a single output `Array`
- `CoalesceBatchesExec` calls `concat`, which takes multiple `Array`s and produces a single `Array` as output
The use of these kernels and patterns has two downsides:

- **Performance overhead due to a second copy**: calling `filter`/`take` immediately copies the data, which is then copied *again* in `CoalesceBatchesExec` (see illustration below)
- **Memory / performance overhead from garbage collecting StringView**: buffering up several `RecordBatch`es with StringView may consume significant amounts of memory for mostly filtered rows, which requires us to run gc periodically, and that actually slows some things down (see Reduce copying in `CoalesceBatchesExec` for StringViews, datafusion#11628)
Here is an ascii art picture (from apache/datafusion#7957) that shows the extra copy in action
```text
┌────────────────────┐   Filter
│                    │             ┌────────────────────┐   Coalesce
│    RecordBatch     │─ ─ ─ ─ ─ ─ ▶│    RecordBatch     │    Batches
│  num_rows = 8000   │             │   num_rows = 234   │─ ─ ─ ─ ─ ┐
│                    │             └────────────────────┘          │
│                    │                                             │
└────────────────────┘                                             │     ┌────────────────────┐
┌────────────────────┐   Filter                                    │     │                    │
│                    │             ┌────────────────────┐          │     │                    │
│    RecordBatch     │─ ─ ─ ─ ─ ─ ▶│    RecordBatch     │          │     │                    │
│  num_rows = 8000   │             │   num_rows = 500   │─ ─ ─ ─ ─ ┤     │    RecordBatch     │
│                    │             └────────────────────┘          ├ ─ ─▶│  num_rows = 8000   │
└────────────────────┘                                             │     │                    │
         ...                            ...                        │     │                    │
┌────────────────────┐   Filter                                    │     │                    │
│                    │             ┌────────────────────┐          │     └────────────────────┘
│    RecordBatch     │─ ─ ─ ─ ─ ─ ▶│    RecordBatch     │          │
│  num_rows = 8000   │             │   num_rows = 333   │─ ─ ─ ─ ─ ┘
│                    │             └────────────────────┘
└────────────────────┘

FilterExec / RepartitionExec creates      CoalesceBatchesExec copies the data
output batches with copies of the         *again* to form the final large
matching rows (calls take() to make       RecordBatches
a copy)
```
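The two copies can be sketched with stand-in types (plain `Vec<i32>` instead of Arrow arrays; the `filter` and `concat` functions below are simplified stand-ins for the real kernels, not the arrow-rs API):

```rust
/// Stand-in for the `filter` kernel: copies the selected rows (copy #1).
fn filter(input: &[i32], mask: &[bool]) -> Vec<i32> {
    input
        .iter()
        .zip(mask.iter())
        .filter_map(|(v, keep)| if *keep { Some(*v) } else { None })
        .collect()
}

/// Stand-in for the `concat` kernel: copies all rows again (copy #2).
fn concat(batches: &[Vec<i32>]) -> Vec<i32> {
    batches.iter().flatten().copied().collect()
}

fn main() {
    let inputs = vec![vec![1, 2, 3, 4], vec![5, 6, 7, 8]];
    let mask = [true, false, true, false];
    // FilterExec: one small output batch per input batch (first copy)
    let filtered: Vec<Vec<i32>> = inputs.iter().map(|b| filter(b, &mask)).collect();
    // CoalesceBatchesExec: concat the small batches (second copy)
    let coalesced = concat(&filtered);
    println!("{:?}", coalesced); // [1, 3, 5, 7]
}
```

Every surviving row is touched twice: once when `filter` materializes the small batch, and once more when `concat` builds the final batch.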
**Describe the solution you'd like**
I would like to apply `filter`/`take` to each incoming `RecordBatch` as it arrives, copying the data into an in-progress output array, in a way that is as fast as the `filter` and `take` operations. This would avoid the extra copy that is currently required.

Note this is somewhat like the `interleave` kernel, except that:

- We only need the output rows to be in the same order as the input batches (so the second `usize` batch index is not needed)
- We don't want to have to buffer all the input
**Describe alternatives you've considered**
One thing I have thought about is extending the builders so they can append more than one row at a time. For example:

- `Builder::append_filtered`
- `Builder::append_take`
So for example, to filter a stream of StringViewArrays I might do something like:

```rust
let mut builder = StringViewBuilder::new();
while let Some(input) = stream.next() {
    // compute some subset of input rows that make it to the output
    let filter: BooleanArray = compute_filter(&input, ...);
    // append all rows from input where filter[i] is true
    builder.append_filtered(&input, &filter);
}
```
And also add an equivalent `append_take`.
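The `append_take` counterpart could look like this (same hypothetical stand-in builder as above, mirroring the `take` kernel without materializing an intermediate array):

```rust
/// Hypothetical stand-in builder illustrating the proposed `append_take`.
struct Int32Builder {
    values: Vec<i32>,
}

impl Int32Builder {
    fn new() -> Self {
        Self { values: Vec::new() }
    }

    /// Append `input[i]` for each index `i`, in index order,
    /// copying directly into the in-progress output.
    fn append_take(&mut self, input: &[i32], indices: &[usize]) {
        for &i in indices {
            self.values.push(input[i]);
        }
    }
}

fn main() {
    let mut builder = Int32Builder::new();
    builder.append_take(&[10, 20, 30, 40], &[3, 0, 2]);
    println!("{:?}", builder.values); // [40, 10, 30]
}
```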
I think if we did this right, it wouldn't be a lot of new code; we could just refactor the existing filter/take implementations. For example, I would expect that the `filter` kernel would then devolve into something like:

```rust
fn filter(..) {
    match data_type {
        DataType::Int8 => Int8Builder::with_capacity(...)
            .append_filtered(input, filter)
            .build(),
        ...
    }
}
```
**Additional context**