StringView: Using the Interleave kernel (and potentially others) results in many repeated buffers in variadic_buffers #6780

@alamb

Describe the bug
Quoting @onursatici from #6779:

Currently, interleaving ByteViewArrays is done with the fallback implementation, which uses a MutableArrayBuilder. The extend method on this builder copies all variadic buffers because it doesn't know whether some buffers are not referenced by any view in the array.

This is especially visible in DataFusion's TopK implementation, which uses a heap that interleaves Arrow arrays to produce the top k rows: the current interleave implementation causes an explosion in the variadic buffer count for byte view arrays, adding the same set of buffers over and over again. This becomes really problematic when sending such arrays over Flight, because the current encoder materialises all variadic buffers.
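To make the growth described above concrete, here is a minimal std-only sketch. It is a toy model, not the arrow-rs implementation: `ToyViewArray`, `naive_interleave`, and `value` are hypothetical names, and the model assumes that one interleave step carries over every buffer of every input without checking whether it is already present.

```rust
use std::sync::Arc;

/// Toy stand-in for a byte-view array (NOT the arrow-rs layout):
/// shared data buffers plus views stored as (buffer index, offset, length).
#[derive(Clone)]
struct ToyViewArray {
    buffers: Vec<Arc<Vec<u8>>>,
    views: Vec<(usize, usize, usize)>,
}

/// Hypothetical model of the fallback path: the output keeps the full buffer
/// list of every input, and each picked view's buffer index is shifted by
/// the position where its source's buffers start in the output.
/// `picks` holds (input index, row index) pairs.
fn naive_interleave(sources: &[&ToyViewArray], picks: &[(usize, usize)]) -> ToyViewArray {
    let mut buffers = Vec::new();
    let mut bases = Vec::new();
    for src in sources {
        bases.push(buffers.len());
        // Only the Arc handles are cloned, but the buffer *list* keeps growing.
        buffers.extend(src.buffers.iter().cloned());
    }
    let views = picks
        .iter()
        .map(|&(input, row)| {
            let (buf, offset, len) = sources[input].views[row];
            (bases[input] + buf, offset, len)
        })
        .collect();
    ToyViewArray { buffers, views }
}

/// Resolve a view to the bytes it points at.
fn value(array: &ToyViewArray, row: usize) -> &[u8] {
    let (buf, offset, len) = array.views[row];
    &array.buffers[buf][offset..offset + len]
}
```

In this model, repeatedly interleaving an accumulated result with an array that shares its single buffer adds one more handle to that same buffer on every round, even though the set of distinct buffers never changes. An encoder that writes out every entry of the list would then write the same bytes once per handle.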

This also came up recently on #6779 from @ShiKaiWi and in a conversation with @tustvold, @XiangpengHao, and myself here: #6427 (comment)

To Reproduce
Call interleave or concat with a bunch of StringViewArrays (I think)

Expected behavior
(Ideally) if a buffer is already in a StringViewArray's variadic_buffers list, it shouldn't be added again.
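One way to picture the expected behavior is to deduplicate buffers by identity and remap the views. The sketch below is a std-only toy model under stated assumptions: `ToyViewArray` and `dedup_buffers` are hypothetical names, `Arc<Vec<u8>>` stands in for a shared Arrow buffer, and pointer equality stands in for "the same buffer". It is not the fix proposed in arrow-rs.

```rust
use std::sync::Arc;

/// Toy stand-in for a byte-view array (NOT the arrow-rs layout):
/// shared data buffers plus views stored as (buffer index, offset, length).
#[derive(Clone)]
struct ToyViewArray {
    buffers: Vec<Arc<Vec<u8>>>,
    views: Vec<(usize, usize, usize)>,
}

/// Keep each distinct buffer once (compared by pointer identity) and rewrite
/// every view's buffer index to point at the surviving copy.
fn dedup_buffers(array: &ToyViewArray) -> ToyViewArray {
    let mut buffers: Vec<Arc<Vec<u8>>> = Vec::new();
    // remap[old_index] = index of the same buffer in the deduplicated list.
    let mut remap = Vec::with_capacity(array.buffers.len());
    for buf in &array.buffers {
        let idx = match buffers.iter().position(|kept| Arc::ptr_eq(kept, buf)) {
            Some(existing) => existing,
            None => {
                buffers.push(Arc::clone(buf));
                buffers.len() - 1
            }
        };
        remap.push(idx);
    }
    let views = array
        .views
        .iter()
        .map(|&(buf, offset, len)| (remap[buf], offset, len))
        .collect();
    ToyViewArray { buffers, views }
}
```

The linear scan keeps the sketch short; with many buffers, a hash map keyed on the buffer's address (for example via `Arc::as_ptr`) would avoid the quadratic lookup. Note that this only removes repeated handles to the same allocation; it does not drop buffers that no view references.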

Additional context

Labels

arrow (Changes to the arrow crate), bug, next-major-release (the PR has API changes and is waiting on the next major version)