
Potential performance improvements for reading Parquet to StringViewArray/BinaryViewArray #5904

Closed
@alamb


Is your feature request related to a problem or challenge? Please describe what you are trying to do.

In https://github.com/apache/arrow-rs/issues/ @ariesdevil, @XiangpengHao, and I implemented fairly fast reading of Parquet data into Arrow StringViewArray.

The solution we have so far is #5877, which doesn't copy the string data 🎉 but does track a set of offsets that are then converted into a StringViewArray.

@ariesdevil had a more comprehensive approach in #5557 that built the StringViews directly from the encoded data, but it hadn't yet removed the string copies.
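To make the "offsets converted into views" step concrete, here is a minimal, stdlib-only sketch of how a 16-byte Arrow view can be derived from a value's position in a shared data buffer without copying long strings. It follows the view layout in the Arrow columnar format (4-byte length, then either up to 12 inline bytes, or a 4-byte prefix plus buffer index and offset). The function names `make_view`, `views_from_offsets`, and `resolve` are hypothetical and written for illustration only. This is not the code from #5877 or #5557, and it assumes the string bytes are already contiguous in one buffer with known offsets.

```rust
/// Build one 16-byte view for `value`, which is assumed to live at `offset`
/// inside the data buffer numbered `buffer_index`. (Hypothetical helper.)
fn make_view(value: &[u8], buffer_index: u32, offset: u32) -> u128 {
    let len = value.len() as u32;
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&len.to_le_bytes());
    if value.len() <= 12 {
        // Short values are stored inline in the view, zero padded.
        bytes[4..4 + value.len()].copy_from_slice(value);
    } else {
        // Long values keep a 4-byte prefix plus (buffer index, offset);
        // the string bytes themselves stay in the shared buffer.
        bytes[4..8].copy_from_slice(&value[..4]);
        bytes[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        bytes[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(bytes)
}

/// Turn a list of value boundaries (`offsets[i]..offsets[i + 1]`) over one
/// shared buffer into views. Only the views are allocated here.
fn views_from_offsets(buffer: &[u8], offsets: &[usize], buffer_index: u32) -> Vec<u128> {
    offsets
        .windows(2)
        .map(|w| make_view(&buffer[w[0]..w[1]], buffer_index, w[0] as u32))
        .collect()
}

/// Read a value back from its view (copies into a Vec for simplicity).
fn resolve(view: u128, buffer: &[u8]) -> Vec<u8> {
    let len = (view as u32) as usize; // lowest 32 bits hold the length
    if len <= 12 {
        view.to_le_bytes()[4..4 + len].to_vec()
    } else {
        let offset = ((view >> 96) as u32) as usize; // highest 32 bits
        buffer[offset..offset + len].to_vec()
    }
}
```

In this sketch, the intermediate `offsets` slice is the extra bookkeeping: a decoder that emitted views as it walked the encoded data could skip that list entirely, which is one place further gains might come from.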

Describe the solution you'd like
It may be worth looking at the StringViewDecoding to see if there is more performance to be had.

Specifically, we can use the arrow_array_reader/StringViewArray and related benchmarks to profile and identify any additional potential improvements.

Describe alternatives you've considered
It may be good enough now

Additional context

Labels

arrow (Changes to the arrow crate), parquet (Changes to the parquet crate)
