Consider implementing some sort of deduplicate / intern functionality for StringView #5910

Closed
@alamb

Description

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Part of implementing StringView #5374

@XiangpengHao implemented gc which compacts all the strings in a StringView/BinaryView into contiguous storage in #5513

However, that functionality does not deduplicate/intern the strings -- it just copies them over
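To make the distinction concrete, here is a minimal sketch of compaction without deduplication. It does not use the arrow crate: `View` and `compact` are hypothetical stand-ins, and the model ignores the fact that real views store short strings (12 bytes or fewer) inline. It only illustrates why copying every referenced value into contiguous storage leaves duplicates duplicated.

```rust
/// Hypothetical stand-in for a view: which buffer holds the bytes,
/// plus the offset and length of the value inside that buffer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct View {
    buffer: usize,
    offset: usize,
    len: usize,
}

/// Copy every referenced value into one new contiguous buffer, in view order.
/// Equal values are copied once per view, so nothing is shared afterwards.
fn compact(views: &[View], buffers: &[Vec<u8>]) -> (Vec<u8>, Vec<View>) {
    let mut out = Vec::new();
    let mut new_views = Vec::with_capacity(views.len());
    for v in views {
        let bytes = &buffers[v.buffer][v.offset..v.offset + v.len];
        // The new view points at the position where the copy is about to land.
        new_views.push(View { buffer: 0, offset: out.len(), len: v.len });
        out.extend_from_slice(bytes);
    }
    (out, new_views)
}
```

With three views that all refer to the same 19-byte value, this sketch produces a 57-byte buffer: unreferenced bytes are gone, but the value is stored three times.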

Describe the solution you'd like

We should make it easy to deduplicate the strings in a StringView.

I don't think we should change gc to do deduplication without an explicit ask (as deduplication is expensive)

Describe alternatives you've considered

  1. Do nothing (users can implement their own version of this code without any additional APIs)
  2. Add a new function (e.g. GenericBinaryView::dedupe) that deduplicates such arrays (likely not moving any strings, just updating views)
  3. Add an argument to GenericBinaryView::gc that controls the behavior (so callers could also request deduplication when doing gc)
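As a rough illustration of alternative 2, the sketch below deduplicates by rewriting views only, without moving any string data. It is not the arrow API: `View` and `dedupe_views` are hypothetical names, the model is std-only, and inlined short strings are ignored. Each value is hashed once, which is the cost that makes deduplication expensive relative to a plain copy.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a view: which buffer holds the bytes,
/// plus the offset and length of the value inside that buffer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct View {
    buffer: usize,
    offset: usize,
    len: usize,
}

/// Rewrite views so that equal values all point at the first occurrence.
/// No bytes are copied; only the views change.
fn dedupe_views(views: &[View], buffers: &[Vec<u8>]) -> Vec<View> {
    // Maps the bytes of each value to the first view that referenced them.
    let mut first_seen: HashMap<&[u8], View> = HashMap::new();
    views
        .iter()
        .map(|v| {
            let bytes = &buffers[v.buffer][v.offset..v.offset + v.len];
            // Reuse the earlier view if this value was seen before,
            // otherwise record this view as the canonical one.
            *first_seen.entry(bytes).or_insert(*v)
        })
        .collect()
}
```

After this pass, views holding the same value share one location. In this simplified model, a later compaction step would then only need to copy each distinct value once; whether that is how a real implementation should be structured is left open here.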

Additional context
@alexwilcoxson-rel asked in #5904 (comment)

Can/will this incorporate deduping/interning/implicitly using the gc function that landed recently?

Metadata

Labels

arrow (Changes to the arrow crate), enhancement (Any new improvement worthy of a entry in the changelog)