Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
This is part of the larger project to implement StringViewArray (see #5374).
In #5481 we added support for StringViewArray and ByteViewArray.
This ticket tracks adding a gc method to StringViewArray and ByteViewArray.
After calling filter or take on a StringViewArray or ByteViewArray, the backing variable-length buffer may be much larger than necessary to store the results.
So before gc, an array may look like the following, with significant "garbage" space:
                                        ┌──────┐
                                        │......│
                                        │......│
┌────────────────────┐        ┌ ─ ─ ─ ▶ │Data1 │  Large buffer
│       View 1       │─ ─ ─ ─           │......│  with data that
├────────────────────┤                  │......│  is not referred
│       View 2       │─ ─ ─ ─ ─ ─ ─ ─▶  │Data2 │  to by View 1 or
└────────────────────┘                  │......│  View 2
                                        │......│
   2 views, refer to                    │......│
  small portions of a                   └──────┘
     large buffer
After gc, it should look like the following:
┌────────────────────┐                  ┌─────┐  After gc, only
│       View 1       │─ ─ ─ ─ ─ ─ ─ ─▶  │Data1│  data that is
├────────────────────┤        ┌ ─ ─ ─ ▶ │Data2│  pointed to by
│       View 2       │─ ─ ─ ─           └─────┘  the views is
└────────────────────┘                           left

        2 views
Describe the solution you'd like
I would like to add a method called StringViewArray::gc (and ByteViewArray::gc) that will compact the backing buffers so that they hold only the data the views still refer to.
I expect users of the arrow crates to invoke this function, not any of the arrow kernels themselves.
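To make the intended behavior concrete, here is a minimal sketch of the compaction on a deliberately simplified model. The names `View`, `SimpleViewArray`, and its `gc` method are hypothetical and are not the arrow API. The model also assumes a single backing buffer and no inlined short strings, whereas the real view layout packs each view into 16 bytes, inlines short values, and can reference several buffers.

```rust
/// Hypothetical, simplified stand-in for a view: just an (offset, len) pair
/// into one shared buffer. Not the real 16-byte arrow view layout.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct View {
    offset: usize,
    len: usize,
}

/// Hypothetical, simplified view array: one data buffer plus views into it.
#[derive(Debug, Clone)]
struct SimpleViewArray {
    buffer: Vec<u8>,
    views: Vec<View>,
}

impl SimpleViewArray {
    /// Returns the bytes that view `i` refers to.
    fn value(&self, i: usize) -> &[u8] {
        let v = self.views[i];
        &self.buffer[v.offset..v.offset + v.len]
    }

    /// Returns a compacted copy: the new buffer holds only the bytes that
    /// some view refers to, and every view is rewritten to point into it.
    /// Views that overlap or repeat are copied once per view in this sketch.
    fn gc(&self) -> SimpleViewArray {
        let needed: usize = self.views.iter().map(|v| v.len).sum();
        let mut buffer = Vec::with_capacity(needed);
        let mut views = Vec::with_capacity(self.views.len());
        for v in &self.views {
            // The value's new offset is the current end of the compact buffer.
            let offset = buffer.len();
            buffer.extend_from_slice(&self.buffer[v.offset..v.offset + v.len]);
            views.push(View { offset, len: v.len });
        }
        SimpleViewArray { buffer, views }
    }
}
```

In this model, an array with two 5-byte views into a 1000-byte buffer would come back from `gc` with a 10-byte buffer, and `value(i)` would return the same bytes before and after.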
Describe alternatives you've considered
We could also add the gc functionality as its own standalone kernel (e.g. kernels::gc) rather than a method on the array.
Additional context
This GC is what is described in https://pola.rs/posts/polars-string-type/:
"What I consider the biggest downside is that we have to do garbage collection. When we gather/filter from an array with allocated long strings, we might keep strings alive that are not used anymore. This requires us to use some heuristics to determine when we will do a garbage collection pass on the string column. And because they are heuristics, sometimes they will be unneeded."
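Neither this ticket nor the quoted post specifies which heuristic to use. As one illustration of what a caller-side check could look like, the sketch below compacts only when the referenced bytes are a small fraction of the buffer. The function name `should_gc` and the `min_live_fraction` parameter are hypothetical, and the threshold value is left to the caller.

```rust
/// Hypothetical heuristic: suggest a gc pass when the fraction of the buffer
/// that the views still refer to drops below `min_live_fraction`.
fn should_gc(referenced_bytes: usize, buffer_bytes: usize, min_live_fraction: f64) -> bool {
    // An empty buffer has nothing to reclaim.
    if buffer_bytes == 0 {
        return false;
    }
    (referenced_bytes as f64) / (buffer_bytes as f64) < min_live_fraction
}
```

With a threshold of 0.5, for example, an array whose views cover 10 bytes of a 1000-byte buffer would be compacted, while one covering 900 bytes would be left alone.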