Add gc garbage collector support for StringViewArray and BinaryViewArray #5513

Closed
@alamb

Description

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
This is part of the larger project to implement StringViewArray -- see #5374

In #5481 we added support for StringViewArray and ByteViewArray.

This ticket tracks adding a gc method to StringViewArray and ByteViewArray.

After calling filter or take on a StringViewArray or ByteViewArray, the backing variable-length buffer may be much larger than necessary to store the results.

Before gc, an array may look like the following, with significant "garbage" space:

                                       ┌──────┐                 
                                       │......│                 
                                       │......│                 
┌────────────────────┐       ┌ ─ ─ ─ ▶ │Data1 │   Large buffer  
│       View 1       │─ ─ ─ ─          │......│  with data that 
├────────────────────┤                 │......│ is not referred 
│       View 2       │─ ─ ─ ─ ─ ─ ─ ─▶ │Data2 │ to by View 1 or 
└────────────────────┘                 │......│      View 2     
                                       │......│                 
   2 views, refer to                   │......│                 
  small portions of a                  └──────┘                 
     large buffer                                               
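
To make the problem concrete, here is a minimal sketch of why filtering leaves garbage behind. `ToyViewArray` and `View` are hypothetical, simplified stand-ins invented for this illustration, not the arrow-rs types: a view here is just an offset and length into one shared buffer, and the inlining of short strings that real views perform is omitted.

```rust
use std::rc::Rc;

/// Simplified stand-in for a view: where a value lives inside the shared buffer.
#[derive(Clone, Copy, Debug, PartialEq)]
struct View {
    offset: usize,
    len: usize,
}

/// Hypothetical toy model of a view array backed by a single shared data buffer.
struct ToyViewArray {
    views: Vec<View>,
    buffer: Rc<Vec<u8>>,
}

impl ToyViewArray {
    /// Append every value to one buffer and record a view for each.
    fn from_strs(values: &[&str]) -> Self {
        let mut buffer = Vec::new();
        let mut views = Vec::new();
        for v in values {
            views.push(View { offset: buffer.len(), len: v.len() });
            buffer.extend_from_slice(v.as_bytes());
        }
        Self { views, buffer: Rc::new(buffer) }
    }

    /// Read the string that view `i` points at.
    fn value(&self, i: usize) -> &str {
        let v = self.views[i];
        std::str::from_utf8(&self.buffer[v.offset..v.offset + v.len]).unwrap()
    }

    /// Filtering copies only the selected views; the data buffer is shared unchanged.
    fn filter(&self, keep: &[bool]) -> Self {
        let views: Vec<View> = self
            .views
            .iter()
            .zip(keep)
            .filter(|(_, k)| **k)
            .map(|(v, _)| *v)
            .collect();
        Self { views, buffer: Rc::clone(&self.buffer) }
    }

    /// Total number of bytes the views actually point at.
    fn referenced_bytes(&self) -> usize {
        self.views.iter().map(|v| v.len).sum()
    }
}

fn main() {
    let values: Vec<String> = (0..1000)
        .map(|i| format!("long string value number {:04}", i))
        .collect();
    let refs: Vec<&str> = values.iter().map(|s| s.as_str()).collect();
    let arr = ToyViewArray::from_strs(&refs);

    // Keep only the first and last rows.
    let mut keep = vec![false; refs.len()];
    keep[0] = true;
    keep[999] = true;
    let filtered = arr.filter(&keep);

    println!(
        "views: {}, referenced bytes: {}, buffer bytes kept alive: {}",
        filtered.views.len(),
        filtered.referenced_bytes(),
        filtered.buffer.len()
    );
}
```

In this model the filtered array holds two views but still keeps the entire original buffer alive, which is the situation the diagram depicts.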
                                                                
                                                                

After gc, it should look like:

┌────────────────────┐                 ┌─────┐    After gc, only 
│       View 1       │─ ─ ─ ─ ─ ─ ─ ─▶ │Data1│     data that is  
├────────────────────┤       ┌ ─ ─ ─ ▶ │Data2│    pointed to by  
│       View 2       │─ ─ ─ ─          └─────┘     the views is  
└────────────────────┘                                 left      
                                                                 
                                                                 
        2 views                                                  
                                                                 
                                                                 
                                                                 

Describe the solution you'd like
I would like to add a method called StringViewArray::gc (and ByteViewArray::gc) that will compact the backing buffers so that they hold only the data referenced by the views.

I expect users of the arrow crates to invoke this function, not any of the arrow kernels themselves.
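
As a rough illustration of what such a compaction could do, here is a sketch on a toy model. Everything in it is an assumption made for the example: `ToyViewArray`, `View`, and this `gc` are hypothetical and are not the arrow-rs API or its memory layout (real views are packed 16-byte values that also inline short strings).

```rust
/// Simplified view: which buffer holds the value, and where inside it.
#[derive(Clone, Copy, Debug, PartialEq)]
struct View {
    buffer_index: usize,
    offset: usize,
    len: usize,
}

/// Hypothetical toy model of a view array with any number of data buffers.
struct ToyViewArray {
    views: Vec<View>,
    buffers: Vec<Vec<u8>>,
}

impl ToyViewArray {
    /// Bytes that view `i` points at.
    fn value(&self, i: usize) -> &[u8] {
        let v = self.views[i];
        &self.buffers[v.buffer_index][v.offset..v.offset + v.len]
    }

    /// Total size of all data buffers, referenced or not.
    fn buffer_bytes(&self) -> usize {
        self.buffers.iter().map(Vec::len).sum()
    }

    /// Sketch of `gc`: copy only the referenced bytes into one fresh buffer
    /// and rewrite each view to point into it. Values shared by several
    /// views are copied once per view in this simple version.
    fn gc(&self) -> ToyViewArray {
        let needed: usize = self.views.iter().map(|v| v.len).sum();
        let mut compact = Vec::with_capacity(needed);
        let mut views = Vec::with_capacity(self.views.len());
        for i in 0..self.views.len() {
            let bytes = self.value(i);
            views.push(View { buffer_index: 0, offset: compact.len(), len: bytes.len() });
            compact.extend_from_slice(bytes);
        }
        ToyViewArray { views, buffers: vec![compact] }
    }
}

fn main() {
    // A large buffer in which only two small regions are referenced.
    let mut buffer = vec![b'.'; 1000];
    buffer[100..105].copy_from_slice(b"Data1");
    buffer[500..505].copy_from_slice(b"Data2");

    let arr = ToyViewArray {
        views: vec![
            View { buffer_index: 0, offset: 100, len: 5 },
            View { buffer_index: 0, offset: 500, len: 5 },
        ],
        buffers: vec![buffer],
    };

    let compacted = arr.gc();
    println!("before gc: {} buffer bytes", arr.buffer_bytes());
    println!("after gc:  {} buffer bytes", compacted.buffer_bytes());
}
```

The sketch returns a new array rather than mutating in place, so the caller decides when to pay the copying cost, in line with the expectation that users invoke gc explicitly.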

Describe alternatives you've considered

We could also add the gc functionality as its own standalone kernel (e.g. kernels::gc) rather than a method on the array.

Additional context
This GC is what is described in https://pola.rs/posts/polars-string-type/:

"What I consider the biggest downside is that we have to do garbage collection. When we gather/filter from an array with allocated long strings, we might keep strings alive that are not used anymore. This requires us to use some heuristics to determine when we will do a garbage collection pass on the string column. And because they are heuristics, sometimes they will be unneeded."
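
One possible shape for such a heuristic is sketched below, purely as an illustration. The function name `should_gc`, its parameters, and the thresholds are all hypothetical; neither this issue nor the Polars post prescribes them.

```rust
/// Hypothetical heuristic: compact only when the buffers are big enough to
/// matter and less than `min_utilization` of their bytes are still referenced.
fn should_gc(
    referenced_bytes: usize,
    buffer_bytes: usize,
    min_buffer_bytes: usize,
    min_utilization: f64,
) -> bool {
    if buffer_bytes < min_buffer_bytes {
        // Small buffers: the copy is unlikely to pay for itself.
        return false;
    }
    (referenced_bytes as f64) < (buffer_bytes as f64) * min_utilization
}

fn main() {
    // 10 bytes referenced out of a 1 MiB buffer: compacting frees almost everything.
    println!("{}", should_gc(10, 1 << 20, 4096, 0.5));
    // Tiny buffer: skipped regardless of utilization.
    println!("{}", should_gc(10, 1000, 4096, 0.5));
    // Mostly used buffer: a gc pass would copy a lot and free little.
    println!("{}", should_gc(900_000, 1 << 20, 4096, 0.5));
}
```

As the quote notes, any rule of this kind will sometimes trigger a pass that was not needed, which is one argument for leaving the decision to the caller.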

Labels: enhancement
