Request: a way to copy a Row to Rows #4466

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I am implementing GroupByHash in DataFusion (apache/datafusion#4973).

We use the row format (arrow_row) to store grouping keys, which is awesome.

The grouping operation calculates the Row format for each input row, determines whether that row has been seen previously, and if not, stores the newly seen Row. The only way I know of today to store it is to copy each new row individually using Row::owned():

┌──────────────────────────────────┐                                                            
│ ┌───────────────────────────────┐│                                                            
│ │               A               ││                                                            
│ ├───────────────────────────────┤│                                                            
│ │               B               │├────────────┐                                               
│ ├───────────────────────────────┤│            │                                               
│ │               A               ││            │                                               
│ ├───────────────────────────────┤│            │                                               
│ │               A               ││            │           ┌──────────────────────────────────┐
│ ├───────────────────────────────┤│            │           │ ┌───────────────────────────────┐│
│ │               C               ││            │           │ │               A               ││
│ ├───────────────────────────────┤│            │           │ └───────────────────────────────┘│
│ │               B               ││            │           │ ┌───────────────────────────────┐│
│ ├───────────────────────────────┤│            └───────────┼▶│               B               ││
│ │               A               ││                        │ └───────────────────────────────┘│
│ ├───────────────────────────────┤│  to add a new row, I   │                                  │
│ │               A               ││  currently do          │                                  │
│ └───────────────────────────────┘│  `Row::owned()` to     │                                  │
│  group keys for input batch      │  get a copy            │   distinct group keys seen in    │
│  often many repeated values      │                        │   previous batches               │
│                                  │                        │                                  │
└──────────────────────────────────┘                        └──────────────────────────────────┘
                                                                                                
     arrow_row::Rows                                         Vec<arrow_row::OwnedRow>           
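
To make the cost concrete, here is a minimal sketch of the current pattern using only std types. It is not the arrow_row API: `OwnedKey` (a `Box<[u8]>`) is a hypothetical stand-in for `OwnedRow`, byte slices stand in for `Row`, and `group_with_owned_keys` is an illustrative name. The linear search is only there to keep the sketch short; a real implementation would use a hash table.

```rust
/// Hypothetical stand-in for `arrow_row::OwnedRow`:
/// every stored key owns its own heap allocation.
type OwnedKey = Box<[u8]>;

/// Groups the rows of one input batch.
/// Returns the distinct keys in first-seen order and the group index of each input row.
fn group_with_owned_keys(batch: &[&[u8]]) -> (Vec<OwnedKey>, Vec<usize>) {
    let mut distinct: Vec<OwnedKey> = Vec::new();
    let mut group_ids = Vec::with_capacity(batch.len());

    for &row in batch {
        // Linear scan for brevity; `&**k` views the boxed key as a `&[u8]`.
        let seen = distinct.iter().position(|k| &**k == row);
        let id = match seen {
            Some(id) => id,
            None => {
                // One separate allocation per newly seen key,
                // analogous to calling `Row::owned()`.
                distinct.push(Box::from(row));
                distinct.len() - 1
            }
        };
        group_ids.push(id);
    }

    (distinct, group_ids)
}
```

With the batch from the diagram (A, B, A, A, C, B, A, A), this stores three keys and performs three separate key allocations, one per distinct group.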
                                                                                                

Describe the solution you'd like
I would like to be able to append a Row directly to a Rows:

┌──────────────────────────────────┐                                                            
│ ┌───────────────────────────────┐│                                                            
│ │               A               ││                                                            
│ ├───────────────────────────────┤│                                                            
│ │               B               │├────────────┐                                               
│ ├───────────────────────────────┤│            │                                               
│ │               A               ││            │                                               
│ ├───────────────────────────────┤│            │                                               
│ │               A               ││            │           ┌──────────────────────────────────┐
│ ├───────────────────────────────┤│            │           │ ┌───────────────────────────────┐│
│ │               C               ││            │           │ │               A               ││
│ ├───────────────────────────────┤│            │           │ ├───────────────────────────────┤│
│ │               B               ││            └───────────┼▶│               B               ││
│ ├───────────────────────────────┤│                        │ └───────────────────────────────┘│
│ │               A               ││                        │                                  │
│ ├───────────────────────────────┤│  Copying a new Row     │                                  │
│ │               A               ││  would just copy       │                                  │
│ └───────────────────────────────┘│  some bytes to the     │                                  │
│  group keys for input batch      │  other Rows            │   distinct group keys seen in    │
│  often many repeated values      │                        │   previous batches               │
│                                  │                        │                                  │
└──────────────────────────────────┘                        └──────────────────────────────────┘
                                                                                                
   arrow_row::Rows                                            arrow_row::Rows                   
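
A rough sketch of what the requested behavior could look like, again with std types only. `FlatRows`, its `push` method, and `group_with_flat_rows` are hypothetical names for illustration, not existing arrow_row API, and the layout (one contiguous buffer plus row offsets) is an assumption made for the sketch.

```rust
/// Hypothetical stand-in for a `Rows` that supports appending:
/// all keys live in one contiguous buffer.
struct FlatRows {
    buffer: Vec<u8>,
    /// Row `i` occupies `buffer[offsets[i]..offsets[i + 1]]`.
    offsets: Vec<usize>,
}

impl FlatRows {
    fn new() -> Self {
        Self { buffer: Vec::new(), offsets: vec![0] }
    }

    fn num_rows(&self) -> usize {
        self.offsets.len() - 1
    }

    fn row(&self, i: usize) -> &[u8] {
        &self.buffer[self.offsets[i]..self.offsets[i + 1]]
    }

    /// Appending a row just copies its bytes to the end of the shared buffer.
    fn push(&mut self, row: &[u8]) {
        self.buffer.extend_from_slice(row);
        self.offsets.push(self.buffer.len());
    }
}

/// Same grouping loop as before, but new keys are appended to `FlatRows`
/// instead of being boxed individually. Linear search is used for brevity.
fn group_with_flat_rows(batch: &[&[u8]]) -> (FlatRows, Vec<usize>) {
    let mut distinct = FlatRows::new();
    let mut group_ids = Vec::with_capacity(batch.len());

    for &row in batch {
        let seen = (0..distinct.num_rows()).find(|&i| distinct.row(i) == row);
        let id = match seen {
            Some(id) => id,
            None => {
                distinct.push(row);
                distinct.num_rows() - 1
            }
        };
        group_ids.push(id);
    }

    (distinct, group_ids)
}
```

In this model, adding a key no longer allocates per row; the two vectors grow with amortized reallocation, so storing many distinct keys needs far fewer allocations than one box per key.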
                                                                                                

Describe alternatives you've considered

Currently my POC code uses Vec<OwnedRow>, which adds an extra allocation for each row 😢

Additional context
apache/datafusion#4973
