Closed
Labels
arrow: Changes to the arrow crate
enhancement: Any new improvement worthy of an entry in the changelog
Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I am implementing GroupByHash in DataFusion (apache/datafusion#4973).
We use the `RowFormat` to store grouping keys, which is awesome.
The grouping operation calculates the `Row` format for each input row, determines whether that row has been seen previously, and if not, stores the newly seen `Row`. The only way I know of today to store it is to copy each new row individually using `owned()`:
┌──────────────────────────────────┐
│ ┌───────────────────────────────┐│
│ │               A               ││
│ ├───────────────────────────────┤│
│ │               B               │├────────────┐
│ ├───────────────────────────────┤│            │
│ │               A               ││            │
│ ├───────────────────────────────┤│            │
│ │               A               ││            │           ┌──────────────────────────────────┐
│ ├───────────────────────────────┤│            │           │ ┌───────────────────────────────┐│
│ │               C               ││            │           │ │               A               ││
│ ├───────────────────────────────┤│            │           │ └───────────────────────────────┘│
│ │               B               ││            │           │ ┌───────────────────────────────┐│
│ ├───────────────────────────────┤│            └───────────┼▶│               B               ││
│ │               A               ││                        │ └───────────────────────────────┘│
│ ├───────────────────────────────┤│   to add a new row, I  │                                  │
│ │               A               ││   currently do         │                                  │
│ └───────────────────────────────┘│   `Row::owned()` to    │                                  │
│  group keys for input batch      │   get a copy           │  distinct group keys seen in     │
│  often many repeated values      │                        │  previous batches                │
│                                  │                        │                                  │
└──────────────────────────────────┘                        └──────────────────────────────────┘
          arrow_row::Rows                                         Vec<arrow_row::OwnedRow>
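To make the cost in the diagram concrete, here is a minimal std-only sketch of the pattern. It does not use the `arrow_row` API: `OwnedKeys`, `intern`, and `group_batch` are hypothetical names, encoded rows are modelled as plain byte slices, and `Box<[u8]>` stands in for an owned row copy. The lookup is a deliberately naive linear scan so that the storage pattern stays in focus.

```rust
/// Hypothetical stand-in for `Vec<arrow_row::OwnedRow>`: every distinct key
/// lives in its own heap allocation. Not the arrow_row API.
#[derive(Default)]
struct OwnedKeys {
    keys: Vec<Box<[u8]>>,
}

impl OwnedKeys {
    /// Returns the group index for `row`, storing a copy if the row is new.
    /// The linear scan keeps the sketch short; a real operator would hash.
    fn intern(&mut self, row: &[u8]) -> usize {
        if let Some(idx) = self.keys.iter().position(|k| &**k == row) {
            return idx;
        }
        // One separate allocation per newly seen key, analogous to the
        // per-row copy described above.
        self.keys.push(row.to_vec().into_boxed_slice());
        self.keys.len() - 1
    }
}

/// Maps each encoded row of an input batch to its group index.
fn group_batch(keys: &mut OwnedKeys, batch: &[&[u8]]) -> Vec<usize> {
    batch.iter().map(|row| keys.intern(row)).collect()
}
```

With the batch from the diagram (A, B, A, A, C, B, A, A), only three keys are stored, but each of the three costs its own allocation.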
Describe the solution you'd like
I would like to be able to append a `Row` directly to a `Rows`:
┌──────────────────────────────────┐
│ ┌───────────────────────────────┐│
│ │               A               ││
│ ├───────────────────────────────┤│
│ │               B               │├────────────┐
│ ├───────────────────────────────┤│            │
│ │               A               ││            │
│ ├───────────────────────────────┤│            │
│ │               A               ││            │           ┌──────────────────────────────────┐
│ ├───────────────────────────────┤│            │           │ ┌───────────────────────────────┐│
│ │               C               ││            │           │ │               A               ││
│ ├───────────────────────────────┤│            │           │ ├───────────────────────────────┤│
│ │               B               ││            └───────────┼▶│               B               ││
│ ├───────────────────────────────┤│                        │ └───────────────────────────────┘│
│ │               A               ││                        │                                  │
│ ├───────────────────────────────┤│   Copying a new Row    │                                  │
│ │               A               ││   would just copy      │                                  │
│ └───────────────────────────────┘│   some bytes to the    │                                  │
│  group keys for input batch      │   other Rows           │  distinct group keys seen in     │
│  often many repeated values      │                        │  previous batches                │
│                                  │                        │                                  │
└──────────────────────────────────┘                        └──────────────────────────────────┘
          arrow_row::Rows                                             arrow_row::Rows
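The requested behaviour could look roughly like the following std-only sketch. This is an assumption about the shape of the feature, not the `arrow_row` API: `FlatRows`, `push`, `row`, and `intern` are hypothetical names, and the layout (one shared byte buffer plus an offsets array) is only a loose model of a contiguous row container. As before, the lookup is a naive linear scan to keep the example short.

```rust
/// Hypothetical flat row store: all rows share one byte buffer, and
/// `offsets[i]..offsets[i + 1]` delimits row `i`. Not the arrow_row API.
struct FlatRows {
    buffer: Vec<u8>,
    offsets: Vec<usize>,
}

impl FlatRows {
    fn new() -> Self {
        Self { buffer: Vec::new(), offsets: vec![0] }
    }

    /// Number of rows currently stored.
    fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Borrows the bytes of row `i`.
    fn row(&self, i: usize) -> &[u8] {
        &self.buffer[self.offsets[i]..self.offsets[i + 1]]
    }

    /// Appends a row by copying its bytes onto the end of the shared buffer.
    /// No per-row allocation; the buffer only reallocates when it grows.
    fn push(&mut self, row: &[u8]) {
        self.buffer.extend_from_slice(row);
        self.offsets.push(self.buffer.len());
    }

    /// Returns the group index for `row`, appending it if it is new.
    /// The linear scan is for brevity; a real operator would hash.
    fn intern(&mut self, row: &[u8]) -> usize {
        if let Some(idx) = (0..self.len()).find(|&i| self.row(i) == row) {
            return idx;
        }
        self.push(row);
        self.len() - 1
    }
}
```

In this model, adding a distinct group key is an `extend_from_slice` plus one offset push, so the allocation cost is amortized across all keys instead of being paid once per key.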
Describe alternatives you've considered
Currently my POC code uses `Vec<OwnedRow>`, which adds an extra allocation for each row 😢
Additional context
apache/datafusion#4973