I've spent some time digging into the HDF5 conversion API and it seems like it actually works! As in, we can force it to "understand" Rust string types and convert back and forth. Given the painful experience with strings and arrays (#86, #47, #85), this could be a huge win in usability.
The same can be done with varlen/fixed arrays/strings (direct conversions to/from &[T], Vec<T>, String, &str, etc.).
Price to pay: extra memory allocation. If the dataset is not chunked, it will (at some point in the conversion path) use double the required memory. If it is chunked, I think it will process it chunk by chunk so the cost could be negligible.
There are many details to consider and discuss; this is just a start and an experiment. Details below.
aldanor commented on Aug 6, 2020
Test file:
aldanor commented on Aug 6, 2020
Prototype:
aldanor commented on Aug 6, 2020
Output:
aldanor commented on Aug 6, 2020
TLDR: we have an HDF5 dataset with type |S26 and we read it directly into a Vec<String> and it sort of seems to work.
aldanor commented on Aug 6, 2020
@magnusuMET There you go as promised ^ 😄
aldanor commented on Aug 6, 2020
Just verified, the conversion routine indeed runs chunk by chunk. So, if you're converting a dataset with 1K strings but the chunk size is 100, you will allocate memory for at most 1100 strings at a time (this is the advantage over a "read all, then convert" approach).
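The arithmetic behind that number can be sketched roughly as below. This is only bookkeeping over element counts, not HDF5's conversion code: both function names are made up for illustration, and it assumes a single chunk-sized conversion buffer lives alongside the full output.

```rust
/// Peak number of string slots alive at once when the conversion runs
/// chunk by chunk: the full output plus one chunk-sized buffer.
/// (Hypothetical helper, not part of HDF5 or hdf5-rust.)
fn peak_chunked(total: usize, chunk: usize) -> usize {
    // A chunk can never hold more elements than the dataset itself.
    total + chunk.min(total)
}

/// Peak for a "read all, then convert" approach: the raw data and the
/// converted output both exist in full at the same time.
/// (Hypothetical helper as well.)
fn peak_read_all(total: usize) -> usize {
    2 * total
}
```

With 1000 strings and a chunk size of 100, `peak_chunked(1000, 100)` gives the 1100 mentioned above, while `peak_read_all(1000)` gives 2000.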
magnusuMET commented on Aug 7, 2020
@aldanor That is some really great stuff! So it sort of acts as an in-place conversion? Nasty trick of copying the layout of the String 👍
aldanor commented on Aug 7, 2020
Yea, it is in-place in the sense that the String body (pretty hefty, 24B) is generated in place, obviously not the heap data it points to.
One could argue it's not the most efficient way of doing things, but given that it allows you to map directly to Rust types, I think convenience outweighs everything else. Typically, if you want performance, you won't be using strings at all in the first place :)
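To make the "body in place, heap data elsewhere" distinction concrete, here is a small sketch in plain Rust. It does not touch HDF5 at all; `decode_fixed` is a hypothetical helper, and it assumes NUL-padded fixed-length source strings such as |S26.

```rust
use std::mem::size_of;

/// Size of the `String` body itself (pointer, capacity, length):
/// three machine words, i.e. 24 bytes on 64-bit targets.
/// The characters themselves live on the heap.
fn string_body_size() -> usize {
    size_of::<String>()
}

/// Hypothetical helper: turn one fixed-length, NUL-padded record into
/// an owned `String`, allocating heap storage for its contents.
fn decode_fixed(raw: &[u8]) -> String {
    // Stop at the first NUL byte, or take the whole record if there is none.
    let end = raw.iter().position(|&b| b == 0).unwrap_or(raw.len());
    // Invalid UTF-8 is replaced rather than rejected in this sketch.
    String::from_utf8_lossy(&raw[..end]).into_owned()
}
```

Only the fixed-size body would be written into the destination slot; the text that `decode_fixed` produces is a separate heap allocation per string.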
Note also that this would automatically work for structs as well: any String field wrapped in a struct or array would automatically be decoded in place.
aldanor commented on Jan 27, 2021
Just to add to the above so I don't forget: we could totally do something like that (I could probably take that up once the dust settles over the current blockers), BUT this will require splitting H5Type into H5Read and H5Write. I.e., you can write &str or String but you can only read String.
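A rough sketch of what such a split could look like follows. Only the trait names H5Read and H5Write come from the comment above; the methods, their signatures, and the two free functions are invented for illustration and do not mirror the crate's actual API.

```rust
/// Hypothetical write-side trait: anything that can expose bytes to store.
trait H5Write {
    fn to_h5_bytes(&self) -> Vec<u8>;
}

/// Hypothetical read-side trait: only types that can own the decoded data.
trait H5Read: Sized {
    fn from_h5_bytes(raw: &[u8]) -> Self;
}

// Borrowed strings can be written...
impl<'a> H5Write for &'a str {
    fn to_h5_bytes(&self) -> Vec<u8> {
        self.as_bytes().to_vec()
    }
}

// ...and so can owned strings.
impl H5Write for String {
    fn to_h5_bytes(&self) -> Vec<u8> {
        self.as_bytes().to_vec()
    }
}

// Only the owned type can be read: a `&str` has nowhere to keep the
// freshly decoded data, so it deliberately gets no `H5Read` impl.
impl H5Read for String {
    fn from_h5_bytes(raw: &[u8]) -> Self {
        String::from_utf8_lossy(raw).into_owned()
    }
}

/// Hypothetical generic entry points bounded by the two traits.
fn write_value<T: H5Write>(value: &T) -> Vec<u8> {
    value.to_h5_bytes()
}

fn read_value<T: H5Read>(raw: &[u8]) -> T {
    T::from_h5_bytes(raw)
}
```

With these bounds, `write_value(&"abc")` and `write_value(&String::from("abc"))` both compile, while `read_value::<&str>(..)` is rejected at compile time because `&str` does not implement the read-side trait.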