Reading / writing #14
I was once interested in this library for exactly that purpose. It happens that the code for reading from an existing dataset is in a separate branch, and is currently not documented. It might also be incomplete for certain data types. Maybe the owner could use some help. We might be able to join forces around this functionality and document the working parts for future users.
@Enet4 @cathalgarvey Hi guys! @Enet4 is correct, the type mapping is in a separate branch (types). I've rewritten it at least 3 times and it's pretty much complete -- it supports all builtin types, tuple types, fixed-size array types, and all 4 kinds of strings (fixed/varlen, ascii/utf-8 -- this already makes it support more atomic types than e.g. pytables / h5py), plus it uses procedural macros to bind structs and nested structs. This way, most of the super-low-level stuff that deals with types is already done. The next one is the Dataset API. I have a few implementations, but I don't like either of them too much, for different reasons. The ideal thing would be to get rid of all traits altogether. Yes, I could definitely use some help -- I have no problem implementing things, and I know the HDF5 core C API very well; most of the pain is in decision making.
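For readers following along, here's a hedged sketch of what such a derive-based struct binding could look like (the `#[derive(H5Type)]` name is inferred from the `H5Type` bound that appears later in this thread; the exact attributes are an assumption):

```rust
// Hypothetical example: mapping a nested struct to an HDF5 compound type
// through a procedural derive macro. #[repr(C)] pins the memory layout so
// that the generated HDF5 datatype can match it field-for-field.
#[derive(H5Type)]
#[repr(C)]
struct Point {
    x: f64,
    y: f64,
}

#[derive(H5Type)]
#[repr(C)]
struct Record {
    position: Point, // nested compound type
    id: i64,
    flags: [u8; 4],  // fixed-size array member
}
```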
Hmm, so let's think...
I would imagine having an implementation of
But yes, I suppose if the above can't be done, a more relaxed trait could be used.
I don't know much about how the
I'm not sure I follow, would you care to expand?
Basically yes, interface inheritance based on the parameter type. Currently, there are a bunch of traits like:

```rust
trait ID { fn id(&self) -> hid_t; }
trait FromID {}

// note that these traits are "dummy", they don't have any methods to implement
trait Object: ID {}
trait Location: Object {}
trait Container: Location {}

struct Group { id: hid_t }
impl ID for Group { fn id(&self) -> hid_t { self.id } }
impl Object for Group {}
impl Location for Group {}
impl Container for Group {}

struct PropertyList { id: hid_t }
impl ID for PropertyList { fn id(&self) -> hid_t { self.id } }
impl Object for PropertyList {}
```

There are a few problems with trait-based inheritance here:
There are definitely more problems, but these are the ones that come to mind first. This just feels like misusing Rust's traits (given that most of them have empty interfaces anyway -- they're just "dummy"). What I wanted to do is inheritance based on the type parameter, so that some of the problems listed above go away, as shown here: rust-lang/rust#32077 (comment). And it kind of worked, were it not for the rustdoc woes.
// I've opened a bounty on bountysource for the rustdoc issue, who knows, maybe someone will get to fixing it :)
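For illustration, a minimal self-contained sketch of type-parameter-based "inheritance" (names and layout here are hypothetical, not taken from the actual branch):

```rust
use std::marker::PhantomData;

// Zero-sized marker types standing in for the HDF5 object classes:
struct ObjectClass;
struct LocationClass;
struct GroupClass;

// A single generic handle type; the "class" lives in the type parameter,
// so there's no zoo of dummy traits to import.
struct Handle<C> {
    id: i64, // hid_t
    _class: PhantomData<C>,
}

// Methods shared by every class go on the generic type...
impl<C> Handle<C> {
    fn id(&self) -> i64 {
        self.id
    }
}

// ...while class-specific methods go on concrete instantiations.
impl Handle<GroupClass> {
    fn group_only_method(&self) {
        // e.g. wrap a group-specific HDF5 call using self.id()
    }
}

type Group = Handle<GroupClass>;
```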
I am also interested in an example of how to read a dataset. My hdf5 dataset is really simple, just some groups with matrices inside. I am still a Rust beginner and could not find out how to do it just by reading the documentation. Could you please provide a minimal example?
It's funny to find out someone's still tracking this :) Yep, after a long pause I've resumed work on the crate (I need it too, and I need it working soon, myself!), you can see the current work in https://github.com/aldanor/hdf5-rs/tree/feature/2018. The crate now fully supports HDF5 1.10.0 and 1.10.1, too, on all platforms. As a matter of fact, I already have a local branch working that can read N-dimensional datasets into both Vec and ndarray arrays. The type system is already done, including the procedural macros for deriving H5Type. The object system discussed earlier in this thread is done too, via a Deref-based hierarchy. I understand everyone's eager to have something they can use, but so am I, and I'm trying to juggle development on this crate with personal life on a daily basis, as you can observe from the commit history...
Hi @aldanor. We (@10XGenomics) are quite interested in Rust HDF5 support. We'd be happy to test / contribute to high-level Read/Write implementations whenever you're ready to share your branch, even if it's in a raw form. The new branch looks great so far!
Hi folks, here's a topic for some brainstorming and bikeshedding; apologies in advance for a huge wall of text :) If you are interested in speeding this up, any ideas are definitely welcome.

I'm currently trying to figure out the most ergonomic / least annoying while at the same time type-safe way of representing reads and writes. I have code that successfully reads n-dimensional arrays, but it needs to be wrapped nicely. For simplicity, let's say we're just talking about full-dataset/attribute reads and writes here.

How do we turn this into an API? In the current temporary prototype, I have something like this:

```rust
// both Attribute and Dataset dereference to Container
// Container dereferences into Location
impl Container {
    pub fn read_vec<T: H5Type>(&self) -> Result<Vec<T>> { ... }
    pub fn read_1d<T: H5Type>(&self) -> Result<Array1<T>> { ... }
    pub fn read_2d<T: H5Type>(&self) -> Result<Array2<T>> { ... }
    pub fn read_scalar<T: H5Type>(&self) -> Result<T> { ... }
    pub fn read_dyn<T: H5Type>(&self) -> Result<ArrayD<T>> { ... }
    pub fn read_arr<T: H5Type, D: Dimension>(&self) -> Result<Array<T, D>> { ... }
    // also all write_*() methods
}
```

When reading, it would check that the conversion path stored->memory is a noop (that is, it would create an HDF5 datatype from the in-memory type and verify that the stored datatype converts to it for free).

There are a few drawbacks here, which leads to the possibility of having a typed container:

```rust
impl Container {
    pub fn read_vec<T: H5Type>(&self) -> Result<Vec<T>> { ... }
    pub fn of_type<T: H5Type>(&self) -> Result<TypedContainer<T>> { ... } // noop
    pub fn cast_soft<T: H5Type>(&self) -> Result<TypedContainer<T>> { ... }
    // etc
}

impl<T: H5Type> TypedContainer<T> {
    pub fn read_vec(&self) -> Result<Vec<T>> { ... }
    // etc
}
```

... but this approach has its own problems. Then I was thinking, maybe it could be done builder-style? E.g., hypothetically, something like...

```rust
let _ = dataset.read::<T>()?.vec()?; // read() returns ReadContainer
let _ = dataset.read_soft::<T>()?.arr_2d()?;
```

As for writes, until trait specialization lands, we can't provide blanket impls, so:

```rust
let _ = dataset.write::<T>()?.scalar(&x)?; // write() returns WriteContainer
let _ = dataset.write_soft::<T>()?.arr(&arr)?;
let _ = dataset.write_hard::<T>()?.slice(&vec)?;
```

For writes, there's a ton of other options like compression etc.; need to think of an ergonomic way to support those (they are currently supported only in the DatasetBuilder). At the very least, it should be possible to reshape the dataset on the fly -- e.g. if you provide the data in a slice/Vec but want to store it as a 2-D array, maybe something like:

```rust
let _ = dataset.write::<T>()?.shape((10, 20)).slice(&vec)?;
```

Last thing to note: the methods above could coexist with their alternative forms, e.g. you could have all of these:

```rust
let _ = group.read_vec::<T>("foo")?;
let _ = group.dataset("foo")?.read_vec::<T>()?;
let _ = group.dataset("foo")?.read::<T>()?.vec()?;
```

Something like that... Anyways, there are tons of ways to approach this, but we need to pick one :)
Actually, pinging @bluss too -- hope he doesn't mind :) (since we're planning to use ndarray as the core backend here anyway)
Cool, thanks for the background!
So to clarify, the proposal is that
One simple change would be to call the method
I really like the current
@pmarks Thanks for the response. I'll try to address your points below.
Yes, that is correct (i.e., "noop" = noop only, "hard" = noop|hard, "soft" = noop|hard|soft). It could also be possible to provide a way to pass the conversion as an enum value (since the enum type exists anyway).
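A hedged guess at what passing the conversion as a value could look like (variant and method names here are assumptions based on the noop/hard/soft terminology above):

```rust
// Hypothetical: conversion strictness as an explicit enum argument instead
// of separate read()/read_hard()/read_soft() method families.
pub enum Conversion {
    NoOp, // only bit-identical (free) conversions allowed
    Hard, // noop or "hard" (compiled, lossless) conversions
    Soft, // noop, hard, or "soft" (possibly lossy) conversions
}

// e.g. something like:
// let arr = dataset.read_as::<f64>(Conversion::Soft)?;
```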
Come to think of it, that's my experience too. In my many years of dealing with HDF5 in all possible languages, contexts, and platforms, I can't think of a case where I had both a read and a write on neighboring lines of code. It's typically either one or the other.
This might be one way, yea, hmm... Definitely heading into bikeshedding territory, but then I think you would need to add "read_"/"write_" prefixes back to its methods, like dataset.get_reader::<T>()?.read_arr()? (so it reads more or less like an English sentence; kind of like the currently proposed "dataset ... read ... array"). But then you have "read" twice, so it ends up being more verbose :) Need to ponder a bit on this.
Correct. Post-creation, the only two classes of options you can configure are dataset-access and data-transfer plists. Note also that it should be possible to augment the DatasetBuilder with write functionality (kind of like in h5py, where you can provide the data up front when creating a dataset).
Me too. But folks have already been asking above re: "whether datasets can be iterated to avoid loading them all into memory" etc., and I'd expect people to have very different use cases, so it'd be nice for the implementation to be flexible enough to suit most of them.
As the owner of another multidimensional-array-oriented Rust library (nifti), I faced a similar situation when it comes to converting data elements from the original, "stored" data type to a Rust "memory" type. Its current stance is that elements are automatically converted in a way that reduces precision errors as much as possible (there's a scalar affine transformation in-between), but ultimately, the user should know what to expect ahead of time.
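For context, that affine transformation maps each stored element through a slope/intercept pair (NIfTI's scl_slope / scl_inter header fields); a minimal sketch of the per-element conversion:

```rust
// Hypothetical element conversion with an affine rescale: a stored i16
// voxel is widened to f32 and rescaled, which avoids losing precision.
fn convert_element(stored: i16, scl_slope: f32, scl_inter: f32) -> f32 {
    stored as f32 * scl_slope + scl_inter
}
```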
We may interpret this as an adaptation of the receiver for reading or writing purposes. If we see this as a separate construct for reading and writing, then what's been suggested by @pmarks makes sense, but it is more conventional (C-GETTER) to name these methods reader and writer, without the get_ prefix. The same naming conventions can be applied to the other ideas above.
So like this?
One question is what type owns the H5D
Option 1 is incompatible with the above signature of
Sorry, I wasn't clear: partial loading of data slices is important to us. On-the-fly changing of the number of dimensions (i.e. reshaping) feels like it could be deferred, as long as decisions about the simpler API don't close off options for a reshaping interface. We'll be happy to test / iterate on designs when you're ready to share an initial cut!
Agreed. Reader/Writer is quite common terminology in stdlib as well.
I don't think that
Well, all the low-level ID stuff has already been implemented; each object holds an internal handle. Options:
While writing all this, I've just realized: do we really need the writer to be typed? Or do we need it at all? (If we do, we're basically not making use of Rust's type inference at all, while we could be.) So, instead of writing

```rust
dataset.write_hard::<i32>()?.write_slice(&vec)?;
```

Why not just...

```rust
dataset.as_writer_hard().write_slice(&vec)?;
```

Or

```rust
dataset.write_slice_hard(&vec)?;
```

(etc.)

With reads... I've just checked, it looks like in the simplest cases type inference works, i.e. you can do stuff like this:

```rust
let data: Vec<i32> = dataset.as_reader()?.read_vec()?;
```

which is pretty neat. The same could work directly on the dataset, without an explicit reader:

```rust
let data: Vec<i32> = dataset.read_vec()?;
```

(and for hard/soft conversions it would be much more verbose, but those will likely be used much more rarely)

Slightly off-topic here, but in regards to
As a minor side detail, I'd probably be inclined to add a

Another minor note, if we even keep them at all,

All in all, thanks for the responses, this helps a lot :)
Ok guys, reporting on some Christmas progress 😄
OK, I grabbed one of my own hdf5 files and managed to read it in Rust. 🎉 A bit of feedback:
@Enet4 Very nice, thanks! Will try to address your points below.
IIRC this was pretty much copied from h5py directly without giving it too much thought. I guess it would be nice to have some of the API resemble h5py by not introducing anything too cryptic, so it would be an easier transition for hdf5 users.

That being said, I already have a bullet point on my todo list about reworking the File constructor, e.g. enums and builders instead of strings (same goes for driver modes), to be more Rust-like; also maybe splitting open/create options, etc. So, currently it's like this (not saying it's ideal):

```rust
let f = h5::File::open("foo.h5", "r")?;  // Read-only, file must exist
let f = h5::File::open("foo.h5", "r+")?; // Read/write, file must exist
let f = h5::File::open("foo.h5", "w")?;  // Create file, truncate if exists
let f = h5::File::open("foo.h5", "w-")?; // Create file, fail if exists (or "x")
let f = h5::File::open("foo.h5", "a")?;  // Read/write if exists, create otherwise
```

If we change it to be exactly like std::fs::OpenOptions, we get:

```rust
let f = h5::File::open("foo.h5")?;
let f = h5::File::with_options().write(true).open("foo.h5")?;
let f = h5::File::with_options().create(true).write(true).truncate(true).open("foo.h5")?;
let f = h5::File::with_options().create_new(true).write(true).open("foo.h5")?;
let f = h5::File::with_options().create(true).write(true).append(true).open("foo.h5")?;
```

which hurt my eyes a bit TBH 😿 The first option becomes nice, but everything else is super ugly. I guess you could try simplifying the latter a bit given that there's a limited set of options here, and there's no concept of write-only in HDF5, so read is always implied. Also, all options except "r" are writable, out of which all except one create a new file. Maybe self-explanatory aliases like this? (open/create ctors exist in fs::File as well)
Maybe both.
Not gonna argue here of course :) (If/when this whole thing gets stable, I was planning to start a little mdbook on how to use this, with more general concerns like layouts and thread-safety, things not to do etc; along with some examples -- it's pretty hard to fit stuff like that into docstrings).
Yep, I know... any other solutions are welcome, really, but I've been banging my head against the wall for a long time before switching to the deref-based hierarchy. There were many different prototypes; the one before this (in master) is trait-based, which is also quite horrible, as the functionality gets split between tons of different traits that have to be in scope, and it leads to many other ugly problems.

What's nice about the deref-based approach: (1) for example, you can pass an object anywhere its deref target (or the target's target, etc.) is expected, without importing any traits.

What's not nice, mostly cosmetics/docs: (1) rustdoc doesn't go more than one level deep into deref impls; this is a known problem, but unfortunately no one wants to spend time fixing it; (2) some IDEs like IntelliJ Rust may sometimes have problems with deep derefs. Other than that, although we're breaking some conventions here, I think it leads to a better end-user experience (and the low-level stuff is pretty hardcore anyway, so a deref sure isn't the worst of it).
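A minimal self-contained sketch of the idea (simplified; the field names here are made up, not the crate's actual layout):

```rust
use std::ops::Deref;

// Dataset derefs to Container, which derefs to Location: a &Dataset can be
// passed wherever a &Container or &Location is expected, and all of the
// parent type's methods are visible on the child -- no trait imports needed.
struct Location { id: i64 }
struct Container { location: Location }
struct Dataset { container: Container }

impl Deref for Container {
    type Target = Location;
    fn deref(&self) -> &Location { &self.location }
}

impl Deref for Dataset {
    type Target = Container;
    fn deref(&self) -> &Container { &self.container }
}

fn print_id(loc: &Location) { println!("{}", loc.id); }

fn main() {
    let ds = Dataset { container: Container { location: Location { id: 42 } } };
    print_id(&ds); // two-level deref coercion
}
```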
Hmm... but the container module is public?
TBH I never really used that part myself, maybe because I typically have boxes with tons of memory :) But I've seen some h5py docs about it (the "slicing" / indexing bit, not the streaming). So yea, right now you have to fetch the whole thing. How would the API for chunked/streamed reading look, hypothetically, and which HDF5 API calls would we be wrapping? (And does streamed reading make sense when there's by-chunk compression?)
Re: file constructors, I was thinking: in all my experience with HDF5, I don't think I have ever passed anything but a hard-coded string literal as the mode. With this in mind, maybe five pre-defined constructors for File indeed make sense? (With the mode strings thrown out, and without any support in the builder, so as not to clutter it.) This means the builder would also need five finalisers, I guess. Some of the builder's options only make sense when creating a file, so it's not entirely clean, but that's already a problem, and std::fs::OpenOptions suffers from it as well, so it's kinda normal.

// Also, eventually, I'd like to split
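A hedged sketch of what those five constructors could look like (the names are guesses mirroring the mode strings above, not a settled API):

```rust
// Hypothetical mode-string-free constructors, one per mode:
impl File {
    pub fn open(path: &str) -> Result<File> { /* "r" */ ... }
    pub fn open_rw(path: &str) -> Result<File> { /* "r+" */ ... }
    pub fn create(path: &str) -> Result<File> { /* "w" */ ... }
    pub fn create_excl(path: &str) -> Result<File> { /* "w-" / "x" */ ... }
    pub fn append(path: &str) -> Result<File> { /* "a" */ ... }
}
```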
I would format them like this:

```rust
let f = h5::File::with_options()
    .create(true)
    .write(true)
    .append(true)
    .open("foo.h5")?;
```

Can't say it's very ugly in this form, but we're certainly on subjective grounds.
Right... But
Well, this is the fun part! To the best of my knowledge, there is no public attempt at something of this sort in Rust. I have a similar feature in mind for nifti. However, we do know that h5py dataset objects are proxies to a disk-backed array. When slicing is invoked, only the respective chunks of data need to be fetched from the file, only then resulting in an in-memory object of that slice. Caching is likely to be involved as well.
I feel that this form of reading makes even more sense when by-chunk compression is involved: it often means that we're dealing with very large data, so we don't want to keep it all in memory. The fact that h5py supports this might even be one of the reasons why some people keep using HDF5.
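To make the idea concrete, here's a purely hypothetical shape for such an API (these method names don't exist in the crate; under the hood, each call would select a hyperslab in the file dataspace via H5Sselect_hyperslab and read just that region with H5Dread):

```rust
// Hypothetical sketch, using ndarray's s![] slicing syntax:
use ndarray::{s, Array1, Array2};

// read a single row without loading the whole dataset
let row: Array1<f32> = dataset.read_slice_1d(s![7, ..])?;
// read a 64x64 tile out of a large 2-D dataset
let tile: Array2<f32> = dataset.read_slice_2d(s![0..64, 0..64])?;
```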
We can give that a try and see if it works out. My only concern is that a string-based mode is not very idiomatic.
Doh, yea, apologies -- I've re-exported those at the root now (along with the Conversion enum), pushed already. There are probably a few more 'hidden' types here and there now. Reason being: I was planning to add pure-reexport modules later at the crate root, aggregating types from different modules into categories.
Yea, I guess :) We'll sure get to it, but after there's a stable release with all the basics working. Syntax-wise, we may borrow some from ndarray, maybe even borrow their indexing machinery, traits and all.
Deal.

Another thing about the API for creating datasets -- a very common pattern in h5py is to create a dataset with data, instead of creating an empty dataset with a given shape and then writing to it later. It would definitely be nice to be able to do that; will need to give it some thought and rework the DatasetBuilder. The way h5py does it is pretty smart -- to avoid cluttering your file in case there's a write failure, it creates an anonymous dataset and writes to it, and only links it when the write succeeds.

Currently it's like this:

```rust
// named
ds = group.new_dataset::<T>().some().options().create("foo", arr.shape())?;
ds.write(&arr)?;

// anonymous
ds = group.new_dataset::<T>().some().options().create(None, arr.shape())?;
ds.write(&arr)?;
```

Note how we have to specify the shape explicitly, which is kind of redundant... Maybe:

```rust
// named
group.create_dataset::<T>("foo").some().options().empty()?;
group.create_dataset::<T>("foo").some().options().write(&arr)?;

// anonymous
group.create_dataset::<T>(None).some().options().empty()?;
group.create_dataset::<T>(None).some().options().write(&arr)?;
```

What we do have here, though, is an absolute lack of any type inference -- while most (almost all) of the time, when creating a new dataset and immediately writing data to it, we'll want the datatypes to match. So, to cover this most common case, we could let the finalizer infer both the element type and the shape from the data:

```rust
// named
group.create_dataset("foo").some().options().write(&arr)?;

// anonymous
group.create_dataset(None).some().options().write(&arr)?;
```

So now we don't have to provide the type and we don't have to provide the shape -- both are inferred from the view. In the simplest "hello world" kind of case, when you're just creating a dataset and writing a vector of stuff to it, without configuring much, it would look like this, which is quite friendly, I think:

```rust
group.create_dataset("foo").write(&vec)?;
```
There's obviously further work required (and more bikeshedding), but basic multi-dimensional dataset reading/writing has been implemented and tested, so I think this can be closed for now.
This may be a stupid question, but I see the Dataset.read method being used in your example code in issues like #9 and I can't actually find documentation, or source code, for it. :)
I guess examples of use would be very helpful, as in #9, but I thought at least I could find the code to help me understand things! I have lots of questions, about what kinds of structured-types can be stored in datasets, whether datasets can be iterated to avoid loading them all into memory, etc. etc.; happy to answer these questions myself if I know where to look.
Thanks for what appears to be a very complete and powerful library!