-
Notifications
You must be signed in to change notification settings - Fork 429
RecordReader for TFRecords #240
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google. ℹ️ Googlers: Go here for more info. |
BTW I'm a Googler. Do I still need to sign this with my new email address? |
} | ||
/// Convert the Reader into an Iterator<Item = Result<Vec<u8>, RecordReadError>, which iterates | ||
/// the whole file. | ||
pub fn into_iter_owned(self) -> impl Iterator<Item = Result<Vec<u8>, RecordReadError>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To my knowledge, without GATs, there is no way to write an iterator that yields items that borrow from an internal buffer (so no way to reuse a buffer). Let me know if I'm missing something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I think my knowledge on Rust and TF is not enough yet to give an good answer. Do you perhaps have some other people you can ask for a good review? (Or perhaps we should ask in the community?)
Also I don't have much time recently. Otherwise I will try give a closer look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I understand, we do need GATs before we can write iterators that return a reference to internal state, but there is a workaround of returning e.g. an Rc<RefCell<Vec<u8>>>
. The downsides are the runtime overhead (which is probably minimal in this case compared to reading the data, performing the CRC check, etc) and the fact that users could hold onto the internal buffer for longer than intended and modify it, which is a much larger concern. The mutability concern could be addressed by having RecordOwnedIterator
return a new type, e.g. RecordOwnedIteratorItem
struct which implements Borrow<[u8]>
and contains the Rc<RefCell<Vec<u8>>>
internally, but I suspect there's no way to control the lifetime without using actual references and lifetimes.
All in all, I'm happy with the signature as-is, especially since users can always use read_next
directly.
I registered my github account with Google using the corp self-service tool, but that didn't make this bot happy. |
CLAs look good, thanks! ℹ️ Googlers: Go here for more info. |
I squashed these commits, added usage examples in the comments, and changed the email to something that makes cla happy. Please take another look at this PR. Oh, and I consolidated |
src/io.rs
Outdated
/// Err(e) => { warn!("{:?}", e); break } | ||
/// } | ||
/// } | ||
pub fn read_next(&mut self, buf: &mut [u8]) -> Result<Option<u64>, RecordReadError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is problematic if you don't know the maximum record size with certainty, because if the buffer is too small, it can't be retried with a larger buffer, because the record length has already been read. Either RecordReader could use a BufReader internally and do an initial non-consuming read to get the record length, so that the record size could still be read if buf
is too small and the function needs to be read again, or it could track the record size explicitly in its state if it has been consumed but the data hasn't, or this function could take a &mut Vec<u8>
rather than &mut [u8]
so it could be grown as necessary to hold the data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch.
I went with the following approach:
pub fn peek_next_len()
, which tells the length of the next record, and can be used to resize a heap buffer in advance of theread_next
call. Rather than wrap into aBufReader
internally, I addedSeek
bounds, giving control to the caller. In most cases they will be passingBufReader
s anyway.read_next()
skips records which are longer than the supplied buffer. A subsequent call resumes on the next record in the file. If the user doesn't know the max size, they can peek first.
} | ||
/// Convert the Reader into an Iterator<Item = Result<Vec<u8>, RecordReadError>, which iterates | ||
/// the whole file. | ||
pub fn into_iter_owned(self) -> impl Iterator<Item = Result<Vec<u8>, RecordReadError>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I understand, we do need GATs before we can write iterators that return a reference to internal state, but there is a workaround of returning e.g. an Rc<RefCell<Vec<u8>>>
. The downsides are the runtime overhead (which is probably minimal in this case compared to reading the data, performing the CRC check, etc) and the fact that users could hold onto the internal buffer for longer than intended and modify it, which is a much larger concern. The mutability concern could be addressed by having RecordOwnedIterator
return a new type, e.g. RecordOwnedIteratorItem
struct which implements Borrow<[u8]>
and contains the Rc<RefCell<Vec<u8>>>
internally, but I suspect there's no way to control the lifetime without using actual references and lifetimes.
All in all, I'm happy with the signature as-is, especially since users can always use read_next
directly.
d1d1e3c
to
a71d663
Compare
return Err(RecordReadError::CorruptFile); | ||
} | ||
return Err(RecordReadError::CorruptRecord); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be simplified to:
if !len_ok {
Err(RecordReadError::CorruptFile)
} else if !bytes_ok {
Err(RecordReadError::CorruptRecord)
} else {
Ok(Some(len as usize))
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heh. I think I was a little tired last night.
Looks good, thanks a lot! |
Thanks for the great reviews |
Thank you both very much on this 😄 |
I guess I left the work unfinished when I submitted RecordWriter a few years ago. I was hoping some upstream issues would be resolved (e.g., GATs for efficient iteration, and I was expecting the Read trait to change to deal with the uninitialized buffers problem).
I think this is the best that can be done for right now, and it's better to have something than nothing.