Skip to content

RecordReader for TFRecords #240

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Mar 6, 2020
Merged

RecordReader for TFRecords #240

merged 2 commits into from
Mar 6, 2020

Conversation

masonk
Copy link
Contributor

@masonk masonk commented Feb 18, 2020

I guess I left the work unfinished when I submitted RecordWriter a few years ago. I was hoping some upstream issues would be resolved (e.g., GATs for efficient iteration, and I was expecting the Read trait to change to deal with the uninitialized buffers problem).

I think this is the best that can be done for right now, and it's better to have something than nothing.

@googlebot
Copy link

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

@masonk
Copy link
Contributor Author

masonk commented Feb 18, 2020

BTW I'm a Googler. Do I still need to sign this with my new email address?

}
/// Convert the Reader into an Iterator<Item = Result<Vec<u8>, RecordReadError>, which iterates
/// the whole file.
pub fn into_iter_owned(self) -> impl Iterator<Item = Result<Vec<u8>, RecordReadError>> {
Copy link
Contributor Author

@masonk masonk Feb 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To my knowledge, without GATs, there is no way to write an iterator that yields items that borrow from an internal buffer (so no way to reuse a buffer). Let me know if I'm missing something.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I think my knowledge on Rust and TF is not enough yet to give an good answer. Do you perhaps have some other people you can ask for a good review? (Or perhaps we should ask in the community?)

Also I don't have much time recently. Otherwise I will try give a closer look.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I understand, we do need GATs before we can write iterators that return a reference to internal state, but there is a workaround of returning e.g. an Rc<RefCell<Vec<u8>>>. The downsides are the runtime overhead (which is probably minimal in this case compared to reading the data, performing the CRC check, etc) and the fact that users could hold onto the internal buffer for longer than intended and modify it, which is a much larger concern. The mutability concern could be addressed by having RecordOwnedIterator return a new type, e.g. RecordOwnedIteratorItem struct which implements Borrow<[u8]> and contains the Rc<RefCell<Vec<u8>>> internally, but I suspect there's no way to control the lifetime without using actual references and lifetimes.

All in all, I'm happy with the signature as-is, especially since users can always use read_next directly.

@masonk
Copy link
Contributor Author

masonk commented Feb 18, 2020

I registered my github account with Google using the corp self-service tool, but that didn't make this bot happy.

@googlebot
Copy link

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

@masonk
Copy link
Contributor Author

masonk commented Feb 24, 2020

I squashed these commits, added usage examples in the comments, and changed the email to something that makes cla happy. Please take another look at this PR.

Oh, and I consolidated InvalidLengthChecksum and InvalidContentChecksum into InvalidChecksum, as I couldn't imagine a scenario in which a user would do something different between those cases.

src/io.rs Outdated
/// Err(e) => { warn!("{:?}", e); break }
/// }
/// }
pub fn read_next(&mut self, buf: &mut [u8]) -> Result<Option<u64>, RecordReadError> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is problematic if you don't know the maximum record size with certainty, because if the buffer is too small, it can't be retried with a larger buffer, because the record length has already been read. Either RecordReader could use a BufReader internally and do an initial non-consuming read to get the record length, so that the record size could still be read if buf is too small and the function needs to be read again, or it could track the record size explicitly in its state if it has been consumed but the data hasn't, or this function could take a &mut Vec<u8> rather than &mut [u8] so it could be grown as necessary to hold the data.

Copy link
Contributor Author

@masonk masonk Mar 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.

I went with the following approach:

  • pub fn peek_next_len(), which tells the length of the next record, and can be used to resize a heap buffer in advance of the read_next call. Rather than wrap into a BufReader internally, I added Seek bounds, giving control to the caller. In most cases they will be passing BufReaders anyway.
  • read_next() skips records which are longer than the supplied buffer. A subsequent call resumes on the next record in the file. If the user doesn't know the max size, they can peek first.

}
/// Convert the Reader into an Iterator<Item = Result<Vec<u8>, RecordReadError>, which iterates
/// the whole file.
pub fn into_iter_owned(self) -> impl Iterator<Item = Result<Vec<u8>, RecordReadError>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I understand, we do need GATs before we can write iterators that return a reference to internal state, but there is a workaround of returning e.g. an Rc<RefCell<Vec<u8>>>. The downsides are the runtime overhead (which is probably minimal in this case compared to reading the data, performing the CRC check, etc) and the fact that users could hold onto the internal buffer for longer than intended and modify it, which is a much larger concern. The mutability concern could be addressed by having RecordOwnedIterator return a new type, e.g. RecordOwnedIteratorItem struct which implements Borrow<[u8]> and contains the Rc<RefCell<Vec<u8>>> internally, but I suspect there's no way to control the lifetime without using actual references and lifetimes.

All in all, I'm happy with the signature as-is, especially since users can always use read_next directly.

@masonk masonk force-pushed the master branch 2 times, most recently from d1d1e3c to a71d663 Compare March 4, 2020 07:28
return Err(RecordReadError::CorruptFile);
}
return Err(RecordReadError::CorruptRecord);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be simplified to:

if !len_ok {
  Err(RecordReadError::CorruptFile)
} else if !bytes_ok {
  Err(RecordReadError::CorruptRecord)
} else {
  Ok(Some(len as usize))
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh. I think I was a little tired last night.

@adamcrume adamcrume merged commit 58f961f into tensorflow:master Mar 6, 2020
@adamcrume
Copy link
Contributor

Looks good, thanks a lot!

@masonk
Copy link
Contributor Author

masonk commented Mar 7, 2020

Thanks for the great reviews

@liufuyang
Copy link

Thank you both very much on this 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants