RecordReader for TFRecords #240

masonk · 2020-02-18T03:02:31Z

I guess I left the work unfinished when I submitted RecordWriter a few years ago. I was hoping some upstream issues would be resolved (e.g., GATs for efficient iteration, and I was expecting the Read trait to change to deal with the uninitialized buffers problem).

I think this is the best that can be done for right now, and it's better to have something than nothing.

googlebot · 2020-02-18T03:02:35Z

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it. If the bot doesn't comment, it means it doesn't think anything has changed.

masonk · 2020-02-18T03:19:33Z

BTW I'm a Googler. Do I still need to sign this with my new email address?

masonk · 2020-02-18T03:27:47Z

src/io.rs

+    }
+    /// Convert the Reader into an Iterator<Item = Result<Vec<u8>, RecordReadError>>, which iterates
+    /// the whole file.
+    pub fn into_iter_owned(self) -> impl Iterator<Item = Result<Vec<u8>, RecordReadError>> {


To my knowledge, without GATs, there is no way to write an iterator that yields items that borrow from an internal buffer (so no way to reuse a buffer). Let me know if I'm missing something.
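For context, here is a minimal sketch of the kind of buffer-reusing iterator that GATs make expressible (they were stabilized later, in Rust 1.65). The trait LendingIterator and the type LendingRecords are hypothetical names, not part of this PR, and the framing is simplified to an 8-byte little-endian length followed by the payload, with the checksums left out.

```rust
use std::io::{self, Read};

/// Hypothetical lending-iterator trait: each item may borrow from `self`.
pub trait LendingIterator {
    type Item<'a>
    where
        Self: 'a;
    fn next(&mut self) -> Option<Self::Item<'_>>;
}

/// Hypothetical reader that reuses one internal buffer for every record.
pub struct LendingRecords<R: Read> {
    reader: R,
    buf: Vec<u8>,
}

impl<R: Read> LendingRecords<R> {
    pub fn new(reader: R) -> Self {
        LendingRecords { reader, buf: Vec::new() }
    }
}

impl<R: Read> LendingIterator for LendingRecords<R> {
    type Item<'a> = io::Result<&'a [u8]> where Self: 'a;

    fn next(&mut self) -> Option<Self::Item<'_>> {
        // Simplified framing (an assumption): u64 LE length, then payload.
        let mut prefix = [0u8; 8];
        match self.reader.read_exact(&mut prefix) {
            Ok(()) => {}
            // EOF while reading the prefix ends iteration in this sketch.
            Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return None,
            Err(e) => return Some(Err(e)),
        }
        let len = u64::from_le_bytes(prefix) as usize;
        self.buf.resize(len, 0);
        if let Err(e) = self.reader.read_exact(&mut self.buf) {
            return Some(Err(e));
        }
        // The returned slice borrows the internal buffer until the next call.
        Some(Ok(&self.buf[..]))
    }
}
```

Because this is not std's Iterator, it cannot be used in a for loop; callers would drive it with while let Some(rec) = it.next().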

Hmm, I don't think I know enough about Rust and TF yet to give a good answer. Do you perhaps have some other people you can ask for a good review? (Or perhaps we should ask in the community?)

Also, I haven't had much time recently. Otherwise I would try to take a closer look.

From what I understand, we do need GATs before we can write iterators that return a reference to internal state, but there is a workaround of returning e.g. an Rc<RefCell<Vec<u8>>>. The downsides are the runtime overhead (which is probably minimal in this case compared to reading the data, performing the CRC check, etc) and the fact that users could hold onto the internal buffer for longer than intended and modify it, which is a much larger concern. The mutability concern could be addressed by having RecordOwnedIterator return a new type, e.g. RecordOwnedIteratorItem struct which implements Borrow<[u8]> and contains the Rc<RefCell<Vec<u8>>> internally, but I suspect there's no way to control the lifetime without using actual references and lifetimes.

All in all, I'm happy with the signature as-is, especially since users can always use read_next directly.
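To make the trade-off concrete, here is a rough sketch of the Rc<RefCell<Vec<u8>>> workaround described above. All names (SharedRecord, SharedBufIter) are hypothetical, and the framing is simplified to an 8-byte little-endian length plus payload with no checksums. Note that safe code cannot implement Borrow<[u8]> on top of a RefCell, since it can only hand out a guard, so this sketch exposes a Ref<[u8]> instead.

```rust
use std::cell::{Ref, RefCell};
use std::io::{self, Read};
use std::rc::Rc;

/// Hypothetical item type: read-only access to the shared buffer.
pub struct SharedRecord {
    buf: Rc<RefCell<Vec<u8>>>,
}

impl SharedRecord {
    /// Returns a guard over the bytes currently in the shared buffer.
    pub fn bytes(&self) -> Ref<'_, [u8]> {
        Ref::map(self.buf.borrow(), |v| v.as_slice())
    }
}

/// Hypothetical iterator that reuses one buffer across all records.
pub struct SharedBufIter<R: Read> {
    reader: R,
    buf: Rc<RefCell<Vec<u8>>>,
}

impl<R: Read> SharedBufIter<R> {
    pub fn new(reader: R) -> Self {
        SharedBufIter { reader, buf: Rc::new(RefCell::new(Vec::new())) }
    }
}

impl<R: Read> Iterator for SharedBufIter<R> {
    type Item = io::Result<SharedRecord>;

    fn next(&mut self) -> Option<Self::Item> {
        // Simplified framing (an assumption): u64 LE length, then payload.
        let mut prefix = [0u8; 8];
        match self.reader.read_exact(&mut prefix) {
            Ok(()) => {}
            Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return None,
            Err(e) => return Some(Err(e)),
        }
        let len = u64::from_le_bytes(prefix) as usize;
        // Panics if a caller still holds a guard from `bytes()`.
        let mut buf = self.buf.borrow_mut();
        buf.resize(len, 0);
        if let Err(e) = self.reader.read_exact(buf.as_mut_slice()) {
            return Some(Err(e));
        }
        drop(buf);
        Some(Ok(SharedRecord { buf: Rc::clone(&self.buf) }))
    }
}
```

The wrapper prevents callers from mutating the buffer, but not from keeping an old item around: after the next call to next(), an earlier SharedRecord silently shows the new record's bytes, which is the lifetime concern raised above.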

masonk · 2020-02-18T03:32:25Z

I registered my github account with Google using the corp self-service tool, but that didn't make this bot happy.

googlebot · 2020-02-24T04:16:51Z

CLAs look good, thanks!

masonk · 2020-02-24T04:20:46Z

I squashed these commits, added usage examples in the comments, and changed the email to something that makes cla happy. Please take another look at this PR.

Oh, and I consolidated InvalidLengthChecksum and InvalidContentChecksum into InvalidChecksum, as I couldn't imagine a scenario in which a user would do something different between those cases.

adamcrume · 2020-02-29T22:08:45Z

src/io.rs

+    ///         Err(e) => { warn!("{:?}", e); break }
+    ///     }
+    /// }
+    pub fn read_next(&mut self, buf: &mut [u8]) -> Result<Option<u64>, RecordReadError> {


This method is problematic if you don't know the maximum record size with certainty: if the buffer is too small, the call can't be retried with a larger buffer, because the record length has already been consumed. I see three options. RecordReader could use a BufReader internally and do an initial non-consuming read to get the record length, so that the length can still be read if buf is too small and the function needs to be called again. It could track the record size explicitly in its state when the length has been consumed but the data hasn't. Or this function could take a &mut Vec<u8> rather than &mut [u8], so the buffer could be grown as necessary to hold the data.
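As an illustration of the second option (tracking the consumed length in the reader's state), a sketch might look like the following. RetryableReader and ReadError are hypothetical names, not the types in this PR, and the framing is simplified to an 8-byte little-endian length plus payload with no checksums.

```rust
use std::io::{self, Read};

#[derive(Debug)]
pub enum ReadError {
    /// The buffer cannot hold the next record; retry with `needed` bytes.
    BufferTooSmall { needed: usize },
    Io(io::Error),
}

pub struct RetryableReader<R: Read> {
    reader: R,
    /// Length of a record whose prefix was consumed but whose payload was not.
    pending_len: Option<usize>,
}

impl<R: Read> RetryableReader<R> {
    pub fn new(reader: R) -> Self {
        RetryableReader { reader, pending_len: None }
    }

    /// Reads the next record into `buf`, returning its length, or `None` at EOF.
    pub fn read_next(&mut self, buf: &mut [u8]) -> Result<Option<usize>, ReadError> {
        let len = match self.pending_len {
            // A previous call already consumed the length prefix.
            Some(len) => len,
            None => {
                let mut prefix = [0u8; 8];
                match self.reader.read_exact(&mut prefix) {
                    Ok(()) => u64::from_le_bytes(prefix) as usize,
                    Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
                    Err(e) => return Err(ReadError::Io(e)),
                }
            }
        };
        if len > buf.len() {
            // Remember the length so the caller can retry with a larger buffer.
            self.pending_len = Some(len);
            return Err(ReadError::BufferTooSmall { needed: len });
        }
        self.reader.read_exact(&mut buf[..len]).map_err(ReadError::Io)?;
        self.pending_len = None;
        Ok(Some(len))
    }
}
```

With this shape, a caller that receives BufferTooSmall can allocate needed bytes and call read_next again without losing the record.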

Good catch.

I went with the following approach:

pub fn peek_next_len(), which returns the length of the next record and can be used to resize a heap buffer in advance of the read_next call. Rather than wrapping the reader in a BufReader internally, I added Seek bounds, which gives control to the caller. In most cases callers will be passing BufReaders anyway.

read_next() skips records which are longer than the supplied buffer. A subsequent call resumes on the next record in the file. If the user doesn't know the max size, they can peek first.
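The peek-then-read pattern described above could be sketched roughly as follows. This is not the code merged in this PR: peek_len and read_record are hypothetical free functions, and the framing is simplified to an 8-byte little-endian length plus payload, with the checksums omitted.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Returns the length of the next record without consuming it, or `None` at EOF.
pub fn peek_len<R: Read + Seek>(reader: &mut R) -> io::Result<Option<u64>> {
    let start = reader.stream_position()?;
    let mut prefix = [0u8; 8];
    let outcome = reader.read_exact(&mut prefix);
    // Rewind whether or not the read succeeded, so the peek never consumes.
    reader.seek(SeekFrom::Start(start))?;
    match outcome {
        Ok(()) => Ok(Some(u64::from_le_bytes(prefix))),
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => Ok(None),
        Err(e) => Err(e),
    }
}

/// Peeks the length, grows `buf` to fit, then reads the record into it.
pub fn read_record<R: Read + Seek>(
    reader: &mut R,
    buf: &mut Vec<u8>,
) -> io::Result<Option<usize>> {
    let len = match peek_len(reader)? {
        Some(len) => len as usize,
        None => return Ok(None),
    };
    buf.resize(len, 0);
    // Skip the length prefix that the peek left unconsumed.
    reader.seek(SeekFrom::Current(8))?;
    reader.read_exact(buf.as_mut_slice())?;
    Ok(Some(len))
}
```

Requiring Seek keeps the buffering decision with the caller, at the cost of excluding non-seekable sources such as pipes.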

adamcrume · 2020-03-05T04:27:30Z

src/io.rs

+                return Err(RecordReadError::CorruptFile);
+            }
+            return Err(RecordReadError::CorruptRecord);
+        }


This could be simplified to:

if !len_ok {
    Err(RecordReadError::CorruptFile)
} else if !bytes_ok {
    Err(RecordReadError::CorruptRecord)
} else {
    Ok(Some(len as usize))
}

Heh. I think I was a little tired last night.

adamcrume · 2020-03-06T04:12:09Z

Looks good, thanks a lot!

masonk · 2020-03-07T03:09:29Z

Thanks for the great reviews

liufuyang · 2020-03-13T06:50:08Z

Thank you both very much for this 😄

masonk mentioned this pull request Feb 18, 2020

Contributing TFRecordWriter and TFExampleParser #162

Closed

masonk force-pushed the master branch from 55f1e81 to 2a437d9 Compare February 24, 2020 04:16

masonk force-pushed the master branch from 2a437d9 to c8d3f17 Compare February 24, 2020 04:24

adamcrume requested changes Feb 29, 2020

masonk force-pushed the master branch 2 times, most recently from d1d1e3c to a71d663 Compare March 4, 2020 07:28

RecordReader for TFRecords

bebc97b

masonk force-pushed the master branch from a71d663 to bebc97b Compare March 4, 2020 07:32

adamcrume requested changes Mar 5, 2020

code reviews for RecordReader

cb8b63d

adamcrume approved these changes Mar 6, 2020

adamcrume merged commit 58f961f into tensorflow:master Mar 6, 2020
