Skip to content

Contributing TFRecordWriter and TFExampleParser #162

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
masonk opened this issue May 13, 2018 · 15 comments
Closed

Contributing TFRecordWriter and TFExampleParser #162

masonk opened this issue May 13, 2018 · 15 comments

Comments

@masonk
Copy link
Contributor

masonk commented May 13, 2018

In my own project, I am using Rust to write TFRecords full of TFExamples to disk.

The code for doing this is something that might be useful to many projects that import the tensorflow crate, and I'd like to submit a PR for these utils.

However, it would be nice to have an agreement from package maintainers on the specifics of the design before I code up the CL. To give an example of outstanding questions:

  1. Do you even want RecordWriter and/or ExampleWriter in this crate?
  2. Namespace? By analogy to the Python API, what about tensorflow::io?
  3. File structure? src/io.rs?
@adamcrume
Copy link
Contributor

I like the idea, and I appreciate asking for direction before submitting a pull request. I have a few questions; please don't be scared off. :)

In response to your questions:

  1. In a general sense, yes. I think if something shows up in the standard TensorFlow API, it probably makes sense to have it in this crate. We use the C API as much as possible, though, and this isn't currently exposed through that API, so I assume this would be pure Rust? If so, it would have to come with the caveat that if this functionality is exposed through the C API later, we may switch to use that in order to ensure consistent behavior with other language bindings.
  2. That sounds fine. We do need to start organizing our namespace, which is a mess at the moment.
  3. src/io.rs sounds good to me.

Would this introduce additional dependencies? If this requires a protobuf implementation, we would want to be careful to choose the right crate. Likewise if there are any file format dependencies.

Does this include something like an Example struct? What would that look like? We currently can't express something like Map<String, Tensor> because Tensor is generic in the element type, so you couldn't have a map containing both int and float tensors, for example. We could expose something similar to the internal AnyTensor trait, and then store a Map<String, Box<AnyTensor>>.

@jhseu Do you know if record/example IO will ever be provided via the C API?

@masonk
Copy link
Contributor Author

masonk commented May 14, 2018

Yes, RecordWriter is pure Rust. Its deps are crc32c and byteorder. I have no qualms with replacing the code with externs to C, of course. I'm pretty sure the binary format of a TFRecord is stable at this point, so I think the best reason to do it would be perf.

As for creating Examples, mine are serialized using the protobuf crate. I haven't spent any time wrapping the raw protobuf gencode API . I will put some put thought into an Example API, as there are a few layers of boilerplate which kind of clutter things. And, yes, this way would entail a protobuf dep. I think it would make sense to make this an optional dep.

Regardless of whether we wrap the gencode API, I think that example.proto and feature.proto are so stable that our best option is to check in the gencode to avoid requiring a build step (and, by the way, a dependency on protoc). We'd only have to update the repo with new gencode whenever the main repository updates those protobufs which, I wager, will happen rarely. There's a lot of data sitting on disks that is in this format, so I wouldn't be surprised if it never changes.

@jhseu
Copy link

jhseu commented May 16, 2018

We won't have a C API for RecordWriter. The (uncompressed) implementation is simple enough that there isn't really a need. We already have a separate Java implementation in the ecosystem repository.

I would suggest that you design it so that it's orthogonal to tf.Examples. From a TensorFlow perspective, we support reading any arbitrary strings out of RecordWriter, so the interface should just be about reading and writing bytes.

@adamcrume
Copy link
Contributor

That sounds good.

@masonk Would you might sending an initial pull request just for the RecordWriter with support for writing strings, leaving out Example support? Once that's sorted we can hash out the details for Examples.

@masonk
Copy link
Contributor Author

masonk commented May 16, 2018

Sounds good. I think I'll get to it this weekend.

@masonk
Copy link
Contributor Author

masonk commented May 21, 2018

Would you like ExampleWriter, too?

@adamcrume
Copy link
Contributor

Yes, but I'd like a little more detail first so we can make sure we're on the same page. Which crate(s) would you add? What exactly would the public API be?

@masonk
Copy link
Contributor Author

masonk commented May 21, 2018 via email

@adamcrume
Copy link
Contributor

Sounds good. Here's fine.

@liufuyang
Copy link

One question, do we have APIs in Rust to read TFRecords? 🤔

@masonk
Copy link
Contributor Author

masonk commented Feb 17, 2020

I never submitted a record reader, and I don't see one in the repo right now.

It would be good to have one, though.

@masonk
Copy link
Contributor Author

masonk commented Feb 18, 2020

I revisited this today to remember why I never submitted a reader. Here's what I remember:

  • Reading a record can fail checksum validation or result in a short read -> need to expose an error type.
  • The efficient api is reading into a byte slice, but that's also fallible if the slice isn't long enough for the record.
  • It would be most natural to provide an iterator over the records, but no efficient implementation is possible without GATS which I think I was hoping would already be in the language by now.
  • I also think I thought that the problem with reading into uninitialized memory would be resolved by now (recent blog post about this).

@masonk
Copy link
Contributor Author

masonk commented Feb 18, 2020

Pull request for a basic reader

#240

@liufuyang
Copy link

Interesting, thanks for the fast response, that is a lot of info to take in.

Perhaps just another question so far. The first case I am using Rust to read TFRecord is probably on the serving side. As we will have some features ready in places like BigTable and when providing a user id on the backend we will get a single feature row for that user, in TRRecord formmat.

I am thinking about using Rust to build a backend to host the model and take in the TFRecord (one row) and return a model prediction to the caller.

I think with your PR #240 this can also be achieved? :)

@masonk
Copy link
Contributor Author

masonk commented Feb 18, 2020

Yes, it should be plenty for that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants