Skip to content

[discussion needed] [videoAPI][bc-breaking] custom class for video frames #2981

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
bjuncek opened this issue Nov 10, 2020 · 3 comments
Open

Comments

@bjuncek
Copy link
Contributor

bjuncek commented Nov 10, 2020

🚀 Feature

Design and implement a custom class to hold the return of a video frame.

Motivation

At the moment the new VideoReader API returns a dictionary with a byte tensor "data" and a "pts" value. This works well for numerical data (such as audio and video), however, it may cause issues with CC and SUB streams that return strings. Additionally, if we ever want to expose additional fields and functionalities, this approach would make it easy to do from either C++ or python.

Pitch

I propose to have a custom registered VideoFrame class that would be the sole return value of the video reader API.
For example, we could get a new frame:

frame = next(VideoReaderObject)

where the frame object would have a set type, and be able to return a value based on that return type.

VideoReaderObject.set_current_stream("video")
frame = next(VideoReaderObject)
frame.type()     # returns "video"
frame.data()     # retruns uint8 byte tensor with video frame
frame.pts().      # returns float with the current presentation timestamp

VideoReaderObject.set_current_stream("subtitle")
frame = next(VideoReaderObject)
frame.type()     # returns "subtitle"
frame.data()     # returns a string

Alternatives

We could keep the current API, and cast the pointer to the string to the byte tensor, and then cast it back in python.
Not sure if and how well this could work, but it would be less disruptive. We can always alter the behaviour of python API in the wrapper we have at the moment, but I'm not sure what sort of time overhead that would be adding.

cc @bjuncek

@fmassa
Copy link
Member

fmassa commented Feb 15, 2021

A few questions:

  • will we have different Frame classes for each type, or a single one for all types?
  • the current proposed API doesn't play well with torchscript on the Python side, as .data() will return different types depending on what has been set up. Is the implementation in [WIP] new videoreader API now returns a scriptable Frame object #3077 up to data wrt this?
  • From C++, will we have different functions for getting the data as subtitles or audio?

PS: I've just realized that the current python wrapper is not torchscript-friendly either, because IIRC torchscript doesn't support dictionaries with heterogeneous types

@bjuncek
Copy link
Contributor Author

bjuncek commented Feb 16, 2021

There are a few potential solutions:

will we have different Frame classes for each type, or a single one for all types?

Ideally, we'd have a single frame class in C++. This class would have different (and extendable) ways of retrieving the data based on the type. In the WIP PR I have implemented .data_t() which returns tensor (video, audio). The idea is that you can follow that with .data_s() which would return a string (subtitles, CC). In principle, one can imagine extending this with any (scriptable) datatype.

the current proposed API doesn't play well with torchscript on the Python side, as .data() will return different types depending on what has been set up. Is the implementation in #3077 up to data wrt this?

I didn't know this was an issue (we have the python wrapper set up like this at the moment), and I was planning to keep it that way. An alternative would be to have the api modified slightly like this:

VideoReaderObject.set_current_stream("video")
frame = next(VideoReaderObject)
frame.type()     # returns "video"
frame.video()   # retruns uint8 byte tensor with video frame
frame.audio()   # returns empty tensor
frame.subtitle() # returns empty string
frame.pts().      # returns float with the current presentation timestamp



VideoReaderObject.set_current_stream("subtitle")
frame = next(VideoReaderObject)
frame.type()       # returns "subtitle"
frame.subtitle()  # returns a string
frame.video().    # returns empty tensor

From C++, will we have different functions for getting the data as subtitles or audio?

Yes; we have this now already (filling uint8 vs copying float tensor). We'd need to add filling the char buffer for string and cc.

@bjuncek
Copy link
Contributor Author

bjuncek commented Feb 16, 2021

PS: I've just realized that the current python wrapper is not torchscript-friendly either, because IIRC torchscript doesn't support dictionaries with heterogeneous types

would the move to the API like above help in mitigating this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants