New video API Proposal #2660
Adding the feature request tracker:
Hi @bjuncek, regarding the [...] BTW, any plans for supporting compressed visual features without decoding?
Hi @bryandeng,
At the moment, the seek function only implements "precise" seek, that is [...] Having said that, I believe you are correct. Specifically, the behaviour we're thinking of is the following: let's imagine there exists a pair of keyframes [...] Also, please note that [...] Does that answer your first question?
Not at the moment, unfortunately. I'm not ruling it out completely, but it's out of scope for the next one or two releases for sure. Do you think there would be much demand for it? Best,
Hi @bjuncek,
Sorry for the late reply, and thanks for your clear explanation. Regarding my previous question 1: will a "higher level" multiple-frame read function based on this new video API be provided, one which takes a list of frame indices or pts values as input and hides the details of keyframes, seeking, and caching from the user? This resembles how the [...] And for question 2, personally speaking, compressed visual features are among the top priorities after ordinary visual and acoustic features, and as far as I know toolkits like MMAction2 are implementing them. Our team at Tencent uses a home-made video reading library which is also FFmpeg-based. It supports CPU/GPU decoding and [...]
Hi @bryandeng,
The idea is to replace the current implementation of [...]
There is unfortunately no generic and reliable way of figuring out the number of frames in an arbitrary video file. So the approach taken by Decord, which proposes a [...] This means that in order to reliably provide such functionality we would first need to decode the whole video (or get an estimate of the pts for each frame, which might not always be possible), and that would be very slow. I would love to be proven wrong though :-)
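To illustrate the point made above: index-based frame access needs a per-frame pts table, and in general that table can only be built by walking every frame of the video first. The sketch below models this with a plain Python list standing in for a decoded stream; `build_pts_index` and `get_frames_by_index` are hypothetical names, not part of any real API.

```python
# Sketch: index-based access requires an index -> pts table, which in
# general can only be built by decoding (or at least demuxing) the whole
# video. A toy list of (pts, payload) pairs stands in for a real decoder.

def build_pts_index(frames):
    """Map frame index -> presentation timestamp (pts)."""
    return {i: pts for i, (pts, _) in enumerate(frames)}

def get_frames_by_index(frames, indices):
    """Return the frame payloads for the requested frame indices."""
    table = build_pts_index(frames)
    wanted = {table[i] for i in indices}
    return [data for pts, data in frames if pts in wanted]

# Toy "video" with a variable frame rate, so index i cannot simply be
# computed as i / fps -- the full pts table is genuinely needed.
video = [(0.0, "f0"), (0.04, "f1"), (0.09, "f2"), (0.13, "f3")]
print(get_frames_by_index(video, [1, 3]))  # -> ['f1', 'f3']
```

With a real container the first pass over all frames is exactly the slow full decode the comment describes.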
This is currently not in our roadmap, but we could consider implementing this in the future (for torchvision 0.10 or beyond most probably). Can you open a separate issue to discuss this functionality?
GPU decoding for video is something we would love to explore, and @bjuncek is currently looking into implementing it, although I believe for now only a subset of formats support GPU decoding.
That would be great! Can you first open a separate issue for GPU video decoding, so that we can discuss the potential formats that would be supported, etc.? cc @takatosp1 @tullie for awareness
🚀 Feature
We're proposing to add a lower-level, more flexible, and equally robust API compared to the one currently existing in torchvision. It would be implemented in C++ and be compatible with torchscript. Following the merge of #2596, it would also be installable via pip or conda.
Motivation
Currently, our API supports returning a tensor of shape (T x C x H x W) via the read_video abstraction (see here). This can be prohibitive if a user wants to get a single frame or perform operations on a per-frame basis. For example, I've run into multiple situations where I wanted to return a single frame, iterate over frames, or (as in the EPIC Kitchens dataset) reduce memory usage by transforming the elements before saving them to output tensors.
Pitch
We propose the following style of API:
First, we'd have a constructor that would be part of torch's registered C++ classes, and would take some basic inputs.
Returning a frame is as simple as calling next on the container (optionally, we can specify the stream from which we'd like to return the next frame). What a frame is will largely depend on the encoding of the video: for video streams it is almost always an RGB image, whilst for audio it might be a 1024-point sample. In most cases the same temporal timespan is covered by a variable number of frames (1s of a video might contain 30 video frames and 40 audio frames), so returning the presentation timestamp (pts) of the returned frame allows more precise control over the resulting clip.
To get the exact frame that we want, a seek function can be exposed (with an optional stream definition). Seeking is done either to the closest keyframe before the requested timestamp, or to the exact frame if possible. For example, if we seek to 5s in a video container, the following call to next() will return either: 1) the last keyframe before 5s (if any_frame=False); 2a) the frame with pts=5.0 (if any_frame=True and a frame at 5s exists); or 2b) the first frame after 5s, e.g. with pts=5.03 (if any_frame=True and a frame at exactly 5s doesn't exist).
We plan to expose metadata getters, and add additional functionality down the line.
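The seek/next semantics above can be sketched with a toy in-memory "container" in place of a real decoder. Everything here is hypothetical (the class name, the frame representation); only the behaviour described in the proposal is modelled.

```python
# Toy model of the proposed seek/next semantics. Frames are (pts,
# is_keyframe) pairs in increasing pts order; a real API would return
# decoded frame tensors rather than bare timestamps.

class ToyVideoContainer:
    def __init__(self, frames):
        self.frames = frames  # list of (pts, is_keyframe)
        self.pos = 0

    def seek(self, ts, any_frame=False):
        if any_frame:
            # "precise" seek: position on the first frame with pts >= ts
            self.pos = next(i for i, (pts, _) in enumerate(self.frames)
                            if pts >= ts)
        else:
            # keyframe seek: position on the last keyframe with pts <= ts
            self.pos = max(i for i, (pts, kf) in enumerate(self.frames)
                           if kf and pts <= ts)
        return self

    def next(self):
        pts, _ = self.frames[self.pos]
        self.pos += 1
        return pts

frames = [(0.0, True), (1.0, False), (4.0, True), (5.03, False), (6.0, True)]
c = ToyVideoContainer(frames)
print(c.seek(5.0, any_frame=False).next())  # -> 4.0 (last keyframe before 5s)
print(c.seek(5.0, any_frame=True).next())   # -> 5.03 (no frame at exactly 5s)
```

The two printed values correspond to cases 1) and 2b) in the paragraph above.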
Alternatives
In the end, every video decoder library is a tradeoff between speed and flexibility. Libraries that support batch decoding such as decord offer greater speed (due to multithreaded Loader objects and/or GPU decoding) at the expense of dataloader compatibility, robustness (in terms of available formats), or flexibility. Other libraries that offer greater flexibility such as pyav, opencv, or decord (in sequential reading mode) can sacrifice either speed or ease of use.
We're aiming for this API to be as close in flexibility to pyav as possible, with the same (or better) per-frame decoding speed, all while being torch scriptable.
Additional context
Whilst technically this would mean deprecating our current read_video API, during a transition period we would keep supporting it through a simple function that mimics the current read_video implementation, with minimal to no performance impact.
cc @bjuncek
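The compatibility shim described above could look roughly like the following: a read_video-style function layered on top of a sequential seek/next container. Both `ToyContainer` and `read_video_compat` are hypothetical stand-ins for illustration, and a real implementation would stack decoded frames into a (T, C, H, W) tensor rather than collect them in a list.

```python
# Sketch of a backwards-compatibility shim: a read_video-style call
# implemented on top of a sequential container API. ToyContainer stands
# in for the proposed C++ container class.

class ToyContainer:
    def __init__(self, frames):  # frames: list of (data, pts)
        self.frames, self.pos = frames, 0

    def seek(self, ts):
        # position on the first frame with pts >= ts
        self.pos = next((i for i, (_, pts) in enumerate(self.frames)
                         if pts >= ts), len(self.frames))

    def next(self):
        if self.pos >= len(self.frames):
            raise StopIteration
        item = self.frames[self.pos]
        self.pos += 1
        return item

def read_video_compat(container, start_pts=0.0, end_pts=float("inf")):
    """Collect every frame with start_pts <= pts <= end_pts."""
    container.seek(start_pts)
    out = []
    while True:
        try:
            frame, pts = container.next()
        except StopIteration:
            break
        if pts > end_pts:
            break
        out.append(frame)
    return out

c = ToyContainer([("f0", 0.0), ("f1", 0.5), ("f2", 1.0), ("f3", 1.5)])
print(read_video_compat(c, 0.5, 1.0))  # -> ['f1', 'f2']
```

Since the shim only seeks once and then reads sequentially, it adds essentially no overhead on top of the underlying decoder, which matches the "minimal to no performance impact" claim.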