2022: state of video IO in torchvision #5720
Comments
Hello. Apologies if this is the wrong place to post feedback on video functions.

One thing that is important for professional use cases (read: working with master video files) is PTS timing that maintains the rational integer representation of sample-based timing, i.e. libAV provides access to the stream's time base as well as the presentation timestamp as a numerator. Another important aspect is the ability to introspect rich container metadata as well as timecode data. This is important for correspondence with sidecar (or embedded) text tracks like closed captioning / subtitles.

With respect to PyAV: it appears that with some specific code invocations, a properly compiled FFmpeg, and a PyAV install that doesn't overwrite the FFmpeg library, GPU decode is possible (see PyAV-Org/PyAV#451 for nv_dec, and PyAV-Org/PyAV#596 for nv_enc encoding with a 10x speed-up). I think the only missing piece is direct GPU decode to a tensor without CPU readback.

It seems that, to some degree, the GPU expertise demonstrated by the PyTorch developers might be better used to help support PyAV directly, so that the wider community can reap the benefits of a hardware-accelerated PyAV with direct-to-GPU decode, and PyTorch gains the benefit of using PyAV, which can supply the above 'pro video' stream access to text, metadata, audio, and video streams, as well as fall back to software decode if needed.

Apologies if this is long-winded or misplaced. I'm excited for a functional solution to native high-performance video infrastructure in DL tooling. Thanks.
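As a small illustration of the rational-timing point in the comment above: a libav-style timestamp is an integer PTS counted in units of a rational stream time base, so a presentation time stays exact only while it is kept as a fraction. The sketch below uses nothing but Python's standard fractions module. The 1/30000 time base and the 1001-tick frame duration are assumed values (typical of 29.97 fps material), not taken from any file in this thread, and the helper names are made up for illustration.

```python
from fractions import Fraction

# Assumed example values (typical for 29.97 fps material); a real stream
# reports its own time base and frame spacing.
TIME_BASE = Fraction(1, 30000)  # seconds per PTS tick
FRAME_TICKS = 1001              # PTS ticks between consecutive frames


def pts_to_seconds(pts: int, time_base: Fraction) -> Fraction:
    """Exact presentation time: integer PTS times the rational time base."""
    return pts * time_base


def frame_index(pts: int, frame_ticks: int) -> int:
    """Frame number for a PTS, using integer arithmetic only."""
    return pts // frame_ticks


def float_accumulated_seconds(n_frames: int) -> float:
    """Lossy alternative: repeatedly add a float frame duration."""
    duration = FRAME_TICKS / 30000
    total = 0.0
    for _ in range(n_frames):
        total += duration
    return total


if __name__ == "__main__":
    n = 30000
    pts = n * FRAME_TICKS
    exact = pts_to_seconds(pts, TIME_BASE)
    print("exact seconds:", exact)
    # Any difference printed here is float rounding error, which the
    # rational representation avoids.
    print("float drift:", float_accumulated_seconds(n) - float(exact))
```

Keeping the (PTS, time base) pair intact is also what makes it possible to line frames up against timecode or caption tracks without rounding surprises.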
FWIW, with a properly set up FFMPEG install (I used the jrottenberg/ffmpeg 4.4.2-nvidia2004 base container, installed python, and did h264: Took 0:00:18.961413 in this code:
Media Info on the file:
On a 3090
It also seems that VideoReader overlaps with torchaudio's StreamReader, which was added in the latest release: https://pytorch.org/audio/0.12.0/io.html#streamreader. This StreamReader even boasts GPU-based video decoding.

IMO it's important that there's no duplication and no different APIs for the same thing (especially if the goals are very similar). Maybe factor these out into some common repo/library. If very much wanted, wheels of torchvision and torchaudio could register their own handlers / plugins into the common IO layer. Or they could even ship their own compiled libraries, but at least the source code / API should be unified. Maybe all image/audio/video IO could be moved to some torchio module.

StreamReader probably also comes with its own quirks and problems of ffmpeg compilation / from-source compilation. Factoring ffmpeg-related stuff into its own package would also simplify testing / building of the simpler parts of torchvision/torchaudio.
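To make the handler / plugin idea above a bit more concrete, here is a minimal sketch of what a shared IO layer with registered backends could look like. Everything in it is hypothetical: register_backend, open_stream, and the dummy reader are made-up names for illustration, and none of them exist in torchvision or torchaudio.

```python
from typing import Callable, Dict, Iterator, Tuple

# Hypothetical registry mapping (media kind, backend name) to a reader
# factory. These names are illustrative only and belong to no real library.
ReaderFactory = Callable[[str], Iterator[dict]]
_REGISTRY: Dict[Tuple[str, str], ReaderFactory] = {}


def register_backend(kind: str, name: str, factory: ReaderFactory) -> None:
    """Called by a package (e.g. a wheel shipping a decoder) at import time."""
    _REGISTRY[(kind, name)] = factory


def open_stream(path: str, kind: str, backend: str) -> Iterator[dict]:
    """Single entry point: look up the backend and delegate to it."""
    try:
        factory = _REGISTRY[(kind, backend)]
    except KeyError:
        available = sorted(n for k, n in _REGISTRY if k == kind)
        raise ValueError(
            f"no {backend!r} backend for {kind!r}; available: {available}"
        ) from None
    return factory(path)


def _dummy_video_reader(path: str) -> Iterator[dict]:
    """Stand-in decoder that yields three fake frames."""
    for i in range(3):
        yield {"pts": i, "data": f"{path}:frame{i}"}


# A real package would register its compiled decoder here instead.
register_backend("video", "dummy", _dummy_video_reader)

if __name__ == "__main__":
    for frame in open_stream("clip.mp4", "video", "dummy"):
        print(frame)
```

The point of the design is that callers only ever see open_stream, while each package decides which decoders it contributes.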
There have been many developments over the last couple of months, with a big push in 2022H1 to get things closed up (mainly by @prabhat00155 and @datumbox). Here I'll try to summarize the current state of things.
Features (current, in-dev)
At the moment, torchvision has two APIs one can use for video reading:
- read_video video API (stable) -- this is a legacy video-reading solution that we're looking to move away from. However, due to external use, we continue to support and patch it. It supports the pyav and video_reader backends.
- VideoReader fine-grained API (prototype, New video API Proposal #2660) -- we're moving towards this as a goal for 2022. The API itself is finished; however, due to issues with various backends it still remains unused (see the installation issue below). It supports the video_reader and GPU backends.

Furthermore, we also have three backends for video reading:
- pyav -- naive extension of PyAV capabilities.
- video_reader -- our own C++ implementation that allows video IO to be torchscriptable. If the JIT requirement is dropped, it might be deprecated despite minor speed improvements over pyav.
- GPU -- highly experimental and not yet properly tested. Maintenance and further development will depend on the demand from customers and community.

The overall goal in 2022 is to migrate all APIs (and prototype datasets) to the VideoReader API, and hopefully deprecate read_video as much as possible.

Related tasks include (will be updated):
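For reference, the fine-grained VideoReader API described above is driven by seeking and then iterating over frames, each of which is a dict with "data" and "pts" entries. The sketch below assumes a 2022-era torchvision built with VideoReader support. The helper names frames_between and read_clip are made up for illustration. torchvision is imported lazily so that the frame-selection logic can be exercised without it.

```python
import itertools
from typing import Any, Iterator, Tuple


def frames_between(
    reader: Any, start_s: float, end_s: float
) -> Iterator[Tuple[float, Any]]:
    """Yield (pts_seconds, frame) pairs with start_s <= pts <= end_s.

    `reader` is anything that behaves like torchvision.io.VideoReader:
    it has seek(time_s) and iterates over dicts with "pts" and "data".
    """
    reader.seek(start_s)
    for frame in itertools.takewhile(lambda f: f["pts"] <= end_s, reader):
        # Defensive filter in case the seek landed before start_s
        # (for example on a preceding keyframe).
        if frame["pts"] >= start_s:
            yield frame["pts"], frame["data"]


def read_clip(path: str, start_s: float, end_s: float) -> list:
    """Assumes torchvision with VideoReader support is installed."""
    from torchvision.io import VideoReader  # lazy import, see lead-in

    reader = VideoReader(path, "video")
    return list(frames_between(reader, start_s, end_s))
```

By contrast, the legacy read_video call decodes the requested range in one go and returns the video tensor, the audio tensor, and an info dict together, which is simpler but gives no frame-by-frame control.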
Currently known issues and enhancements needed
Probably the biggest issue plaguing video is installation (see #4260 for some reference). If a user wants to install the ffmpeg or GPU backends and support for the VideoReader API, they need to install torchvision from source, and in the case of GPU also download proprietary drivers from NVIDIA. This process should be properly documented until a better/alternative solution is found.

Due to the lack of users, real-world bug reports have been scarce. Here is the (non-exhaustive) list of known issues and their progress, sorted by topic, with additional comments in italics if applicable.
General
video_reader backend and VideoReader API
GPU decoding issues and enhancements (note: these are low-pri due to lack of developers and road-map changes, so we'll be relatively slow in fixing these):
Archived feature requests
cc @datumbox for visibility