Description
🚀 Feature
Add support for (basic) FFmpeg filters for faster video pre-processing. In particular, rescaling and changing the frame rate would be useful when feeding in-the-wild videos through a trained model.
Motivation
I am working on a video loader to feed video frames to a model trained on the Kinetics 400 dataset and obtain predictions. The model is trained at a fixed resolution, on videos with a frame rate of 15fps. To support making predictions on videos from various sources, I at least need to resample them at the correct resolution and frame rate.
The current public API only supports decoding of video frames and trimming, but not any other pre-processing, so I need to do any such pre-processing in Python/PyTorch. Such an approach is noticeably slower than an implementation based on ffmpeg-python, a wrapper around the command line ffmpeg. For some stats, see Additional context.
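To make the Python-side frame rate change concrete, here is a minimal sketch of the index selection it involves. The function name resample_frame_indices is hypothetical and not part of torchvision; it assumes a constant source frame rate and picks the nearest earlier source frame for each output frame, which is one possible approach rather than necessarily what FFmpeg's fps filter does.

```python
import math
from typing import List


def resample_frame_indices(num_frames: int, src_fps: float, dst_fps: float) -> List[int]:
    """Return the source frame indices to keep so the output approximates dst_fps.

    Hypothetical helper for illustration only. Assumes a constant source
    frame rate; duplicates indices when upsampling and skips them when
    downsampling.
    """
    if num_frames < 0 or src_fps <= 0 or dst_fps <= 0:
        raise ValueError("num_frames must be >= 0 and frame rates must be positive")
    if num_frames == 0:
        return []
    step = src_fps / dst_fps            # source frames per output frame
    duration = num_frames / src_fps     # clip length in seconds
    # Small epsilon guards against floating-point results like 9.999999999.
    num_out = int(math.floor(duration * dst_fps + 1e-9))
    return [
        min(int(math.floor(i * step + 1e-9)), num_frames - 1)
        for i in range(num_out)
    ]


# Example: 1s of 30fps video resampled to 15fps keeps every second frame.
indices = resample_frame_indices(num_frames=30, src_fps=30, dst_fps=15)
```

The selected frames would then still need to be gathered and rescaled in PyTorch, which is the part that adds up to the overhead described above.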
Pitch
I would like to start a conversation on how best to bring such functionality to Torchvision. I imagine changing the resolution/fps is a common requirement for making predictions on videos, so I can see it as a useful feature of video I/O. Looking at the C++ code, there is already some support for requesting video frames of a certain resolution [1][2], but this functionality is only exposed in torch.ops.video_reader.read_video_from_file, not the public API. I can’t find anything similar for requesting a certain frame rate.
Is this something that you would want to add to torchvision.io.read_video? What about torchvision.io.VideoReader? More generally, is there a plan to add support for all FFmpeg filters in the future? What would that interface look like?
Additional context
I’ve done some initial comparisons between two approaches: (a) torchvision.io.VideoReader, with the frame rate changed in Python and torch rescaling applied on batches of 16 frames, and (b) an ffmpeg-python pipeline with scale and fps filters. Both were run on an 854x480@30fps MP4 input video of ~261s. I’ve included the results below.
Decoding the first seconds of a clip (output fps=15, output size=input size):
Decoding 1s of video for given start time (output fps=15, output size=input size):
Changing the framerate for the first 1s of video (output size=input size):
Changing the framerate for the first 5s of video (output size=input size):
Rescaling the first 1s of video (output fps=15):
Rescaling the first 1s of video with the bilinear-fast FFmpeg algorithm (output fps=15):
Rescaling the first 5s of video (output fps=15):
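For reference, the FFmpeg side of the comparison can be sketched roughly as follows. This is not the exact script used for the timings; it calls the ffmpeg command line directly through subprocess instead of going through ffmpeg-python, and the function names, the 224x224 target size and the file name are placeholders chosen for illustration.

```python
import subprocess
from typing import List, Optional


def build_ffmpeg_command(
    path: str,
    width: int,
    height: int,
    fps: float,
    start: float = 0.0,
    duration: Optional[float] = None,
    scale_flags: str = "bilinear",
) -> List[str]:
    """Build an ffmpeg command that trims, resamples and rescales a video.

    Hypothetical helper: frames are written to stdout as raw RGB24 bytes.
    """
    cmd = ["ffmpeg", "-v", "error", "-ss", str(start)]
    if duration is not None:
        cmd += ["-t", str(duration)]
    # Apply fps before scale so that dropped frames are never rescaled.
    video_filter = f"fps={fps},scale={width}:{height}:flags={scale_flags}"
    cmd += ["-i", path, "-vf", video_filter,
            "-f", "rawvideo", "-pix_fmt", "rgb24", "pipe:1"]
    return cmd


def decode_frames(path: str, width: int, height: int, fps: float,
                  start: float = 0.0, duration: Optional[float] = None) -> bytes:
    """Run ffmpeg and return the raw frame bytes (width * height * 3 per frame)."""
    cmd = build_ffmpeg_command(path, width, height, fps, start, duration)
    result = subprocess.run(cmd, stdout=subprocess.PIPE, check=True)
    return result.stdout


# Example: the command for the first 1s of a clip at 15fps and 224x224.
example_cmd = build_ffmpeg_command("input.mp4", 224, 224, 15, duration=1.0)
```

The returned bytes can be reshaped into a (num_frames, height, width, 3) uint8 array before being passed to the model. An equivalent public API in Torchvision would remove the need for this extra process.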
cc @bjuncek