FFmpeg-based rescaling and frame rate #3016

Open
@stefanwayon

🚀 Feature

Add support for (basic) FFmpeg filters for faster video pre-processing. In particular, rescaling and changing the frame rate would be useful when feeding in-the-wild videos through a trained model.

Motivation

I am working on a video loader to feed video frames to a model trained on the Kinetics 400 dataset and obtain predictions. The model is trained at a fixed resolution, on videos with a frame rate of 15fps. To support making predictions on videos from various sources, I at least need to resample them at the correct resolution and frame rate.

The current public API only supports decoding and trimming video frames, not any other pre-processing, so I have to do all such pre-processing in Python/PyTorch. This approach is visibly slower than an implementation based on ffmpeg-python, a wrapper around the ffmpeg command-line tool. For some stats, see Additional context.
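To make the Python-side frame rate change concrete, here is a minimal sketch of one way to pick which decoded frames to keep. It is not the exact code used for the measurements; the function name `resample_indices` and the nearest-earlier-frame strategy are assumptions for illustration only.

```python
from fractions import Fraction
import math


def resample_indices(num_frames, src_fps, dst_fps):
    """Return source-frame indices that approximate dst_fps (hypothetical helper).

    For each output timestamp i / dst_fps, keep the latest source frame
    whose timestamp is not after it. Fractions avoid floating-point drift.
    """
    src = Fraction(src_fps)
    dst = Fraction(dst_fps)
    if num_frames < 0 or src <= 0 or dst <= 0:
        raise ValueError("num_frames must be >= 0 and frame rates must be > 0")
    # Number of output frames that fit in the clip's duration.
    num_out = math.floor(num_frames * dst / src)
    # Clamp to the last valid index to stay in range.
    return [min(math.floor(i * src / dst), num_frames - 1) for i in range(num_out)]


# 1 s of 30 fps video resampled to 15 fps keeps every second frame.
indices = resample_indices(30, 30, 15)
print(indices)
```

The resulting list can then be used to index a stacked frame tensor before rescaling, which is the step that FFmpeg's fps filter would otherwise perform during decoding.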

Pitch

I would like to start a conversation on how best to bring such functionality to Torchvision. I imagine changing the resolution/fps is a common requirement for making predictions on videos, so I can see it as a useful feature of video I/O. Looking at the C++ code, there is already some support for requesting video frames of a certain resolution [1][2], but this functionality is only exposed in torch.ops.video_reader.read_video_from_file, not the public API. I can’t find anything similar for requesting a certain frame rate.

Is this something that you would want to add to torchvision.io.read_video? What about torchvision.io.VideoReader? More generally, is there a plan to add support for all FFmpeg filters in the future? What would that interface look like?

Additional context

I’ve done some initial comparisons between two pipelines: torchvision.io.VideoReader followed by a frame rate change in Python and torch rescaling on batches of 16 frames, versus an ffmpeg-python pipeline with scale and fps filters. The input is an 854x480, 30fps MP4 video of ~261s. I’ve included the results below.

Decoding the first seconds of a clip (output fps=15, output size=input size):

clip-length

Decoding 1s of video for given start time (output fps=15, output size=input size):

start-time

Changing the framerate for the first 1s of video (output size=input size):

framerate-1s

Changing the framerate for the first 5s of video (output size=input size):

framerate-5s

Rescaling the first 1s of video (output fps=15):

scale-1s

Rescaling the first 1s of video with the bilinear-fast FFmpeg algorithm (output fps=15):

scale-1s-fast

Rescaling the first 5s of video (output fps=15):

scale-5s
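For reference, the FFmpeg side of these comparisons can be sketched as an equivalent ffmpeg command line. This is only an illustration under assumptions: the helper `build_ffmpeg_cmd` is hypothetical, the measurements above used ffmpeg-python rather than this exact command, and mapping "bilinear-fast" to the scaler flag `fast_bilinear` is my guess at the setting.

```python
def build_ffmpeg_cmd(path, start=0.0, duration=1.0, fps=15, size=None, fast=False):
    """Build an ffmpeg argument list that trims, resamples and rescales a video.

    Hypothetical helper: it only assembles arguments and does not run ffmpeg.
    """
    # Drop frames first so the scaler only processes frames that are kept.
    filters = [f"fps={fps}"]
    if size is not None:
        width, height = size
        scale = f"scale={width}:{height}"
        if fast:
            # Assumption: "bilinear-fast" corresponds to swscale's fast_bilinear.
            scale += ":flags=fast_bilinear"
        filters.append(scale)
    return [
        "ffmpeg", "-hide_banner", "-loglevel", "error",
        "-ss", str(start),        # seek to the start time (input option)
        "-t", str(duration),      # read only this many seconds
        "-i", path,
        "-vf", ",".join(filters),
        "-f", "rawvideo", "-pix_fmt", "rgb24",  # raw RGB frames
        "pipe:1",                 # write to stdout
    ]


cmd = build_ffmpeg_cmd("input.mp4", size=(224, 224), fast=True)
print(" ".join(cmd))
```

Passing the list to `subprocess.run(cmd, capture_output=True)` would yield raw RGB bytes of `height * width * 3` per frame, which can be reshaped into a frame array. The file name and the 224x224 size are placeholders.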

cc @bjuncek
