Cache HW device context #3178

Closed · mthrok wants to merge 1 commit

Conversation

@mthrok (Collaborator) commented Mar 16, 2023

This commit adds a caching mechanism for the CUDA HW device context used in GPU video decoding.
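
Conceptually, the cache maps a CUDA device to its FFmpeg HW device context: the context is created on first use and kept alive until it is explicitly cleared. The actual implementation lives in torchaudio's C++ FFmpeg integration; the Python sketch below only illustrates the caching pattern, and every name in it is hypothetical.

_CTX_CACHE = {}  # device index -> cached HW device context


def _create_cuda_device_context(device_index):
    # Stand-in for the FFmpeg call (av_hwdevice_ctx_create) that the real
    # C++ code performs; creating the context holds GPU memory until released.
    return object()


def get_device_context(device_index):
    # Return the cached HW device context, creating it on first use.
    if device_index not in _CTX_CACHE:
        _CTX_CACHE[device_index] = _create_cuda_device_context(device_index)
    return _CTX_CACHE[device_index]


def clear_cache():
    # Drop all cached contexts so their GPU memory can be reclaimed.
    _CTX_CACHE.clear()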

The following table shows the decode time in seconds, without and with the patch, when decoding 3 seconds of HEVC video with CUVID on an NVIDIA GeForce RTX 3080.

n    Without Patch    With Patch
1        0.593            0.562
2        0.187            0.093
3        0.203            0.094
4        0.203            0.094
5        0.219            0.188
6        0.203            0.094
7        0.203            0.094
8        0.203            0.094

Note: the HW device context cache was cleared before the 5th iteration.

The first time the video decoder is used, additional one-time initialization takes place, so the first iteration is slower in both cases; in all subsequent iterations, caching the device context clearly improves the decoding speed.

With 30 seconds of HEVC video, the improvement is roughly 0.13 seconds per iteration.

n    Without Patch    With Patch
1        1.031            0.953
2        0.578            0.484
3        0.594            0.469
4        0.578            0.468
5        0.578            0.562
6        0.578            0.469
7        0.562            0.500
8        0.562            0.485

Note: the HW device context cache was cleared before the 5th iteration.

Memory-wise, the cached device context holds about 200 MB of GPU memory.
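
That memory can be given back on demand: the patch exposes torchaudio.utils.ffmpeg_utils.clear_cuda_context_cache, which the benchmark script below also uses. A minimal usage sketch (the try/except is there because the function only exists on builds that include this patch):

import torchaudio

# Release the cached CUDA HW device context (~200 MB of GPU memory).
try:
    torchaudio.utils.ffmpeg_utils.clear_cuda_context_cache()
except AttributeError:
    pass  # running a build without this patch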

code

The test data is generated with one of the following commands (run separately for the 3-second and the 30-second benchmark; both write test.hevc):

  • ffmpeg -f lavfi -i mandelbrot -t 3 -c:v libx265 -pix_fmt yuv420p10le -vtag hvc1 -y test.hevc
  • ffmpeg -f lavfi -i mandelbrot -t 30 -c:v libx265 -pix_fmt yuv420p10le -vtag hvc1 -y test.hevc

import subprocess
import time

import torch
import torchaudio
from torchaudio.io import StreamReader


def test():
    # Decode the entire file on GPU (CUVID decoder, frames kept on CUDA).
    src = "test.hevc"

    r = StreamReader(src=src)
    r.add_video_stream(
        frames_per_chunk=-1,
        decoder="hevc_cuvid",
        hw_accel="cuda",
    )
    r.process_all_packets()
    r.pop_chunks()


def report(msg):
    # Print a label followed by the current GPU memory usage.
    print(f"{msg:20s}", end="\t", flush=True)
    subprocess.run(["nvidia-smi", "--query-gpu=memory.used,utilization.memory", "--format=csv,noheader"])


def clear_hw_cache():
    # Clear the HW device context cache; the API only exists with this patch,
    # so AttributeError is expected on the upstream main branch.
    try:
        torchaudio.utils.ffmpeg_utils.clear_cuda_context_cache()
        report("clear hw context cache")
    except AttributeError:
        pass


print(torchaudio.__version__)

report("Start up")

# Initialize the CUDA runtime up front so its fixed overhead is not
# attributed to the first decode.
_ = torch.empty([1], device=torch.device("cuda"))
torch.cuda.empty_cache()
report("After dummy op")

for i in range(8):
    report(f"Start - {i}:")
    t0 = time.monotonic()
    test()
    elapsed = time.monotonic() - t0
    report(f"Finish - {elapsed:.3f} [sec]:")

    torch.cuda.empty_cache()
    report("Clear torch cuda cache")

    # Drop the cached HW device context before the 5th iteration.
    if i == 3:
        clear_hw_cache()

clear_hw_cache()

raw data (3sec)

Upstream main branch

2.0.0a0+a6b34a5
Start up                1213 MiB, 21 %
After dummy op          1427 MiB, 21 %
Start - 0:              1427 MiB, 21 %
Finish - 0.593 [sec]:   1737 MiB, 2 %
Clear torch cuda cache  1465 MiB, 2 %
Start - 1:              1465 MiB, 2 %
Finish - 0.187 [sec]:   1737 MiB, 1 %
Clear torch cuda cache  1465 MiB, 2 %
Start - 2:              1465 MiB, 2 %
Finish - 0.203 [sec]:   1737 MiB, 1 %
Clear torch cuda cache  1465 MiB, 1 %
Start - 3:              1465 MiB, 2 %
Finish - 0.203 [sec]:   1737 MiB, 2 %
Clear torch cuda cache  1465 MiB, 2 %
Start - 4:              1465 MiB, 2 %
Finish - 0.219 [sec]:   1737 MiB, 2 %
Clear torch cuda cache  1465 MiB, 2 %
Start - 5:              1465 MiB, 2 %
Finish - 0.203 [sec]:   1737 MiB, 2 %
Clear torch cuda cache  1465 MiB, 2 %
Start - 6:              1465 MiB, 2 %
Finish - 0.203 [sec]:   1737 MiB, 1 %
Clear torch cuda cache  1465 MiB, 2 %
Start - 7:              1465 MiB, 2 %
Finish - 0.203 [sec]:   1737 MiB, 1 %
Clear torch cuda cache  1465 MiB, 1 %

This commit

2.0.0a0+dea6566
Start up                1213 MiB, 43 %
After dummy op          1427 MiB, 15 %
Start - 0:              1427 MiB, 15 %
Finish - 0.562 [sec]:   1948 MiB, 0 %
Clear torch cuda cache  1676 MiB, 0 %
Start - 1:              1676 MiB, 2 %
Finish - 0.093 [sec]:   1948 MiB, 2 %
Clear torch cuda cache  1676 MiB, 2 %
Start - 2:              1676 MiB, 2 %
Finish - 0.094 [sec]:   1948 MiB, 2 %
Clear torch cuda cache  1676 MiB, 2 %
Start - 3:              1676 MiB, 2 %
Finish - 0.094 [sec]:   1948 MiB, 2 %
Clear torch cuda cache  1676 MiB, 2 %
clear hw context cache  1465 MiB, 2 %
Start - 4:              1465 MiB, 1 %
Finish - 0.188 [sec]:   1948 MiB, 2 %
Clear torch cuda cache  1676 MiB, 2 %
Start - 5:              1676 MiB, 2 %
Finish - 0.094 [sec]:   1948 MiB, 1 %
Clear torch cuda cache  1676 MiB, 1 %
Start - 6:              1676 MiB, 1 %
Finish - 0.094 [sec]:   1948 MiB, 1 %
Clear torch cuda cache  1676 MiB, 1 %
Start - 7:              1676 MiB, 2 %
Finish - 0.094 [sec]:   1948 MiB, 2 %
Clear torch cuda cache  1676 MiB, 2 %
clear hw context cache  1465 MiB, 2 %
raw data (30sec)

Upstream main branch

2.0.0a0+a6b34a5
Start up                1250 MiB, 38 %
After dummy op          1471 MiB, 12 %
Start - 0:              1471 MiB, 12 %
Finish - 1.031 [sec]:   4202 MiB, 7 %
Clear torch cuda cache  1502 MiB, 7 %
Start - 1:              1502 MiB, 7 %
Finish - 0.578 [sec]:   4202 MiB, 8 %
Clear torch cuda cache  1502 MiB, 8 %
Start - 2:              1502 MiB, 8 %
Finish - 0.594 [sec]:   4202 MiB, 7 %
Clear torch cuda cache  1502 MiB, 7 %
Start - 3:              1502 MiB, 7 %
Finish - 0.578 [sec]:   4202 MiB, 7 %
Clear torch cuda cache  1502 MiB, 5 %
Start - 4:              1502 MiB, 5 %
Finish - 0.578 [sec]:   4202 MiB, 7 %
Clear torch cuda cache  1502 MiB, 5 %
Start - 5:              1502 MiB, 5 %
Finish - 0.578 [sec]:   4202 MiB, 7 %
Clear torch cuda cache  1502 MiB, 7 %
Start - 6:              1502 MiB, 3 %
Finish - 0.562 [sec]:   4202 MiB, 8 %
Clear torch cuda cache  1502 MiB, 8 %
Start - 7:              1502 MiB, 8 %
Finish - 0.562 [sec]:   4202 MiB, 7 %
Clear torch cuda cache  1502 MiB, 7 %

This commit

2.0.0a0+dea6566
Start up                1252 MiB, 37 %
After dummy op          1466 MiB, 28 %
Start - 0:              1466 MiB, 28 %
Finish - 0.953 [sec]:   4415 MiB, 7 %
Clear torch cuda cache  1715 MiB, 6 %
Start - 1:              1715 MiB, 6 %
Finish - 0.484 [sec]:   4415 MiB, 8 %
Clear torch cuda cache  1715 MiB, 8 %
Start - 2:              1715 MiB, 8 %
Finish - 0.469 [sec]:   4415 MiB, 9 %
Clear torch cuda cache  1715 MiB, 9 %
Start - 3:              1715 MiB, 9 %
Finish - 0.468 [sec]:   4415 MiB, 7 %
Clear torch cuda cache  1715 MiB, 7 %
clear hw context cache  1504 MiB, 4 %
Start - 4:              1504 MiB, 4 %
Finish - 0.562 [sec]:   4415 MiB, 7 %
Clear torch cuda cache  1715 MiB, 7 %
Start - 5:              1715 MiB, 4 %
Finish - 0.469 [sec]:   4415 MiB, 7 %
Clear torch cuda cache  1715 MiB, 5 %
Start - 6:              1715 MiB, 5 %
Finish - 0.500 [sec]:   4415 MiB, 8 %
Clear torch cuda cache  1715 MiB, 8 %
Start - 7:              1715 MiB, 8 %
Finish - 0.485 [sec]:   4415 MiB, 7 %
Clear torch cuda cache  1715 MiB, 7 %
clear hw context cache  1504 MiB, 3 %

@facebook-github-bot (Contributor)

@mthrok has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

mthrok added a commit to mthrok/audio that referenced this pull request Mar 16, 2023
Summary:
TODO: add cache release

Pull Request resolved: pytorch#3178

Differential Revision: D44136275

Pulled By: mthrok

fbshipit-source-id: 202b687c246eab285b82768a8ee91a9f45d334d7
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D44136275

@mthrok mthrok marked this pull request as ready for review March 16, 2023 19:19
@mthrok mthrok requested a review from a team March 16, 2023 19:19
Summary:
TODO: add cache release

Pull Request resolved: pytorch#3178

Differential Revision: D44136275

Pulled By: mthrok

fbshipit-source-id: 002aec2dba734dec9a81778d200235ab940d1b73
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D44136275

@facebook-github-bot (Contributor)

@mthrok merged this pull request in 0c8c138.

@github-actions

Hey @mthrok.
You merged this PR, but labels were not properly added. Please add a primary and secondary label (See https://github.com/pytorch/audio/blob/main/.github/process_commit.py)

@mthrok mthrok deleted the hw_context branch March 17, 2023 23:58
mthrok added a commit to mthrok/audio that referenced this pull request Mar 27, 2023
In pytorch#3178, a mechanism to cache HW context was introduced.
This commit applies the reuse in StreamWriter, so that when
using GPU video decoding and encoding, they are shared.

This gives back about 250 - 300 MB of GPU memory.
mthrok added a commit to mthrok/audio that referenced this pull request Mar 29, 2023
Summary:
In pytorch#3178, a mechanism to cache HW device context was introduced.
This commit applies the reuse in StreamWriter, so that
when using GPU video decoding and encoding, they are shared.

This gives back about 250 - 300 MB of GPU memory.

 ---

Q: What is HW device context?
From https://ffmpeg.org/doxygen/4.1/structAVHWDeviceContext.html#details
> This struct aggregates all the (hardware/vendor-specific) "high-level" state, i.e.
>
> state that is not tied to a concrete processing configuration. E.g., in an API that supports hardware-accelerated encoding and decoding, this struct will (if possible) wrap the state that is common to both encoding and decoding and from which specific instances of encoders or decoders can be derived.

Pull Request resolved: pytorch#3215

Reviewed By: nateanl

Differential Revision: D44504051

Pulled By: mthrok

fbshipit-source-id: c52b4463af9ec6eeb01da85e7a4d6a47952aae1e
facebook-github-bot pushed a commit that referenced this pull request Mar 29, 2023
Summary:
In #3178, a mechanism to cache HW device context was introduced.
This commit applies the reuse in StreamWriter, so that
when using GPU video decoding and encoding, they are shared.

This gives back about 250 - 300 MB of GPU memory.

 ---

Q: What is HW device context?
From https://ffmpeg.org/doxygen/4.1/structAVHWDeviceContext.html#details
> This struct aggregates all the (hardware/vendor-specific) "high-level" state, i.e.
>
> state that is not tied to a concrete processing configuration. E.g., in an API that supports hardware-accelerated encoding and decoding, this struct will (if possible) wrap the state that is common to both encoding and decoding and from which specific instances of encoders or decoders can be derived.

Pull Request resolved: #3215

Reviewed By: nateanl

Differential Revision: D44504051

Pulled By: mthrok

fbshipit-source-id: 77579cdc8bd9e9b8a218e3f29031d091cda83860