Cache HW device context #3178

Closed · mthrok wants to merge 1 commit

Conversation

@mthrok (Collaborator) commented Mar 16, 2023

This commit adds a caching mechanism for the CUDA HW device context used in GPU video decoding.
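
Conceptually, the cache maps a CUDA device to its FFmpeg HW device context: the context is created on first use and kept alive until it is explicitly cleared. The actual implementation lives in torchaudio's C++ FFmpeg integration; the Python sketch below only illustrates the caching pattern, and every name in it is hypothetical.

_CTX_CACHE = {}  # device index -> cached HW device context


def _create_cuda_device_context(device_index):
    # Stand-in for the FFmpeg call (av_hwdevice_ctx_create) that the real
    # C++ code performs; creating the context holds GPU memory until released.
    return object()


def get_device_context(device_index):
    # Return the cached HW device context, creating it on first use.
    if device_index not in _CTX_CACHE:
        _CTX_CACHE[device_index] = _create_cuda_device_context(device_index)
    return _CTX_CACHE[device_index]


def clear_cache():
    # Drop all cached contexts so their GPU memory can be reclaimed.
    _CTX_CACHE.clear()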

The following table shows the decode time in seconds, without and with the patch, when decoding 3 seconds of HEVC video with CUVID on an NVIDIA GeForce RTX 3080.

n    Without Patch    With Patch
1        0.593            0.562
2        0.187            0.093
3        0.203            0.094
4        0.203            0.094
5        0.219            0.188
6        0.203            0.094
7        0.203            0.094
8        0.203            0.094

Note: the HW device context cache was cleared before the 5th iteration.

The first time the video decoder is used, additional one-time initialization takes place, so the first iteration is slower in both cases; in all subsequent iterations, caching the device context clearly improves the decoding speed.

With 30 seconds of HEVC video, the improvement is roughly 0.13 seconds per iteration.

n    Without Patch    With Patch
1        1.031            0.953
2        0.578            0.484
3        0.594            0.469
4        0.578            0.468
5        0.578            0.562
6        0.578            0.469
7        0.562            0.500
8        0.562            0.485

Note: the HW device context cache was cleared before the 5th iteration.

Memory-wise, the cached device context holds about 200 MB of GPU memory.
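
That memory can be given back on demand: the patch exposes torchaudio.utils.ffmpeg_utils.clear_cuda_context_cache, which the benchmark script below also uses. A minimal usage sketch (the try/except is there because the function only exists on builds that include this patch):

import torchaudio

# Release the cached CUDA HW device context (~200 MB of GPU memory).
try:
    torchaudio.utils.ffmpeg_utils.clear_cuda_context_cache()
except AttributeError:
    pass  # running a build without this patch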

code

The test data is generated with one of the following commands (run separately for the 3-second and the 30-second benchmark; both write test.hevc):

  • ffmpeg -f lavfi -i mandelbrot -t 3 -c:v libx265 -pix_fmt yuv420p10le -vtag hvc1 -y test.hevc
  • ffmpeg -f lavfi -i mandelbrot -t 30 -c:v libx265 -pix_fmt yuv420p10le -vtag hvc1 -y test.hevc

import subprocess
import time

import torch
import torchaudio
from torchaudio.io import StreamReader


def test():
    # Decode the entire file on GPU (CUVID decoder, frames kept on CUDA).
    src = "test.hevc"

    r = StreamReader(src=src)
    r.add_video_stream(
        frames_per_chunk=-1,
        decoder="hevc_cuvid",
        hw_accel="cuda",
    )
    r.process_all_packets()
    r.pop_chunks()


def report(msg):
    # Print a label followed by the current GPU memory usage.
    print(f"{msg:20s}", end="\t", flush=True)
    subprocess.run(["nvidia-smi", "--query-gpu=memory.used,utilization.memory", "--format=csv,noheader"])


def clear_hw_cache():
    # Clear the HW device context cache; the API only exists with this patch,
    # so AttributeError is expected on the upstream main branch.
    try:
        torchaudio.utils.ffmpeg_utils.clear_cuda_context_cache()
        report("clear hw context cache")
    except AttributeError:
        pass


print(torchaudio.__version__)

report("Start up")

# Initialize the CUDA runtime up front so its fixed overhead is not
# attributed to the first decode.
_ = torch.empty([1], device=torch.device("cuda"))
torch.cuda.empty_cache()
report("After dummy op")

for i in range(8):
    report(f"Start - {i}:")
    t0 = time.monotonic()
    test()
    elapsed = time.monotonic() - t0
    report(f"Finish - {elapsed:.3f} [sec]:")

    torch.cuda.empty_cache()
    report("Clear torch cuda cache")

    # Drop the cached HW device context before the 5th iteration.
    if i == 3:
        clear_hw_cache()

clear_hw_cache()

raw data (3sec)

Upstream main branch

2.0.0a0+a6b34a5
Start up                1213 MiB, 21 %
After dummy op          1427 MiB, 21 %
Start - 0:              1427 MiB, 21 %
Finish - 0.593 [sec]:   1737 MiB, 2 %
Clear torch cuda cache  1465 MiB, 2 %
Start - 1:              1465 MiB, 2 %
Finish - 0.187 [sec]:   1737 MiB, 1 %
Clear torch cuda cache  1465 MiB, 2 %
Start - 2:              1465 MiB, 2 %
Finish - 0.203 [sec]:   1737 MiB, 1 %
Clear torch cuda cache  1465 MiB, 1 %
Start - 3:              1465 MiB, 2 %
Finish - 0.203 [sec]:   1737 MiB, 2 %
Clear torch cuda cache  1465 MiB, 2 %
Start - 4:              1465 MiB, 2 %
Finish - 0.219 [sec]:   1737 MiB, 2 %
Clear torch cuda cache  1465 MiB, 2 %
Start - 5:              1465 MiB, 2 %
Finish - 0.203 [sec]:   1737 MiB, 2 %
Clear torch cuda cache  1465 MiB, 2 %
Start - 6:              1465 MiB, 2 %
Finish - 0.203 [sec]:   1737 MiB, 1 %
Clear torch cuda cache  1465 MiB, 2 %
Start - 7:              1465 MiB, 2 %
Finish - 0.203 [sec]:   1737 MiB, 1 %
Clear torch cuda cache  1465 MiB, 1 %

This commit

2.0.0a0+dea6566
Start up                1213 MiB, 43 %
After dummy op          1427 MiB, 15 %
Start - 0:              1427 MiB, 15 %
Finish - 0.562 [sec]:   1948 MiB, 0 %
Clear torch cuda cache  1676 MiB, 0 %
Start - 1:              1676 MiB, 2 %
Finish - 0.093 [sec]:   1948 MiB, 2 %
Clear torch cuda cache  1676 MiB, 2 %
Start - 2:              1676 MiB, 2 %
Finish - 0.094 [sec]:   1948 MiB, 2 %
Clear torch cuda cache  1676 MiB, 2 %
Start - 3:              1676 MiB, 2 %
Finish - 0.094 [sec]:   1948 MiB, 2 %
Clear torch cuda cache  1676 MiB, 2 %
clear hw context cache  1465 MiB, 2 %
Start - 4:              1465 MiB, 1 %
Finish - 0.188 [sec]:   1948 MiB, 2 %
Clear torch cuda cache  1676 MiB, 2 %
Start - 5:              1676 MiB, 2 %
Finish - 0.094 [sec]:   1948 MiB, 1 %
Clear torch cuda cache  1676 MiB, 1 %
Start - 6:              1676 MiB, 1 %
Finish - 0.094 [sec]:   1948 MiB, 1 %
Clear torch cuda cache  1676 MiB, 1 %
Start - 7:              1676 MiB, 2 %
Finish - 0.094 [sec]:   1948 MiB, 2 %
Clear torch cuda cache  1676 MiB, 2 %
clear hw context cache  1465 MiB, 2 %
raw data (30sec)

Upstream main branch

2.0.0a0+a6b34a5
Start up                1250 MiB, 38 %
After dummy op          1471 MiB, 12 %
Start - 0:              1471 MiB, 12 %
Finish - 1.031 [sec]:   4202 MiB, 7 %
Clear torch cuda cache  1502 MiB, 7 %
Start - 1:              1502 MiB, 7 %
Finish - 0.578 [sec]:   4202 MiB, 8 %
Clear torch cuda cache  1502 MiB, 8 %
Start - 2:              1502 MiB, 8 %
Finish - 0.594 [sec]:   4202 MiB, 7 %
Clear torch cuda cache  1502 MiB, 7 %
Start - 3:              1502 MiB, 7 %
Finish - 0.578 [sec]:   4202 MiB, 7 %
Clear torch cuda cache  1502 MiB, 5 %
Start - 4:              1502 MiB, 5 %
Finish - 0.578 [sec]:   4202 MiB, 7 %
Clear torch cuda cache  1502 MiB, 5 %
Start - 5:              1502 MiB, 5 %
Finish - 0.578 [sec]:   4202 MiB, 7 %
Clear torch cuda cache  1502 MiB, 7 %
Start - 6:              1502 MiB, 3 %
Finish - 0.562 [sec]:   4202 MiB, 8 %
Clear torch cuda cache  1502 MiB, 8 %
Start - 7:              1502 MiB, 8 %
Finish - 0.562 [sec]:   4202 MiB, 7 %
Clear torch cuda cache  1502 MiB, 7 %

This commit

2.0.0a0+dea6566
Start up                1252 MiB, 37 %
After dummy op          1466 MiB, 28 %
Start - 0:              1466 MiB, 28 %
Finish - 0.953 [sec]:   4415 MiB, 7 %
Clear torch cuda cache  1715 MiB, 6 %
Start - 1:              1715 MiB, 6 %
Finish - 0.484 [sec]:   4415 MiB, 8 %
Clear torch cuda cache  1715 MiB, 8 %
Start - 2:              1715 MiB, 8 %
Finish - 0.469 [sec]:   4415 MiB, 9 %
Clear torch cuda cache  1715 MiB, 9 %
Start - 3:              1715 MiB, 9 %
Finish - 0.468 [sec]:   4415 MiB, 7 %
Clear torch cuda cache  1715 MiB, 7 %
clear hw context cache  1504 MiB, 4 %
Start - 4:              1504 MiB, 4 %
Finish - 0.562 [sec]:   4415 MiB, 7 %
Clear torch cuda cache  1715 MiB, 7 %
Start - 5:              1715 MiB, 4 %
Finish - 0.469 [sec]:   4415 MiB, 7 %
Clear torch cuda cache  1715 MiB, 5 %
Start - 6:              1715 MiB, 5 %
Finish - 0.500 [sec]:   4415 MiB, 8 %
Clear torch cuda cache  1715 MiB, 8 %
Start - 7:              1715 MiB, 8 %
Finish - 0.485 [sec]:   4415 MiB, 7 %
Clear torch cuda cache  1715 MiB, 7 %
clear hw context cache  1504 MiB, 3 %

@facebook-github-bot (Contributor)

@mthrok has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

mthrok added a commit to mthrok/audio that referenced this pull request Mar 16, 2023
Summary:
TODO: add cache release

Pull Request resolved: pytorch#3178

Differential Revision: D44136275

Pulled By: mthrok

fbshipit-source-id: 202b687c246eab285b82768a8ee91a9f45d334d7
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D44136275

@mthrok mthrok marked this pull request as ready for review March 16, 2023 19:19
@mthrok mthrok requested a review from a team March 16, 2023 19:19
Summary:
TODO: add cache release

Pull Request resolved: pytorch#3178

Differential Revision: D44136275

Pulled By: mthrok

fbshipit-source-id: 002aec2dba734dec9a81778d200235ab940d1b73
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D44136275

@facebook-github-bot (Contributor)

@mthrok merged this pull request in 0c8c138.

@github-actions

Hey @mthrok.
You merged this PR, but labels were not properly added. Please add a primary and secondary label (See https://github.com/pytorch/audio/blob/main/.github/process_commit.py)

@mthrok mthrok deleted the hw_context branch March 17, 2023 23:58
mthrok added a commit to mthrok/audio that referenced this pull request Mar 27, 2023
In pytorch#3178, a mechanism to cache HW context was introduced.
This commit applies the reuse in StreamWriter, so that when
using GPU video decoding and encoding, they are shared.

This gives back about 250 - 300 MB of GPU memory.
mthrok added a commit to mthrok/audio that referenced this pull request Mar 29, 2023
Summary:
In pytorch#3178, a mechanism to cache HW device context was introduced.
This commit applies the reuse in StreamWriter, so that
when using GPU video decoding and encoding, they are shared.

This gives back about 250 - 300 MB of GPU memory.

 ---

Q: What is HW device context?
From https://ffmpeg.org/doxygen/4.1/structAVHWDeviceContext.html#details
> This struct aggregates all the (hardware/vendor-specific) "high-level" state, i.e.
>
> state that is not tied to a concrete processing configuration. E.g., in an API that supports hardware-accelerated encoding and decoding, this struct will (if possible) wrap the state that is common to both encoding and decoding and from which specific instances of encoders or decoders can be derived.

Pull Request resolved: pytorch#3215

Reviewed By: nateanl

Differential Revision: D44504051

Pulled By: mthrok

fbshipit-source-id: c52b4463af9ec6eeb01da85e7a4d6a47952aae1e
facebook-github-bot pushed a commit that referenced this pull request Mar 29, 2023
Summary:
In #3178, a mechanism to cache HW device context was introduced.
This commit applies the reuse in StreamWriter, so that
when using GPU video decoding and encoding, they are shared.

This gives back about 250 - 300 MB of GPU memory.

 ---

Q: What is HW device context?
From https://ffmpeg.org/doxygen/4.1/structAVHWDeviceContext.html#details
> This struct aggregates all the (hardware/vendor-specific) "high-level" state, i.e.
>
> state that is not tied to a concrete processing configuration. E.g., in an API that supports hardware-accelerated encoding and decoding, this struct will (if possible) wrap the state that is common to both encoding and decoding and from which specific instances of encoders or decoders can be derived.

Pull Request resolved: #3215

Reviewed By: nateanl

Differential Revision: D44504051

Pulled By: mthrok

fbshipit-source-id: 77579cdc8bd9e9b8a218e3f29031d091cda83860