
Reuse existing cuda context if possible when creating decoders #263

Merged: 3 commits merged into pytorch:main from cuda6 on Oct 16, 2024

Conversation

@ahmadsharif1 (Contributor) commented on Oct 15, 2024

Creating a CUDA context is slow and consumes about 400 MB of VRAM.

This PR ensures we reuse PyTorch's existing CUDA context when creating decoders.

Thank you @fmassa for pointing out this issue and helping to resolve it.
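
For context, here is a minimal sketch of the technique (illustrative only, not the PR's actual code; createCudaHwDeviceCtx is a made-up helper name). With libavutil >= 58.26.100, av_hwdevice_ctx_create() accepts the AV_CUDA_USE_CURRENT_CONTEXT flag, which adopts the CUDA context already current on the calling thread instead of allocating a new one:

// Sketch only; assumes libavutil >= 58.26.100.
#include <algorithm>
#include <string>
#include <c10/core/Device.h>
#include <c10/cuda/CUDAGuard.h>
extern "C" {
#include <libavutil/hwcontext.h>
#include <libavutil/hwcontext_cuda.h> // AV_CUDA_USE_CURRENT_CONTEXT
}

AVBufferRef* createCudaHwDeviceCtx(const c10::Device& device) {
  // Make PyTorch's CUDA context current on this thread so FFmpeg can adopt it.
  c10::cuda::CUDAGuard deviceGuard(device);
  // FFmpeg expects a non-negative device ordinal; torch may report -1 for
  // "current device", so clamp it (see the review discussion below).
  std::string ordinal =
      std::to_string(std::max<int>(static_cast<int>(device.index()), 0));
  AVBufferRef* hwDeviceCtx = nullptr;
  int err = av_hwdevice_ctx_create(
      &hwDeviceCtx,
      AV_HWDEVICE_TYPE_CUDA,
      ordinal.c_str(),
      /*opts=*/nullptr,
      AV_CUDA_USE_CURRENT_CONTEXT); // reuse instead of create
  return err < 0 ? nullptr : hwDeviceCtx;
}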

Benchmark results show a decent speed-up, especially for short videos:

Before:

python benchmarks/decoders/gpu_benchmark.py --devices=cuda:0 --resize_devices=none --video ~/jupyter/frame_numbers_1920x1080.mp4

[------------------ Decode+Resize Time -----------------]
                     |  video=frame_numbers_1920x1080.mp4
1 threads: ----------------------------------------------
      D=cuda R=none  |                 2.0               

Times are in seconds (s).

After:

python benchmarks/decoders/gpu_benchmark.py --devices=cuda:0 --resize_devices=none --video ~/jupyter/frame_numbers_1920x1080.mp4

[------------------ Decode+Resize Time -----------------]
                     |  video=frame_numbers_1920x1080.mp4
1 threads: ----------------------------------------------
      D=cuda R=none  |                 1.3               

Times are in seconds (s).

This makes GPU decoding of a single video competitive with the CPU, even without resizing:

[------------------ Decode+Resize Time -----------------]
                     |  video=frame_numbers_1920x1080.mp4
1 threads: ----------------------------------------------
      D=cuda R=none  |                 1.5               
      D=cpu R=none   |                 2.8               

Times are in seconds (s).

@ahmadsharif1 marked this pull request as ready for review on October 16, 2024 at 13:20
@NicolasHug (Member) left a comment:
Thanks @ahmadsharif1

return hw_device_ctx;
}

// 58.26.100 introduced the concept of reusing the existing cuda context
@NicolasHug (Member):
Can we clarify in the comment which major ffmpeg version 58 corresponds to?

@ahmadsharif1 (Contributor, Author):
I was hesitant to put that in here because it could get stale: different av* libraries get linked into different releases, and there are minor releases too. I've added it anyway, but it could still become outdated.
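
For reference, a hedged sketch of what such a version gate could look like (the avutil-to-FFmpeg mapping in the comment is exactly the part that can go stale; libavutil 58 ships with the FFmpeg 6.x release series):

#include <libavutil/version.h>

// libavutil 58.26.100 added support for reusing the current CUDA context;
// libavutil 58.x corresponds to the FFmpeg 6.x releases. This mapping may
// go stale as new releases ship.
#if LIBAVUTIL_VERSION_INT >= AV_VERSION_INT(58, 26, 100)
// Fast path: reuse the CUDA context current on this thread.
#else
// Fallback: let FFmpeg create its own CUDA context.
#endif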

c10::cuda::CUDAGuard deviceGuard(device);
// Valid values for the argument to cudaSetDevice are 0 to maxDevices - 1:
// https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g159587909ffa0791bbe4b40187a4c6bb
// So we ensure the deviceIndex is not negative.
@NicolasHug (Member):
Sorry for the noob Q - where are we ensuring this?

@ahmadsharif1 (Contributor, Author):
The caller makes sure of that: it calls std::max on the deviceIndex. I'll rename this variable to ffmpegCompatibleDeviceIndex so it's clear the max was already applied.
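
A sketch of that caller-side clamp (illustrative; the helper name is an assumption, not the PR's code):

#include <algorithm>
#include <torch/types.h> // torch::Device, torch::DeviceIndex

torch::DeviceIndex toFfmpegCompatibleIndex(const torch::Device& device) {
  // torch uses deviceIndex == -1 to mean "current device", but cudaSetDevice
  // (and FFmpeg's device ordinal) require a value in [0, maxDevices - 1].
  return std::max<torch::DeviceIndex>(device.index(), 0);
}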

const torch::Device& device,
torch::DeviceIndex deviceIndex,
enum AVHWDeviceType type) {
c10::cuda::CUDAGuard deviceGuard(device);
@NicolasHug (Member):
For my own understanding, are there existing docs (from ffmpeg or nvidia) that explain why deviceGuard() and cudaSetDevice() are needed?

@ahmadsharif1 (Contributor, Author):
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g159587909ffa0791bbe4b40187a4c6bb documents cudaSetDevice.

As to why it's needed: a CUDA context isn't current on a secondary thread, so we make it available there before trying to reuse it.
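
To illustrate (a sketch reusing the hypothetical createCudaHwDeviceCtx helper from the earlier sketch):

#include <thread>

void createDecoderOnWorkerThread(const c10::Device& device) {
  std::thread worker([device] {
    // A freshly spawned thread has no CUDA context current, so
    // AV_CUDA_USE_CURRENT_CONTEXT would find nothing to adopt here. The
    // CUDAGuard inside the helper makes the device (and hence PyTorch's
    // context) current on this thread before FFmpeg tries to reuse it.
    AVBufferRef* hwCtx = createCudaHwDeviceCtx(device);
    // ... hand hwCtx to the decoder ...
    if (hwCtx) {
      av_buffer_unref(&hwCtx); // placeholder cleanup for the sketch
    }
  });
  worker.join();
}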

@ahmadsharif1 ahmadsharif1 merged commit f8cbb62 into pytorch:main Oct 16, 2024
22 checks passed
@ahmadsharif1 ahmadsharif1 deleted the cuda6 branch October 16, 2024 14:11