
Reuse existing cuda context if possible when creating decoders #263

Merged: 3 commits merged into pytorch:main from cuda6 on Oct 16, 2024

Conversation

@ahmadsharif1 (Contributor) commented on Oct 15, 2024

Creating a CUDA context is slow and consumes about 400 MB of VRAM.

This PR ensures we reuse PyTorch's existing CUDA context when creating decoders.

Thank you @fmassa for pointing out this issue and helping to resolve it.
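
For context, here is a minimal sketch of the technique (illustrative only, not the PR's actual code; createCudaHwDeviceCtx is a made-up helper name). With libavutil >= 58.26.100, av_hwdevice_ctx_create() accepts the AV_CUDA_USE_CURRENT_CONTEXT flag, which adopts the CUDA context already current on the calling thread instead of allocating a new one:

// Sketch only; assumes libavutil >= 58.26.100.
#include <algorithm>
#include <string>
#include <c10/core/Device.h>
#include <c10/cuda/CUDAGuard.h>
extern "C" {
#include <libavutil/hwcontext.h>
#include <libavutil/hwcontext_cuda.h> // AV_CUDA_USE_CURRENT_CONTEXT
}

AVBufferRef* createCudaHwDeviceCtx(const c10::Device& device) {
  // Make PyTorch's CUDA context current on this thread so FFmpeg can adopt it.
  c10::cuda::CUDAGuard deviceGuard(device);
  // FFmpeg expects a non-negative device ordinal; torch may report -1 for
  // "current device", so clamp it (see the review discussion below).
  std::string ordinal =
      std::to_string(std::max<int>(static_cast<int>(device.index()), 0));
  AVBufferRef* hwDeviceCtx = nullptr;
  int err = av_hwdevice_ctx_create(
      &hwDeviceCtx,
      AV_HWDEVICE_TYPE_CUDA,
      ordinal.c_str(),
      /*opts=*/nullptr,
      AV_CUDA_USE_CURRENT_CONTEXT); // reuse instead of create
  return err < 0 ? nullptr : hwDeviceCtx;
}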

Benchmark results show a decent speed-up, especially for short videos:

Before:

python benchmarks/decoders/gpu_benchmark.py --devices=cuda:0 --resize_devices=none --video ~/jupyter/frame_numbers_1920x1080.mp4

[------------------ Decode+Resize Time -----------------]
                     |  video=frame_numbers_1920x1080.mp4
1 threads: ----------------------------------------------
      D=cuda R=none  |                 2.0               

Times are in seconds (s).

After:

python benchmarks/decoders/gpu_benchmark.py --devices=cuda:0 --resize_devices=none --video ~/jupyter/frame_numbers_1920x1080.mp4

[------------------ Decode+Resize Time -----------------]
                     |  video=frame_numbers_1920x1080.mp4
1 threads: ----------------------------------------------
      D=cuda R=none  |                 1.3               

Times are in seconds (s).

This makes GPU decoding of a single video competitive with the CPU, even without resizing:

[------------------ Decode+Resize Time -----------------]
                     |  video=frame_numbers_1920x1080.mp4
1 threads: ----------------------------------------------
      D=cuda R=none  |                 1.5               
      D=cpu R=none   |                 2.8               

Times are in seconds (s).

@ahmadsharif1 marked this pull request as ready for review on October 16, 2024 at 13:20
@NicolasHug (Member) left a comment:
Thanks @ahmadsharif1

return hw_device_ctx;
}

// 58.26.100 introduced the concept of reusing the existing cuda context
@NicolasHug (Member):
Can we clarify in the comment which major ffmpeg version 58 corresponds to?

@ahmadsharif1 (Contributor, Author):
I was hesitant to put that in here because it could get stale: different av* libraries get linked into different releases, and there are minor releases too. I've added it anyway, but it could still become outdated.
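
For reference, a hedged sketch of what such a version gate could look like (the avutil-to-FFmpeg mapping in the comment is exactly the part that can go stale; libavutil 58 ships with the FFmpeg 6.x release series):

#include <libavutil/version.h>

// libavutil 58.26.100 added support for reusing the current CUDA context;
// libavutil 58.x corresponds to the FFmpeg 6.x releases. This mapping may
// go stale as new releases ship.
#if LIBAVUTIL_VERSION_INT >= AV_VERSION_INT(58, 26, 100)
// Fast path: reuse the CUDA context current on this thread.
#else
// Fallback: let FFmpeg create its own CUDA context.
#endif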

c10::cuda::CUDAGuard deviceGuard(device);
// Valid values for the argument to cudaSetDevice are 0 to maxDevices - 1:
// https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g159587909ffa0791bbe4b40187a4c6bb
// So we ensure the deviceIndex is not negative.
@NicolasHug (Member):
Sorry for the noob Q - where are we ensuring this?

@ahmadsharif1 (Contributor, Author):
The caller makes sure of that: it calls std::max on the deviceIndex. I'll rename this variable to ffmpegCompatibleDeviceIndex so it's clear the max was already applied.
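
A sketch of that caller-side clamp (illustrative; the helper name is an assumption, not the PR's code):

#include <algorithm>
#include <torch/types.h> // torch::Device, torch::DeviceIndex

torch::DeviceIndex toFfmpegCompatibleIndex(const torch::Device& device) {
  // torch uses deviceIndex == -1 to mean "current device", but cudaSetDevice
  // (and FFmpeg's device ordinal) require a value in [0, maxDevices - 1].
  return std::max<torch::DeviceIndex>(device.index(), 0);
}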

const torch::Device& device,
torch::DeviceIndex deviceIndex,
enum AVHWDeviceType type) {
c10::cuda::CUDAGuard deviceGuard(device);
@NicolasHug (Member):
For my own understanding, are there existing docs (from ffmpeg or nvidia) that explain why deviceGuard() and cudaSetDevice() are needed?

@ahmadsharif1 (Contributor, Author):
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g159587909ffa0791bbe4b40187a4c6bb documents cudaSetDevice.

As to why it's needed: a CUDA context isn't current on a secondary thread, so we make it available there before trying to reuse it.
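
To illustrate (a sketch reusing the hypothetical createCudaHwDeviceCtx helper from the earlier sketch):

#include <thread>

void createDecoderOnWorkerThread(const c10::Device& device) {
  std::thread worker([device] {
    // A freshly spawned thread has no CUDA context current, so
    // AV_CUDA_USE_CURRENT_CONTEXT would find nothing to adopt here. The
    // CUDAGuard inside the helper makes the device (and hence PyTorch's
    // context) current on this thread before FFmpeg tries to reuse it.
    AVBufferRef* hwCtx = createCudaHwDeviceCtx(device);
    // ... hand hwCtx to the decoder ...
    if (hwCtx) {
      av_buffer_unref(&hwCtx); // placeholder cleanup for the sketch
    }
  });
  worker.join();
}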

@ahmadsharif1 ahmadsharif1 merged commit f8cbb62 into pytorch:main Oct 16, 2024
22 checks passed
@ahmadsharif1 ahmadsharif1 deleted the cuda6 branch October 16, 2024 14:11