Reuse existing cuda context if possible when creating decoders #263
Conversation
Thanks @ahmadsharif1
```cpp
  return hw_device_ctx;
}

// 58.26.100 introduced the concept of reusing the existing cuda context
```
Can we clarify in the comment which major ffmpeg version 58 corresponds to?
I was hesitant to put that in here because it could get stale: different av* libraries get linked to different ffmpeg releases, and there are minor releases too. I added it anyway, but it could still become outdated over time.
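The version gate being discussed can be sketched in isolation. This is a hedged illustration, not the PR's actual code: the helper names are made up, and `versionInt` mirrors the packing scheme of libavutil's `AV_VERSION_INT` macro (`major << 16 | minor << 8 | micro`) so the comparison works like FFmpeg's own version checks.

```cpp
#include <cassert>

// Illustrative mirror of libavutil's AV_VERSION_INT packing scheme
// (major << 16 | minor << 8 | micro). Names here are hypothetical.
constexpr int versionInt(int major, int minor, int micro) {
  return (major << 16) | (minor << 8) | micro;
}

// Returns true if the linked libavutil is at least 58.26.100, the
// version the diff comment says introduced cuda context reuse.
constexpr bool canReuseCudaContext(int major, int minor, int micro) {
  return versionInt(major, minor, micro) >= versionInt(58, 26, 100);
}
```

Packing all three components into one integer makes the check a single comparison, which is why minor and micro releases matter too, not just the major version.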
```cpp
  c10::cuda::CUDAGuard deviceGuard(device);
  // Valid values for the argument to cudaSetDevice are 0 to maxDevices - 1:
  // https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g159587909ffa0791bbe4b40187a4c6bb
  // So we ensure the deviceIndex is not negative.
```
Sorry for the noob Q - where are we ensuring this?
The caller makes sure of that: it calls std::max on this deviceIndex. I'll rename this variable to ffmpegCompatibleDeviceIndex so it's clear the max was already applied.
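The clamping described in the reply can be sketched as follows. This is an illustrative stand-alone version, not the PR's code: torch conventionally uses a negative device index to mean "current device", while `cudaSetDevice` only accepts `0` to `maxDevices - 1`, so the caller clamps before passing the index down.

```cpp
#include <algorithm>

// Hypothetical helper named after the variable rename proposed in the
// review. Clamps a torch-style device index (which may be -1, meaning
// "current device") into the 0..maxDevices-1 range cudaSetDevice accepts.
inline int ffmpegCompatibleDeviceIndex(int deviceIndex) {
  return std::max(deviceIndex, 0);
}
```

Doing the clamp once at the call site, and naming the result accordingly, is what makes the downstream comment "we ensure the deviceIndex is not negative" accurate.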
```cpp
    const torch::Device& device,
    torch::DeviceIndex deviceIndex,
    enum AVHWDeviceType type) {
  c10::cuda::CUDAGuard deviceGuard(device);
```
For my own understanding, are there existing docs (from ffmpeg or nvidia) that explain why deviceGuard() and cudaSetDevice() are needed?
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g159587909ffa0791bbe4b40187a4c6bb tells you about cudaSetDevice.

As to why it's needed: a cuda context isn't available in a secondary thread, so we make it available there before trying to reuse it.
Creating a cuda context is slow and takes about 400 MB of VRAM on the GPU. This PR ensures we reuse the existing cuda context from PyTorch when creating decoders.

Thank you @fmassa for pointing out this issue and helping to resolve it.

Benchmark results show a decent speed-up, especially for short videos:
Before:
After:
This makes decoding single videos, even without resizing, competitive with CPU decoding: