
Possible premature temporary removal of flash attention? #1809

Open
BBC-Esq opened this issue Oct 28, 2024 · 1 comment


BBC-Esq commented Oct 28, 2024

I was sifting through the cuDNN documentation and came across these snippets:

"cuDNN BF16 and FP16 Fused Flash Attention now supports embedding dim = 256 use cases in forward propagation.

Expanded support of FP16 and BF16 Fused Flash Attention by adding the sliding window attention feature on NVIDIA Ampere and Hopper GPUs. For more information, refer to the cuDNN Developer Guide."

This is from the release notes for cuDNN 9.1.1 here:

https://docs.nvidia.com/deeplearning/cudnn/v9.1.1/release-notes.html#cudnn-9-1-1

At the time that CTranslate2 supported flash attention, it relied on cuDNN 8.8.0...

Flash attention was removed from the pypi.org release due to (1) the file size and (2) the minimal benefit it provided. Regarding the second point, perhaps the benefit was minimal because, at the time, CTranslate2 did not rely on cuDNN 9.1.1, which is the first version whose release notes list these fused flash attention improvements?
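For context, here is a minimal sketch of how the feature would be exercised, assuming the CTranslate2 4.x Python API where the `Translator` constructor accepts a `flash_attention` flag; the model path is a placeholder for any locally converted model, and the flag only has an effect in builds compiled with flash attention support:

```python
import ctranslate2

# Sketch only: "ende_ctranslate2" is a placeholder for a converted model directory.
translator = ctranslate2.Translator(
    "ende_ctranslate2",
    device="cuda",
    compute_type="float16",   # fused flash attention targets FP16/BF16
    flash_attention=True,     # assumed flag; no effect on builds without FA support
)

# Tokens are illustrative; use your model's tokenizer output in practice.
results = translator.translate_batch([["▁Hello", "▁world", "!"]])
print(results[0].hypotheses[0])
```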

MahmoudAshraf97 (Contributor) commented:

#1651 (comment)
