memory_efficient_attention faster than flash attention 2 backend? #1180
Comments
Started to work on the pre-reqs: pytorch/pytorch#143515. But yeah, as of right now the most performant kernel we have in PyTorch is the cuDNN backend on H100.
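For reference, a minimal sketch of pinning SDPA to the cuDNN backend (assumes PyTorch 2.3+ with the torch.nn.attention API; shapes and dtype are illustrative):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Illustrative shapes: (batch, heads, seq_len, head_dim), bf16 on an H100.
q, k, v = (torch.randn(2, 16, 4096, 128, device="cuda", dtype=torch.bfloat16)
           for _ in range(3))

# Restrict SDPA to the cuDNN backend; with other backends disabled,
# unsupported inputs raise instead of silently falling back.
with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)
```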
@danthe3rd Yes, I'm using bf16 on H100. I tried the FlashAttention-2 replacement for memory_efficient_attention but couldn't see the expected speedup. FlashAttention-3 from the official repo https://github.com/Dao-AILab/flash-attention is much faster, but it's not an out-of-the-box replacement and requires fine-tuning.
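For context, a rough sketch of forcing the FlashAttention-based op in xFormers rather than letting the dispatcher choose (assumes xformers.ops.MemoryEfficientAttentionFlashAttentionOp is exposed by the installed version; shapes are illustrative):

```python
import torch
import xformers.ops as xops

# xFormers expects (batch, seq_len, heads, head_dim).
q, k, v = (torch.randn(2, 4096, 16, 128, device="cuda", dtype=torch.bfloat16)
           for _ in range(3))

# Default: xFormers auto-selects an op (cutlass/flash/...) for these inputs.
out_auto = xops.memory_efficient_attention(q, k, v)

# Explicitly request the FlashAttention forward/backward ops.
out_flash = xops.memory_efficient_attention(
    q, k, v, op=xops.MemoryEfficientAttentionFlashAttentionOp
)
```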
@drisspg I see, so SDPBackend.CUDNN_ATTENTION is the fastest? Even faster than FA-2? What about A100s and A10s? Any other way to speed this up?
SDPBackend.CUDNN_ATTENTION is the fastest implementation currently supported for SDPA and is meant for H100+ GPUs. For A100s and A10s, FAv2 is still your best bet.
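A rough sketch of that recommendation, choosing the SDPA backend from the GPU's compute capability (the sm90 cutoff is an assumption inferred from the advice above):

```python
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

def preferred_sdpa_backend() -> SDPBackend:
    # Assumption: sm90 (H100) and newer -> cuDNN attention;
    # A100/A10 (sm80/sm86) -> FlashAttention-2.
    major, _ = torch.cuda.get_device_capability()
    return SDPBackend.CUDNN_ATTENTION if major >= 9 else SDPBackend.FLASH_ATTENTION

# Usage: wrap the attention call exactly as in the earlier snippet.
# with sdpa_kernel(preferred_sdpa_backend()):
#     out = F.scaled_dot_product_attention(q, k, v)
```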
Curious to learn more about this line.
For xFormers, just set …
❓ Questions and Help
I expected it to be the other way around. What is the fastest kernel I can use here?
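One way to settle this on your own hardware is to time each SDPA backend directly; a rough benchmarking sketch (shapes, dtype, and iteration counts are arbitrary assumptions):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q, k, v = (torch.randn(2, 16, 4096, 128, device="cuda", dtype=torch.bfloat16)
           for _ in range(3))

def time_backend(backend, iters=50):
    with sdpa_kernel(backend):
        # Warm-up, then timed iterations measured with CUDA events.
        for _ in range(5):
            F.scaled_dot_product_attention(q, k, v)
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            F.scaled_dot_product_attention(q, k, v)
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters  # ms per call

for backend in (SDPBackend.FLASH_ATTENTION,
                SDPBackend.EFFICIENT_ATTENTION,
                SDPBackend.CUDNN_ATTENTION):
    try:
        print(backend, f"{time_backend(backend):.3f} ms")
    except RuntimeError as e:
        print(backend, "not supported for these inputs:", e)
```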