
why is xformers not used for attention computation? #608

Open
jason718 opened this issue Oct 9, 2024 · 4 comments
Labels
question Further information is requested

Comments


jason718 commented Oct 9, 2024

Curious why xformers is not used? Is it for simplicity, or is there a performance reason?

Contributor

awgu commented Oct 9, 2024

F.scaled_dot_product_attention calls into flash attention or memory-efficient attention depending on some factors (it should be mainly flash attention for the torchtitan case, if I understand correctly). Are there other ops that you have in mind?
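
For context, a minimal sketch (assuming PyTorch 2.3+, where `torch.nn.attention.sdpa_kernel` is available, and a CUDA device) of restricting SDPA to the flash / memory-efficient backends to see which kernel actually serves the call:

```python
# Minimal sketch, assuming PyTorch 2.3+ with CUDA: restrict SDPA to the
# flash / memory-efficient kernels; SDPA errors out if neither can be used.
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# (batch, heads, seq, head_dim) layout expected by SDPA
q, k, v = (torch.randn(2, 16, 1024, 64, device="cuda", dtype=torch.bfloat16)
           for _ in range(3))

with sdpa_kernel([SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION]):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```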

@casper-hansen

@awgu It looks like xformers has support for Flash Attention v3 starting from 0.0.28 (flash3.FwOp and flash3.BwOp). It could bring extra training efficiency on the Hopper architecture, as FA3 is not implemented in PyTorch yet.

As I read the blog post, this brings a 1.6x–1.8x speedup over FAv2.

[Speedup chart from the FlashAttention-3 blog post]
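
For reference, a rough sketch of how those ops could be selected explicitly. It assumes xformers >= 0.0.28 built with the FA3 extension on Hopper, and that the flash3 ops live under `xformers.ops.fmha` as named above; this is an illustration, not code from this repo:

```python
# Rough sketch, assuming xformers >= 0.0.28 with the FlashAttention-3 ops built
# (Hopper only). Note xformers expects (batch, seq, heads, head_dim) layout.
import torch
import xformers.ops as xops
from xformers.ops import fmha

q, k, v = (torch.randn(2, 1024, 16, 64, device="cuda", dtype=torch.bfloat16)
           for _ in range(3))

# Pin the forward/backward kernels to the FA3 ops mentioned above.
out = xops.memory_efficient_attention(
    q, k, v, op=(fmha.flash3.FwOp, fmha.flash3.BwOp)
)
```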

Contributor

awgu commented Oct 11, 2024

@casper-hansen Makes sense!

I guess it should not be too hard for users to install xformers and replace the F.scaled_dot_product_attention call with the xformers attention call. This should work as long as the xformers attention is torch.compile-compatible, which I recall it is.

Since torchtitan is mainly for showing an example of how to set this kind of distributed training up, I think including xformers attention is not as important as showing what is achievable with torch native.
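
To illustrate the kind of swap meant here, a hypothetical drop-in wrapper (the function name and layout handling are assumptions, not torchtitan code); the main wrinkle is that SDPA uses a (batch, heads, seq, head_dim) layout while xformers expects (batch, seq, heads, head_dim):

```python
# Hypothetical drop-in replacement for an F.scaled_dot_product_attention call;
# a sketch only, not actual torchtitan code.
import torch
import xformers.ops as xops

def sdpa_via_xformers(q, k, v, is_causal: bool = True):
    # SDPA layout is (B, H, S, D); xformers expects (B, S, H, D).
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))
    bias = xops.LowerTriangularMask() if is_causal else None
    out = xops.memory_efficient_attention(q, k, v, attn_bias=bias)
    return out.transpose(1, 2)  # back to (B, H, S, D)

# If xformers attention is torch.compile-compatible as recalled above,
# the wrapper can be compiled along with the rest of the model.
compiled_attn = torch.compile(sdpa_via_xformers)
```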


Chillee commented Oct 13, 2024

@casper-hansen On H100, F.scaled_dot_product_attention calls into cuDNN attention, which has a much smaller gap in performance with FA3.
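
For anyone who wants to verify this on their own H100 setup, a small sketch (assuming PyTorch 2.5+, where SDPBackend.CUDNN_ATTENTION is exposed) that pins SDPA to the cuDNN kernel so it can be timed against an FA3 call:

```python
# Sketch, assuming PyTorch 2.5+ on H100: force SDPA onto the cuDNN backend.
# Raises an error if the cuDNN attention kernel cannot be used for these inputs.
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q, k, v = (torch.randn(2, 16, 4096, 128, device="cuda", dtype=torch.bfloat16)
           for _ in range(3))

with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```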

tianyu-l added the question label Oct 15, 2024