
Question: What are some recommended better alternatives to RPB? #105

Open
fzimmermann89 opened this issue Jun 4, 2024 · 5 comments

@fzimmermann89

First of all, thank you for providing this library.

I want to move a 2D Swin image->image model to neighbourhood attention. So far, I have been using the relative positional embeddings as in the original Swin repo.

Both in the issues and in the documentation of the fused attention, you mention that there will most likely never be an implementation of RPB in the fused kernels, and that there are better alternatives.
... Could you maybe give me some pointers to techniques that, in your experience, work well with neighborhood attention?

Cheers
Felix

@alihassanijr
Member

Thank you; I'm very glad you found it useful.

With regard to RPB, yes, there actually are very good alternatives that bias the inputs to the attention operator instead of attention weights, and they not only provide similar or better accuracy than RPB, but are also easier to train and (usually) cheaper. This is actually what made us not bother with RPB / attention bias, because it usually defeats the purpose of kernel fusion and further bottlenecks an already complicated backwards kernel.

We're going to push out a new preprint in the coming weeks that directly addresses this, and of course everything will be open sourced at that time.
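A minimal sketch of the contrast being described, with toy shapes and a hypothetical bias table rather than NATTEN's actual code: RPB adds a learned, relative-position-indexed bias to the attention logits, which a fused kernel would have to apply (and differentiate through) per attention weight, while input-biasing alternatives transform the queries and keys before the matmul and leave the kernel itself bias-free.

```python
# Toy illustration of "bias the attention weights" (RPB) vs. "bias the inputs".
# Shapes and the bias table are assumptions for this sketch, not NATTEN's code.
import torch

B, H, N, D = 2, 4, 64, 32                   # batch, heads, tokens, head dim
q = torch.randn(B, H, N, D)
k = torch.randn(B, H, N, D)

logits = q @ k.transpose(-2, -1) / D ** 0.5             # [B, H, N, N]

# RPB-style: a learned table indexed by relative position is added to the logits,
# so the bias has to live inside the (fused) attention kernel, forward and backward.
rpb = torch.randn(H, 2 * N - 1)                                       # hypothetical 1-D bias table
rel = torch.arange(N)[:, None] - torch.arange(N)[None, :] + (N - 1)   # relative index in [0, 2N-2]
logits_with_rpb = logits + rpb[:, rel]                                # bias applied to the weights

# Input-biasing alternatives (e.g. rotary embeddings, sketched later in this thread)
# instead modify q and k before the matmul, so the attention kernel stays bias-free.
```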

@zaptrem

zaptrem commented Jul 6, 2024

> Thank you; I'm very glad you found it useful.
>
> With regard to RPB, yes, there actually are very good alternatives that bias the inputs to the attention operator instead of attention weights, and they not only provide similar or better accuracy than RPB, but are also easier to train and (usually) cheaper. This is actually what made us not bother with RPB / attention bias, because it usually defeats the purpose of kernel fusion and further bottlenecks an already complicated backwards kernel.
>
> We're going to push out a new preprint in the coming weeks that directly addresses this, and of course everything will be open sourced at that time.

Are you saying there are existing techniques that are better (in which case, could you name them explicitly so we can use them?), or that you have invented a new one (which you'd understandably like to publish alongside your preprint)?

Also, do these techniques support inference on unseen sequence lengths (like ConvNeXT)? Thanks!

@alihassanijr
Member

Yes, rotary embeddings, if tuned correctly, often outperform RPB, and they are easier to implement and to optimize for performance in a lot of ways.
And I can't speak to ConvNeXt, but we've found that rotary embeddings are also more stable than RPB when dealing with varying sequence lengths.
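As a rough illustration of the rotary-embedding route, here is a standard 1D RoPE sketch applied to the queries and keys before attention; the helper and shapes below are assumptions for the example, not NAT/DiNAT's or NATTEN's actual API. For a 2D image model like the one in the question, RoPE is typically applied separately along the row and column axes, each rotation acting on half of the head channels.

```python
# Minimal rotary embedding (RoPE) sketch in PyTorch; names and shapes are
# illustrative assumptions, not NATTEN's API. Positions are encoded by rotating
# (x1, x2) channel pairs, so relative position falls out of the q·k dot product.
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary embeddings along the token axis of x: [..., tokens, head_dim]."""
    *_, n, d = x.shape
    half = d // 2
    inv_freq = 1.0 / (base ** (torch.arange(half, dtype=x.dtype) / half))
    angles = torch.arange(n, dtype=x.dtype)[:, None] * inv_freq[None, :]   # [tokens, half]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(2, 4, 64, 32)    # [batch, heads, tokens, head_dim]
k = torch.randn(2, 4, 64, 32)
q, k = rope(q), rope(k)          # bias the inputs, then run (neighborhood) attention as usual
```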

@mliuschi

> We're going to push out a new preprint in the coming weeks that directly addresses this, and of course everything will be open sourced at that time.

@alihassanijr Would you happen to have a link to the preprint? I'm also curious to learn more about alternatives to RPB for neighborhood attention. Thanks!

@alihassanijr alihassanijr transferred this issue from SHI-Labs/NATTEN Oct 1, 2024
@alihassanijr alihassanijr reopened this Oct 1, 2024
@alihassanijr
Member

I moved this issue here since it is more related to NAT/DiNAT than to NATTEN.

We'll be updating this thread soon.
