About Visual Prompt Encoder. #83

Open

fuweifu-vtoo opened this issue Aug 9, 2024 · 3 comments

Comments

@fuweifu-vtoo
Dear author, I have another question for you:

In the Visual Prompt Encoder, does it stack three deformable cross-attention layers and then append a single self-attention layer and a single FFN?

Or does it stack three blocks of (Deformable cross attention + self attention + FFN)?

@Mountchicken
Collaborator

three blocks of (Deformable cross attention + self attention + FFN)

@pisiguiii

pisiguiii commented Sep 23, 2024

Hi @Mountchicken

Previously you referred to code from Grounding DINO (#85 (comment)), specifically the DeformableTransformerDecoderLayer class.
I would like to clarify: when you say "Deformable cross attention", do you mean the whole DeformableTransformerDecoderLayer, or only the self.cross_attn module from that class?

If I understood correctly, then:
DeformableTransformerDecoderLayer == (Deformable cross attention + self attention + FFN)
Am I right in this conclusion?

@Mountchicken
Collaborator

The visual prompt encoder consists of several DeformableTransformerDecoderLayer layers.
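
For readers landing here later, below is a minimal PyTorch sketch of the block structure described in this thread (cross attention + self attention + FFN, stacked three times), not the repository's actual code. The class names, dimensions, and num_layers=3 are illustrative, and a plain nn.MultiheadAttention stands in for the multi-scale deformable cross-attention used in Grounding DINO's DeformableTransformerDecoderLayer, which needs custom ops and reference points.

```python
# Illustrative sketch only; layer/argument names are assumptions, not the repo's API.
import torch
import torch.nn as nn


class VisualPromptEncoderBlock(nn.Module):
    """One block = cross-attention + self-attention + FFN (per this thread)."""

    def __init__(self, d_model=256, n_heads=8, d_ffn=2048, dropout=0.1):
        super().__init__()
        # Stand-in for the deformable cross-attention over image features.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        # Self-attention among the visual prompt queries.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        # Feed-forward network.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ffn),
            nn.ReLU(inplace=True),
            nn.Dropout(dropout),
            nn.Linear(d_ffn, d_model),
        )
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, queries, memory):
        # queries: (B, num_prompts, d_model); memory: (B, HW, d_model) image features
        queries = self.norm1(queries + self.cross_attn(queries, memory, memory)[0])
        queries = self.norm2(queries + self.self_attn(queries, queries, queries)[0])
        queries = self.norm3(queries + self.ffn(queries))
        return queries


class VisualPromptEncoder(nn.Module):
    """Stack of several such blocks (three, according to the answers above)."""

    def __init__(self, num_layers=3, d_model=256):
        super().__init__()
        self.layers = nn.ModuleList(VisualPromptEncoderBlock(d_model) for _ in range(num_layers))

    def forward(self, queries, memory):
        for layer in self.layers:
            queries = layer(queries, memory)
        return queries


# Usage example with dummy tensors.
encoder = VisualPromptEncoder(num_layers=3, d_model=256)
prompts = torch.randn(2, 5, 256)     # 5 visual prompt queries per image
features = torch.randn(2, 1024, 256)  # flattened image features
out = encoder(prompts, features)      # (2, 5, 256)
```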
