About Visual Prompt Encoder. #83

Open

fuweifu-vtoo opened this issue Aug 9, 2024 · 3 comments

Comments

@fuweifu-vtoo
Dear author, I have another question for you:

In the Visual Prompt Encoder, does it stack three deformable cross-attention layers and then append a single self-attention layer and a single FFN?

Or does it stack three blocks of (Deformable cross attention + self attention + FFN)?

@Mountchicken
Collaborator

three blocks of (Deformable cross attention + self attention + FFN)

@pisiguiii

pisiguiii commented Sep 23, 2024

Hi @Mountchicken

Previously you referred to code from Grounding DINO (#85 (comment)), specifically the DeformableTransformerDecoderLayer class.
I would like to clarify: when you say "Deformable cross attention", do you mean the whole DeformableTransformerDecoderLayer, or only the self.cross_attn module from that class?

If I understood correctly, then:
DeformableTransformerDecoderLayer == (Deformable cross attention + self attention + FFN)
Am I right in this conclusion?

@Mountchicken
Collaborator

The visual prompt encoder consists of several DeformableTransformerDecoderLayer layers.
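
For readers landing here later, below is a minimal PyTorch sketch of the block structure described in this thread (cross attention + self attention + FFN, stacked three times), not the repository's actual code. The class names, dimensions, and num_layers=3 are illustrative, and a plain nn.MultiheadAttention stands in for the multi-scale deformable cross-attention used in Grounding DINO's DeformableTransformerDecoderLayer, which needs custom ops and reference points.

```python
# Illustrative sketch only; layer/argument names are assumptions, not the repo's API.
import torch
import torch.nn as nn


class VisualPromptEncoderBlock(nn.Module):
    """One block = cross-attention + self-attention + FFN (per this thread)."""

    def __init__(self, d_model=256, n_heads=8, d_ffn=2048, dropout=0.1):
        super().__init__()
        # Stand-in for the deformable cross-attention over image features.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        # Self-attention among the visual prompt queries.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        # Feed-forward network.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ffn),
            nn.ReLU(inplace=True),
            nn.Dropout(dropout),
            nn.Linear(d_ffn, d_model),
        )
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, queries, memory):
        # queries: (B, num_prompts, d_model); memory: (B, HW, d_model) image features
        queries = self.norm1(queries + self.cross_attn(queries, memory, memory)[0])
        queries = self.norm2(queries + self.self_attn(queries, queries, queries)[0])
        queries = self.norm3(queries + self.ffn(queries))
        return queries


class VisualPromptEncoder(nn.Module):
    """Stack of several such blocks (three, according to the answers above)."""

    def __init__(self, num_layers=3, d_model=256):
        super().__init__()
        self.layers = nn.ModuleList(VisualPromptEncoderBlock(d_model) for _ in range(num_layers))

    def forward(self, queries, memory):
        for layer in self.layers:
            queries = layer(queries, memory)
        return queries


# Usage example with dummy tensors.
encoder = VisualPromptEncoder(num_layers=3, d_model=256)
prompts = torch.randn(2, 5, 256)     # 5 visual prompt queries per image
features = torch.randn(2, 1024, 256)  # flattened image features
out = encoder(prompts, features)      # (2, 5, 256)
```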
