About Visual Prompt Encoder and Contrastive Alignment #85

hao416 · 2024-08-23T08:34:38Z

Hello, authors. I would like to ask two questiones. 1. How to deal with box query feature and point query feature after deformable cross-
attention, contact? 2. How to get corresponding text prompts embedding from [CLS] token output, such as "cat", "dog"

Mountchicken · 2024-08-24T09:57:26Z

Hi @hao416
Sorry for the late reply. During training, we will train box prompt and point prompt in different iterations, i.e. they will not be used at the same time. 2. CLIP will add a [CLS] token to the input sentence as default and we can extract the feature of [CLS] token at the output of CLIP.

hao416 · 2024-08-24T10:19:02Z

Thanks,dear author. But I have another question. Grounding dino contacts many labels as input sentence,  so trex2 uses the way? But I see that you said trex2 uese Phrase in the github issues. If you use sentence,  I don't know how to locate corresponding label embeddings from cls token, because its size is 1x 516. But if you use distinct phrases,  negative labels can't play a role. I'm sorry for my poor English. I expect your reply. Thanks again!

…

---Original--- From: "Qing ***@***.***> Date: Sat, Aug 24, 2024 17:57 PM To: ***@***.***>; Cc: ***@***.******@***.***>; Subject: Re: [IDEA-Research/T-Rex] About Visual Prompt Encoder andContrastive Alignment (Issue #85) Hi @hao416 Sorry for the late reply. During training, we will train box prompt and point prompt in different iterations, i.e. they will not be used at the same time. 2. CLIP will add a [CLS] token to the input sentence as default and we can extract the feature of [CLS] token at the output of CLIP. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: ***@***.***>

Mountchicken · 2024-08-24T10:26:01Z

Lets say we have four labels: a yellow dog, cat, person, a giant apple. We will pass these four phrases or category names to CLIP for four times, and get their corresponding text embeddings. Here is a brief example:

a yellow dog [CLS] -> CLIP -> [CLS]
cat [CLS] -> CLIP -> [CLS]
dog [CLS] -> CLIP -> [CLS]
a giant apple [CLS] -> CLIP -> [CLS]

We concat these four text embeddings to get a tensor of shape 4XC and use them for loss computation

hao416 · 2024-08-24T10:37:13Z

oh，thank，I understand it correctly! Sorry, I also have a question. In the paper, you say that model randomly selects from 1 to n gt boxes as visual prompts. Now, if I set batch size as 2, img1 and img2. From img1,model gets 3 visual prompts. From img2, model gets 5 visual prompts. So I pad img1 3 prompts to 5 prompts for batch operation or use  python for loop to operate it 2 times. Thanks

…

---Original--- From: "Qing ***@***.***> Date: Sat, Aug 24, 2024 18:26 PM To: ***@***.***>; Cc: ***@***.******@***.***>; Subject: Re: [IDEA-Research/T-Rex] About Visual Prompt Encoder andContrastive Alignment (Issue #85) Lets say we have four labels: a yellow dog, cat, person, a giant apple. We will pass these four phrases or category names to CLIP for four times, and get their corresponding text embeddings. Here is a brief example: a yellow dog [CLS] -> CLIP -> [CLS] cat [CLS] -> CLIP -> [CLS] dog [CLS] -> CLIP -> [CLS] a giant apple [CLS] -> CLIP -> [CLS] We concat these four text embeddings to get a tensor of shape 4XC and use them for loss computation — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: ***@***.***>

Mountchicken · 2024-08-24T10:46:52Z

indeed, we need to pad image1 to 5 prompts

hao416 · 2024-08-24T10:48:51Z

OK，thanks for your replies. All of you did a great job. Best wishes!

…

---Original--- From: "Qing ***@***.***> Date: Sat, Aug 24, 2024 18:47 PM To: ***@***.***>; Cc: ***@***.******@***.***>; Subject: Re: [IDEA-Research/T-Rex] About Visual Prompt Encoder andContrastive Alignment (Issue #85) indeed, we need to pad image1 to 5 prompts — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: ***@***.***>

hao416 · 2024-08-26T02:44:50Z

Thanks,dear author, I'm sorry that I may have last two questions. Here, I give an example. I get 2 "cat" prompts and 3 "dog" prompts. Question 1: in visual prompt encoder, K means total number of visual prompts(5) or categories(2). I see that in contrastive  loss, you said K means categories numbers in github issue. Question 2: visual prompts.will be used as weights in class predictions, so do I need to get a mean prompts of 2 cat prompts and a mean prompts of 3 dog prompts so that model makes sure get 2 class predictions. Thanks.

…

---Original--- From: "Qing ***@***.***> Date: Mon, Aug 26, 2024 10:28 AM To: ***@***.***>; Cc: ***@***.******@***.***>; Subject: Re: [IDEA-Research/T-Rex] About Visual Prompt Encoder andContrastive Alignment (Issue #85) Closed #85 as completed. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: ***@***.***>

Mountchicken · 2024-08-26T03:10:04Z

Q1: K is the number of categories, and in your case K = 2.
Q2: If the 2 cats or 3 dogs are from one image, they will be 'averaged' by taking the aggregator token as output. If they are from different images, they will be averaged by calculating the mean value

hao416 · 2024-08-26T03:22:16Z

Ok, so you mean that if batch siz is 1, I only need to use aggregator token, namely a universal class token C' in your paper, as class prediction weighs. If batch size ＞1, I need to get every C' token and calculate mean values as final prediction weights. Right?

…

---Original--- From: "Qing ***@***.***> Date: Mon, Aug 26, 2024 11:10 AM To: ***@***.***>; Cc: ***@***.******@***.***>; Subject: Re: [IDEA-Research/T-Rex] About Visual Prompt Encoder andContrastive Alignment (Issue #85) Q1: K is the number of categories, and in your case K = 2. Q2: If the 2 cats or 3 dogs are from one image, they will be 'averaged' by taking the aggregator token as output. If they are from different images, they will be averaged by calculating the mean value — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: ***@***.***>

Mountchicken · 2024-08-26T03:31:43Z

During the training process, we only need to use the aggregator, and this is independent of batch size. This is because, during training, we generate prompts only within the same image, meaning that the embeddings for objects like dogs and cats are used only within the current image. However, during inference, we can obtain an embedding from multiple images. For example, if we have two images, each with three dogs, we would first use the aggregator to extract the prompts for the three dogs in each image to obtain their respective embeddings. Then, we average the embeddings obtained from these two images to get the final embeddings.

hao416 · 2024-08-26T03:59:21Z

Ok, thanks, author, I understand your reply, now I'm reproducing this work, so that I'm sorry I have many problems about details. Finally, I have some points unclear combined with your replies. I also give an example : 2 cat objects and 3 dog objects in an image.   1. In paper, you say " we randomly choose between one to all available GT boxes to use as.visual prompts". Now I suppose that I get 2 cat boxes and 2 dog boxes to generate visual prompts. In visual encoder,  K means category numbers, so it means I need to sample a prompt randomly for each category(cat and dog) once again or all 4 prompts as inputs.  2. K is a fixed hyperparameter?  3. The learnable content embedding is broadcasted K times to KxD. I can't understand this "broadcast" clearly, it means original dimension of content embedding is 1xD?

…

---Original--- From: "Qing ***@***.***> Date: Mon, Aug 26, 2024 11:32 AM To: ***@***.***>; Cc: ***@***.******@***.***>; Subject: Re: [IDEA-Research/T-Rex] About Visual Prompt Encoder andContrastive Alignment (Issue #85) During the training process, we only need to use the aggregator, and this is independent of batch size. This is because, during training, we generate prompts only within the same image, meaning that the embeddings for objects like dogs and cats are used only within the current image. However, during inference, we can obtain an embedding from multiple images. For example, if we have two images, each with three dogs, we would first use the aggregator to extract the prompts for the three dogs in each image to obtain their respective embeddings. Then, we average the embeddings obtained from these two images to get the final embeddings. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: ***@***.***>

Mountchicken · 2024-08-27T02:01:37Z

Given an image, if there are M categories, we will finally get M visual prompt embeddings for each category.
K is not a hyperparameter. It's the number of categories in current image. If you are using batch training, K will be the largest number of categories in this batch
The content embedding is in 1XD dimension and will be copied for M times to get a MXD tensor.

hao416 · 2024-08-27T02:21:19Z

OK, I read the paper and your replies again and I have understanded answer 1 and 2. Lastly, I want to make sure the form of the content embedding. In my codes, I set the content embedding = nn.Embedding(1, 256). 1. I get final vector from outputs after (msdeformattn->self attn ->ffn), namely query[:, -1, :] 2. I only copy (content embedding.weight) M times after (msdeformattn->self attn ->ffn). Thanks

Mountchicken · 2024-08-27T03:35:20Z

Here is an example. Say there are three boxes selected to get the visual prompt embedding for dog. Then you will first broadcast the content embedding for three times, and concat it with the aggregator. This will get you a 4x256 tensor. Together with position embeddings, they will pass through deform -> self attn -> ffn. And lastly, the output at the aggregator position will be used as the final visual prompt embedding.

hao416 · 2024-08-27T03:37:48Z

Ok, I got it. Thank you very much!!!

hao416 · 2024-08-28T07:35:10Z

Dear author, I see that in grounding dino, it deals with category_id. Given an example, a image has two categories: cat and dog, and cat's id is 4 and dog's id is 5 in the dataset,. Grounding dino sorts them again from 0 so that cat->0, dog->1. Do you use the same way?

hao416 · 2024-09-03T02:41:25Z

Dear author, I want to know how you train your model. In the table 6 of the paper, you train your model on these datasets one by one or contact them to a more large dataset.

Mountchicken · 2024-09-03T06:54:09Z

Dear author, I want to know how you train your model. In the table 6 of the paper, you train your model on these datasets one by one or contact them to a more large dataset.

We concatenate those datasets into one for training.

Mountchicken · 2024-09-03T06:55:45Z

Dear author, I see that in grounding dino, it deals with category_id. Given an example, a image has two categories: cat and dog, and cat's id is 4 and dog's id is 5 in the dataset,. Grounding dino sorts them again from 0 so that cat->0, dog->1. Do you use the same way?

We don't have a special process for the category id and we simply reuse the original id in its dataset.

hao416 · 2024-09-03T07:12:54Z

Dear author, I see that in grounding dino, it deals with category_id. Given an example, a image has two categories: cat and dog, and cat's id is 4 and dog's id is 5 in the dataset,. Grounding dino sorts them again from 0 so that cat->0, dog->1. Do you use the same way?

We don't have a special process for the category id and we simply reuse the original id in its dataset.

Ok, thanks. I notice that you use denoising training in the paper which is associated with class_num and the original id in its dataset. You know, dino's label_enc=nn.embedding(dn_labelbook_size + 1, hidden_dim). Suppose I have 2 datasets A(10 categories) and B(20 categories), do you set label_enc=nn.embedding(30 + 1, hidden_dim)? And then, if in A, id 1 is person, and in B id 1 is table, how do you deal with it? Fuse 2 datasets and sort these categories from 0 to 29? Thanks

Mountchicken · 2024-09-03T07:18:11Z

Since in the open-set task, we can not pre-assign ID to all the object categories in our datasets, so we do not compute the classification dn loss but only the box noise loss.

hao416 · 2024-09-03T08:04:09Z

Since in the open-set task, we can not pre-assign ID to all the object categories in our datasets, so we do not compute the classification dn loss but only the box noise loss.

ok, thanks

hao416 · 2024-09-03T13:19:51Z

sorry, I have another question. Features of [CLS] token in your text model is features of [EOS] token in original CLIP paper?

Mountchicken · 2024-09-04T14:04:01Z

Yes. If your are using CLIP from huggingface, then you can get the [CLS] token like this:

model = CLIPTextModel.from_pretrained(pretrained_name)
outputs = model(**inputs)
pooled_feature = outputs.pooler_output

hao416 · 2024-09-04T14:12:23Z

OK, thank you very much.You help me a lot.

…

---Original--- From: "Qing ***@***.***> Date: Wed, Sep 4, 2024 22:04 PM To: ***@***.***>; Cc: ***@***.******@***.***>; Subject: Re: [IDEA-Research/T-Rex] About Visual Prompt Encoder andContrastive Alignment (Issue #85) Yes. If your are using CLIP from huggingface, then you can get the [CLS] token like this: model = CLIPTextModel.from_pretrained(pretrained_name) outputs = model(**inputs) pooled_feature = outputs.pooler_output — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: ***@***.***>

hao416 · 2024-09-05T02:33:14Z

Dear author, in visual prompt encoder, I define parameters of box and point prompts. I notice that you say you train box prompt and point prompt in different iterations, but I have problems with torch when I use multiple gpus. It shows that some parameters do not receive gradients. So my question is that I need to freeze some weights in different iterations? Thanks

Mountchicken · 2024-09-05T11:11:18Z

Hi @hao416
There are two solutions. The first one is to set find_unused_parameters=True. Here is an example

model = torch.nn.parallel.DistributedDataParallel(
            model,
            device_ids=[args.gpu],
            find_unused_parameters=True)

The second one is to add the parameter of the unused module to the computation. Here is an example

box_embedding_layer = nn.Linear(4, 256)
point_embedding_layer = nn.Linear(2,256)
# for box iteration
embedding = box_embedding_layer(box)
for param in point_embedding_layer.parameters():
     embedding = embedding+ param.sum() * 0.0
# for point iteration
embedding = point_embedding_layer(point)
for param in box_embedding_layer.parameters():
     embedding = embedding+ param.sum() * 0.0

hao416 · 2024-09-05T11:18:42Z

Thanks, I got it.And I search answers in the internet, I find that it can freeze weights in different iterations. Additionally, I ever tried  the first solution but it can't work well.Thank you again.

…

---Original--- From: "Qing ***@***.***> Date: Thu, Sep 5, 2024 19:11 PM To: ***@***.***>; Cc: ***@***.******@***.***>; Subject: Re: [IDEA-Research/T-Rex] About Visual Prompt Encoder andContrastive Alignment (Issue #85) Hi @hao416 There are two solutions. The first one is to set find_unused_parameters=True. Here is an example model = torch.nn.parallel.DistributedDataParallel( model, device_ids=[args.gpu], find_unused_parameters=True) The second one is to add the parameter of the unused module to the computation. Here is an example box_embedding_layer = nn.Linear(4, 256) point_embedding_layer = nn.Linear(2,256) # for box iteration embedding = box_embedding_layer(box) for param in point_embedding_layer.parameters(): embedding = embedding+ param.sum() * 0.0 # for point iteration embedding = point_embedding_layer(point) for param in box_embedding_layer.parameters(): embedding = embedding+ param.sum() * 0.0 — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: ***@***.***>

CatfishW · 2024-09-18T18:45:31Z

def visual_prompt_cross_attention(self, support_feat,memory, query_mask_flatten):
Q = self.content_embedding.weight[None, :]
#expand to the same size as support_feat
Q = Q.expand(support_feat.shape[0], support_feat.shape[1], support_feat.shape[2])
Q_ = self.cross_attention_vp(self.with_pos_embed(Q.transpose(0,1),support_feat.transpose(0,1)), memory.transpose(0,1), memory.transpose(0,1),query_mask_flatten)[0].transpose(0,1)
Q = Q + self.cross_attention_vp_dropout(Q_)
Q = self.cross_attention_vp_norm(Q)
q = k = self.with_pos_embed(Q, support_feat)
Q_, _ = self.self_attn(q, k, value=Q, attn_mask=None)
Q = Q + self.dropout_post(Q_)
support_feat = self.norm_post(Q)
return support_feat
作者大大好，我照着你的结构复现了一部分内容，可以麻烦帮忙看看这个关于cross attention提取提示特征的函数写的对吗

CatfishW · 2024-09-18T18:46:23Z

def visual_prompt_cross_attention(self, support_feat,memory, query_mask_flatten): Q = self.content_embedding.weight[None, :] #expand to the same size as support_feat Q = Q.expand(support_feat.shape[0], support_feat.shape[1], support_feat.shape[2]) Q_ = self.cross_attention_vp(self.with_pos_embed(Q.transpose(0,1),support_feat.transpose(0,1)), memory.transpose(0,1), memory.transpose(0,1),query_mask_flatten)[0].transpose(0,1) Q = Q + self.cross_attention_vp_dropout(Q_) Q = self.cross_attention_vp_norm(Q) q = k = self.with_pos_embed(Q, support_feat) Q_, _ = self.self_attn(q, k, value=Q, attn_mask=None) Q = Q + self.dropout_post(Q_) support_feat = self.norm_post(Q) return support_feat 作者大大好，我照着你的结构复现了一部分内容，可以麻烦帮忙看看这个关于cross attention提取提示特征的函数写的对吗

Mountchicken · 2024-09-20T07:30:12Z

@CatfishW
Sorry for the late reply. The implementation looks fine to me. As for detailed implementation, you can refer to this code:
https://github.com/IDEA-Research/GroundingDINO/blob/main/groundingdino/models/GroundingDINO/transformer.py#L802

hao416 · 2024-10-18T08:09:27Z

Dear author, I have a question that do you freeze weights of the text prompt encoder during training, thanks

Mountchicken · 2024-10-18T08:41:39Z

Hi @hao416
We don't freeze the CLIP text encoder during training.

hao416 · 2024-10-18T09:07:36Z

Hi @hao416 We don't freeze the CLIP text encoder during training.

OK，thanks for your reply, but I have two questiones:

Recent works show that if models do not freeze CLIP text encoder, it may perturbmodel weights and interfere final performance. So do you study corresponding impacts?
I notice that you ever mentioned 8 epoches for visual prompt and 1 epoch for text prompt. Now I try to reproduce your model but I change this setting by 4 epoches for visual prompt and 1 epoch for text prompt limited by the number of GPU devices. I find that results of text prompts can not be improved based on that of visual prompt. For example, I assume mAP is 18.0 after 4 epoches with visual prompts, but mAP may be 11.0 after first epoch with text prompts, namely 5th epoch in total numbers. It seems like that the whole model is trained from scratch. Is this normal？How many epoches you use?
Thanks

Mountchicken · 2024-10-18T10:03:59Z

We tried freezing the clip and fine-tuning the clip, and found that there was no particular difference between the two, and that fine-tuning performs a little better.
We are training with 8 iterations of visual prompts and then one iteration of text prompts instead of epoch.

hao416 · 2024-10-18T10:31:04Z

2. s

OK, I misunderstood it before and I got it now. So I only need to choose specific prompts such as visual or text prompts to get final detection results at test time, right?

Mountchicken · 2024-10-18T10:32:32Z

Yes. During inference you can you either text prompt or visual prompt

hao416 · 2024-10-18T10:34:52Z

OK,thank you very much

…

---Original--- From: "Qing ***@***.***> Date: Fri, Oct 18, 2024 18:32 PM To: ***@***.***>; Cc: ***@***.******@***.***>; Subject: Re: [IDEA-Research/T-Rex] About Visual Prompt Encoder andContrastive Alignment (Issue #85) Yes. During inference you can you either text prompt or visual prompt — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: ***@***.***>

hao416 · 2024-10-22T01:53:17Z

Yes. During inference you can you either text prompt or visual prompt

Dear author, I want to consult a question about training. I give an example to demonstrate it : You mentioned o365 /Goldg datesets for text prompt traing and o365/openimages for visual prompt training in paper. The question is that when train for text prompts in o365, the model has missed the first 8 iterations images for text prompt traing because they are trained for visual prompt traing. I want to know how to cope with different iterations in one forward process.

In dino framework, how to deal with it in for loop, namely :
for samples, targets in metric_logger.log_every(data_loader, print_freq, header, logger=logger):
xxxxxxxxxxxxxxxxx

thank you.

Mountchicken · 2024-10-22T02:49:14Z

In the actual code, we define two data loaders: one for the text prompt, assumed to be text_loader, and another for the visual prompt, assumed to be visual_loader. After every 8 iterations of the visual_loader, we iterate once over the text_loader. The implementation can be done in the following way:

iter = 0
for visual_batch in visual_loader:
      loss = model(visual_batch)
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()
      
      if iter % 8 == 0:
          text_batch = next(text_loader)
          loss = model(text_batch)
          optimizer.zero_grad()
          loss.backward()
          optimizer.step()
      iter += 1

hao416 · 2024-10-22T03:12:15Z

In the actual code, we define two data loaders: one for the text prompt, assumed to be text_loader, and another for the visual prompt, assumed to be visual_loader. After every 8 iterations of the visual_loader, we iterate once over the text_loader. The implementation can be done in the following way:
iter = 0
for visual_batch in visual_loader:
      loss = model(visual_batch)
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()
      
      if iter % 8 == 0:
          text_batch = next(text_loader)
          loss = model(text_batch)
          optimizer.zero_grad()
          loss.backward()
          optimizer.step()
      iter += 1

okok, thank you. But like o365 dateset, both for visual and text prompt training, it needs to be used in text_loader and visual_loader?

Mountchicken closed this as completed Aug 26, 2024

Mountchicken reopened this Sep 4, 2024

fuweifu-vtoo mentioned this issue Sep 12, 2024

About the Visual Prompt #87

Closed

pisiguiii mentioned this issue Sep 23, 2024

About Visual Prompt Encoder. #83

Open

About Visual Prompt Encoder and Contrastive Alignment #85

About Visual Prompt Encoder and Contrastive Alignment #85

Comments

hao416 commented Aug 23, 2024

Mountchicken commented Aug 24, 2024

hao416 commented Aug 24, 2024 via email

Mountchicken commented Aug 24, 2024

hao416 commented Aug 24, 2024 via email

Mountchicken commented Aug 24, 2024

hao416 commented Aug 24, 2024 via email

hao416 commented Aug 26, 2024 via email

Mountchicken commented Aug 26, 2024

hao416 commented Aug 26, 2024 via email

Mountchicken commented Aug 26, 2024

hao416 commented Aug 26, 2024 via email

Mountchicken commented Aug 27, 2024

hao416 commented Aug 27, 2024 • edited Loading

Mountchicken commented Aug 27, 2024

hao416 commented Aug 27, 2024

hao416 commented Aug 28, 2024

hao416 commented Sep 3, 2024

Mountchicken commented Sep 3, 2024

Mountchicken commented Sep 3, 2024

hao416 commented Sep 3, 2024 • edited Loading

Mountchicken commented Sep 3, 2024

hao416 commented Sep 3, 2024

hao416 commented Sep 3, 2024

Mountchicken commented Sep 4, 2024

hao416 commented Sep 4, 2024 via email

hao416 commented Sep 5, 2024

Mountchicken commented Sep 5, 2024

hao416 commented Sep 5, 2024 via email

CatfishW commented Sep 18, 2024

CatfishW commented Sep 18, 2024

Mountchicken commented Sep 20, 2024

hao416 commented Oct 18, 2024

Mountchicken commented Oct 18, 2024

hao416 commented Oct 18, 2024

Mountchicken commented Oct 18, 2024

hao416 commented Oct 18, 2024

Mountchicken commented Oct 18, 2024

hao416 commented Oct 18, 2024 via email

hao416 commented Oct 22, 2024

Mountchicken commented Oct 22, 2024

hao416 commented Oct 22, 2024

hao416 commented Aug 27, 2024 •

edited

Loading

hao416 commented Sep 3, 2024 •

edited

Loading