
Questions about two stage training in PETR #11

Open
flyinglynx opened this issue Aug 5, 2022 · 6 comments

Comments

@flyinglynx

Thank you for sharing the code! I notice that PETR is set up as two-stage in the code, i.e., the top-K proposals from the encoder output are selected as the query embeddings as well as the initial reference points in the decoder. This is also very similar to the two-stage version of Deformable-DETR.

However, in section 3.3 of the paper, the authors mention that the query embeddings are randomly initialized and learned, which is not a two-stage approach. I wonder whether the reported results are from two-stage models or one-stage ones. Besides, how much improvement can the two-stage variant bring?
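(For readers unfamiliar with the two-stage scheme, here is a minimal sketch of the top-K proposal selection it refers to. The shapes, the scoring head, and the function name are assumptions for illustration only, not the actual PETR/opera code.)

```python
import torch
import torch.nn as nn

def select_topk_proposals(enc_memory: torch.Tensor, cls_head: nn.Module, k: int = 100):
    """enc_memory: (batch, num_tokens, dim) flattened encoder features.
    cls_head: a hypothetical head that scores each encoder token."""
    scores = cls_head(enc_memory).max(-1).values            # (batch, num_tokens)
    topk_idx = scores.topk(k, dim=1).indices                # (batch, k)
    # Gather the K highest-scoring encoder tokens; a two-stage model builds the
    # decoder's initial reference points (and, in Deformable-DETR, the query
    # embeddings themselves) from these proposals.
    topk_feats = torch.gather(
        enc_memory, 1,
        topk_idx.unsqueeze(-1).expand(-1, -1, enc_memory.size(-1)))
    return topk_feats, topk_idx

# Toy usage:
# feats, idx = select_topk_proposals(torch.randn(2, 1000, 256), nn.Linear(256, 1), k=100)
```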

@dae-sun

dae-sun commented Aug 5, 2022

In PETR, they used randomly initialized queries without query positional encoding from reference points. So, section 3.3 seems consistent with the code of this repo.

@dae-sun

dae-sun commented Aug 5, 2022

In contrast, two-stage Deformable-DETR embeds its queries from its initial bounding boxes (the encoder proposals).
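(As a rough illustration of that difference, the sketch below contrasts a Deformable-DETR-style derivation of queries from proposal boxes with learned, box-independent queries. The sinusoidal-encoding helper and layer names are assumptions, not code from either repo.)

```python
import torch
import torch.nn as nn

embed_dim = 256

def box_pos_encoding(boxes: torch.Tensor, dim: int = embed_dim) -> torch.Tensor:
    """Hypothetical sinusoidal encoding of (batch, k, 4) normalized boxes -> (batch, k, dim)."""
    freqs = torch.arange(dim // 8, device=boxes.device, dtype=boxes.dtype)
    freqs = 10000 ** (-freqs / (dim // 8))
    angles = boxes.unsqueeze(-1) * freqs                     # (batch, k, 4, dim//8)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)    # (batch, k, 4, dim//4)
    return enc.flatten(-2)                                   # (batch, k, dim)

# Two-stage Deformable-DETR style: both the query content and the query
# positional embedding are projected from the encoding of the top-K proposal boxes.
pos_trans = nn.Linear(embed_dim, 2 * embed_dim)

def queries_from_proposals(topk_boxes: torch.Tensor):
    query_pos, query = pos_trans(box_pos_encoding(topk_boxes)).chunk(2, dim=-1)
    return query, query_pos

# PETR style as described in section 3.3: queries are a learned embedding,
# independent of any proposal boxes.
learned_queries = nn.Embedding(100, embed_dim)
```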

@flyinglynx
Author

In PETR, they used randomly initialized queries without query positional encoding from reference points. So, section 3.3 seems consistent with the code of this repo.

Thank you for your answer! I checked the code, and I think the two-stage mode (the default setting in the code) means that the initial reference points for the decoder are initialized from the top 100 proposals, while the query embedding vectors are still randomly initialized.

I checked the code in opera/models/utils/transformer.py, lines 856-908: encoder features with high confidence are selected as proposals, and K keypoint coordinates are predicted for each of them. These keypoints are used as the initial reference points in the decoder (note that the deformable cross-attention uses 17 reference points). Hence, I think there is still some difference from section 3.3, where the locations of the initial reference points are randomly initialized and learned. I am a little curious how much improvement this modification can bring; actually, such a setting is quite reasonable.
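(To make that reading concrete, here is a minimal sketch of it, assuming simplified shapes and hypothetical layer names rather than the actual opera code: the reference points come from keypoints predicted on the top proposals, while the content queries remain a learned embedding.)

```python
import torch
import torch.nn as nn

num_queries, embed_dim, num_keypoints = 100, 256, 17

# Content queries: a learned embedding, not derived from the encoder output.
query_embed = nn.Embedding(num_queries, embed_dim)

# Hypothetical head that regresses 17 (x, y) keypoints per proposal.
kpt_head = nn.Linear(embed_dim, num_keypoints * 2)

def init_decoder_inputs(topk_proposal_feats: torch.Tensor):
    """topk_proposal_feats: (batch, 100, embed_dim) top-scoring encoder tokens."""
    batch = topk_proposal_feats.size(0)
    # Keypoints predicted from the proposals become the decoder's initial
    # reference points for the deformable cross-attention (17 per query).
    ref_points = kpt_head(topk_proposal_feats).sigmoid()
    ref_points = ref_points.view(batch, num_queries, num_keypoints, 2)
    # The query content vectors are NOT taken from the proposals.
    queries = query_embed.weight.unsqueeze(0).expand(batch, -1, -1)
    return queries, ref_points
```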

This setting is very close to the recent DINO, which only uses positional information from the encoder proposals and randomly initialized content vectors for the queries. DINO reports that this yields better performance.

@dae-sun

dae-sun commented Aug 5, 2022

Hence, I think there is still some difference from section 3.3, where the locations of the initial reference points are randomly initialized and learned.
-> Sorry, I checked it. In the paper, the initial reference point P0 is a randomly initialized matrix that is jointly updated with the model parameters during training. I also think it's weird.

This setting is very close to the recent DINO, which only uses positional information from the encoder proposals and randomly initialized content vectors for the queries. DINO reports that this yields better performance.
-> DINO uses mixed query selection, which takes the initial reference points as the content queries and uses randomly initialized positional encodings, while this repo sets randomly initialized values as the content queries as well.

Thank you for your feedback :)

@dae-sun

dae-sun commented Aug 5, 2022

I checked the code, and I think the two-stage mode (the default setting in the code) means that the initial reference points for the decoder are initialized from the top 100 proposals, while the query embedding vectors are still randomly initialized.

-> Yes, I think so too!

@flyinglynx
Author

DINO uses mixed query selection, which takes the initial reference points as the content queries and uses randomly initialized positional encodings, while this repo sets randomly initialized values as the content queries as well.

-> I have not finished reading DINO's code yet, but I think the idea of only passing the positional information of the proposals is quite similar here. Thank you very much!
