This repository has been archived by the owner on Aug 1, 2024. It is now read-only.

Downstream task #51

Open
ankan8145 opened this issue Oct 29, 2023 · 5 comments

@ankan8145

After training the model, can we use only the target-encoder for downstream tasks, e.g. image captioning?

@VimukthiRandika1997

You can use the encoder and not the target encoder for the task, because during training we train the encoder model to predict the masked regions given an unmasked context. Hence using the encoder would be the choice!
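
Roughly, the pretraining objective looks like the sketch below (illustrative only: module names, shapes, and the exact loss are my assumptions, not the repo's actual API):

```python
import torch
import torch.nn.functional as F

def ijepa_step(context_encoder, target_encoder, predictor, patches, ctx_idx, tgt_idx):
    """One conceptual I-JEPA step: predict target-encoder embeddings of masked
    patches from the visible (context) patches. All modules are placeholders."""
    with torch.no_grad():
        # The target encoder sees the full image; its weights are an EMA copy of
        # the context encoder and receive no gradients.
        targets = target_encoder(patches)[:, tgt_idx]        # [B, M, D]

    context = context_encoder(patches[:, ctx_idx])            # [B, N_ctx, D]
    preds = predictor(context, tgt_idx)                       # [B, M, D]

    # Regression loss in representation space (plain L2 here for illustration).
    return F.mse_loss(preds, targets)
```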

@FalsoMoralista

@VimukthiRandika1997 I was thinking about this in a similar way, although the paper says "We use the target-encoder for evaluation and average pool its output to produce a global image representation." You can check this in the first paragraph of the paper's appendix (A.1. Pretraining). Could you please take a look at this?
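
For reference, that sentence corresponds to something like the following (a minimal sketch; the assumption that the encoder returns patch tokens of shape [B, N, D] is mine, not the repo's documented interface):

```python
import torch

@torch.no_grad()
def global_representation(target_encoder, images):
    """Average-pool the target encoder's patch tokens into one vector per image,
    as described in appendix A.1 of the paper."""
    tokens = target_encoder(images)   # [B, N, D] patch-level representations (assumed shape)
    return tokens.mean(dim=1)         # [B, D] global image representation, e.g. for a linear probe
```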

@VimukthiRandika1997

> @VimukthiRandika1997 I was thinking about this in a similar way, although the paper says "We use the target-encoder for evaluation and average pool its output to produce a global image representation." You can check this in the first paragraph of the paper's appendix (A.1. Pretraining). Could you please take a look at this?

Yeah, I looked into that. I think in this case it makes sense to use the target encoder for evaluation. The main reason might be that the target encoder can learn all possible semantics within images given some image context (blocks). On the other hand, the context encoder only learns how to represent the given image context.

I was mainly inspired by a previous approach called BYOL, where the online encoder (similar to the context encoder) is used after training. We can try out the context encoder and see the results as well, since both checkpoints are available!
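
Something like this should make the comparison easy (a sketch only; the 'encoder' / 'target_encoder' key names are what I'd expect in the pretraining checkpoint, but worth double-checking against the save logic in the repo):

```python
import torch

# context_encoder and target_encoder are assumed to be ViT modules built with the
# same config used for pretraining.
ckpt = torch.load('ijepa_checkpoint.pth.tar', map_location='cpu')  # illustrative filename

context_encoder.load_state_dict(ckpt['encoder'])          # context encoder weights
target_encoder.load_state_dict(ckpt['target_encoder'])    # EMA target encoder weights

# Extract frozen features from each encoder and train e.g. a linear probe on top
# to compare downstream performance.
```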

@FalsoMoralista

FalsoMoralista commented Apr 10, 2024

> Yeah, I looked into that. I think in this case it makes sense to use the target encoder for evaluation. The main reason might be that the target encoder can learn all possible semantics within images given some image context (blocks). On the other hand, the context encoder only learns how to represent the given image context.
>
> I was mainly inspired by a previous approach called BYOL, where the online encoder (similar to the context encoder) is used after training. We can try out the context encoder and see the results as well, since both checkpoints are available!

That makes a lot of sense, really nice intuition. They actually do test both approaches for reconstruction (also shown in the appendix), but personally I couldn't find the conclusions visually intuitive.

PS: some other folks and I are getting together to reproduce some of the experiments, mess around with the architecture, etc. If you want to join, add me on Discord: falsomoralista.

@Spidartist

Hello @FalsoMoralista, I'm currently interested in pretraining IJEPA and finetuning the pretrained model on a semantic segmentation task. Can I join you?
This is my Discord info: spidartist
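
For context, the kind of setup I have in mind is roughly the following (just a sketch; the head design, names, and shapes are my own assumptions, not an established recipe):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSegHead(nn.Module):
    """Toy segmentation head over ViT patch tokens (illustrative only)."""
    def __init__(self, dim, num_classes, grid_size=14, patch_size=16):
        super().__init__()
        self.grid_size = grid_size                      # 14 for 224px images with 16px patches
        self.patch_size = patch_size
        self.classifier = nn.Conv2d(dim, num_classes, kernel_size=1)

    def forward(self, tokens):                          # tokens: [B, N, D] from the pretrained encoder
        b, n, d = tokens.shape
        feat = tokens.transpose(1, 2).reshape(b, d, self.grid_size, self.grid_size)
        logits = self.classifier(feat)                  # [B, num_classes, grid, grid]
        return F.interpolate(logits, scale_factor=self.patch_size,
                             mode='bilinear', align_corners=False)
```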
