This repository has been archived by the owner on Aug 1, 2024. It is now read-only.

Training from scratch #55

Open
Ugenteraan opened this issue Mar 10, 2024 · 13 comments

Comments

@Ugenteraan

Hey everyone!

First off, thanks for the great work. I implemented my own version of I-JEPA (https://github.com/Ugenteraan/I-JEPA) by referencing this repository.

I used the Doges 77 Breeds (https://www.kaggle.com/datasets/madibokishev/doges-77-breeds) dataset for training. The loss goes down convincingly during the SSL training. However, in the downstream task, when I load the pre-trained encoder weights and use linear probing, the accuracy is no better than with randomly initialized encoder weights.

Does anyone have a clue about what might be causing this?

Thanks in advance! Cheers.

@FalsoMoralista

FalsoMoralista commented Apr 8, 2024

Did you check the paper's appendix? I think you could find more intuition there on these things, especially linear probing.
Did you perform average pooling on the encoder output (see the sketch below)? (i.e., "We use the target-encoder for evaluation and average pool its output to produce a global image representation.")
What is the dimensionality of the encoder output embeddings that you are using? After pooling the target-encoder output I ended up with (batch_size, 1280), but I have seen people using (batch_size, 256) and I'm still not sure which one is appropriate (similar issue).
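
For reference, here is a minimal sketch of the average-pooling + linear-probing setup I mean. All names (`target_encoder`, the 1280-dimensional ViT-H embedding, the 77 Doges classes) are placeholders/assumptions, not the repository's actual API:

```python
import torch
import torch.nn as nn

# Minimal linear-probing sketch, assuming `target_encoder` is a frozen ViT that
# returns patch tokens of shape (batch_size, num_patches, embed_dim),
# e.g. embed_dim = 1280 for ViT-H.
embed_dim, num_classes = 1280, 77            # 77 classes in Doges 77 Breeds
probe = nn.Linear(embed_dim, num_classes)    # only this layer is trained

def probe_logits(target_encoder, images):
    target_encoder.eval()
    with torch.no_grad():                    # keep the pre-trained encoder frozen
        tokens = target_encoder(images)      # (B, N, D) patch embeddings
    pooled = tokens.mean(dim=1)              # (B, D) global image representation
    return probe(pooled)                     # (B, num_classes) logits for the probe
```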

I was having a similar issue in which the loss wasn't decreasing; then I realized that I was initializing the optimizer with the model parameters before adding the linear head, so the parameters related to classification weren't being tracked by the optimizer. With that solved, I'm still struggling to train it in a supervised fashion.
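
A toy sketch of that pitfall (illustrative modules only, not the actual repo code):

```python
import torch
import torch.nn as nn

encoder = nn.Linear(1280, 1280)   # stand-in for the pre-trained encoder
head = nn.Linear(1280, 77)        # classification head added afterwards

# Buggy order: the optimizer only ever sees the encoder's parameters,
# so the head's weights never receive updates.
optimizer = torch.optim.AdamW(encoder.parameters(), lr=1e-3)
model = nn.Sequential(encoder, head)

# Fixed order: assemble the full model first, then hand *all* parameters to the optimizer.
model = nn.Sequential(encoder, head)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
```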

I would suggest that we get together on a communication channel such as Discord to share progress on this stuff.

@Ugenteraan
Author

Hey @FalsoMoralista thanks for the comment! Sure, let's take this to discord. My handle is johnweak15. Do add me there!

@bdytx5

bdytx5 commented Jun 27, 2024

Were you all able to solve this? - Brett

@FalsoMoralista

FalsoMoralista commented Jul 1, 2024

@bdytx5 Yes, we did. @lazarosgogos was also able to conduct some insightful experiments with it. What did you want to know specifically?

@bdytx5

bdytx5 commented Jul 1, 2024

Well, I tried pre-training I-JEPA on CIFAR-10 and then fine-tuning the pre-trained model on CIFAR-10, using the labels during fine-tuning (with just the target encoder). I compared the fine-tuning to a randomly initialized model, and the results seemed to be the same. I averaged the output embeddings of the last layer. Does this seem strange? Note I just used the tiny_vit.
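
Roughly, my setup looks like the sketch below. It assumes the pre-training checkpoint stores the weights under a 'target_encoder' key and that `build_vit_tiny()` (a hypothetical helper) rebuilds the same architecture used during pre-training:

```python
import torch

def make_encoders(ckpt_path, build_vit_tiny, device='cpu'):
    # Pre-trained target encoder loaded from the I-JEPA checkpoint.
    pretrained = build_vit_tiny()
    state = torch.load(ckpt_path, map_location=device)['target_encoder']
    state = {k.replace('module.', ''): v for k, v in state.items()}  # strip DDP prefix if present
    pretrained.load_state_dict(state)

    # Randomly initialized copy used as the baseline for comparison.
    random_init = build_vit_tiny()
    return pretrained, random_init
```

Both encoders are then fine-tuned with labels on CIFAR-10, with a linear head on top of the averaged last-layer embeddings.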

@FalsoMoralista

FalsoMoralista commented Jul 2, 2024

Curious! For how many epochs did you pre-train on CIFAR-10?

@bdytx5

bdytx5 commented Jul 2, 2024

Around 10 or so, as after that the training and validation loss began to rise.

@lazarosgogos

If you've left the config file untouched, there is most likely a warmup period of e.g. 40 epochs out of the 300 in total. The loss going up after some epochs (it depends on your configuration and the total number of epochs) is normal behavior, as mentioned in #41.

Try letting your model train for at least 50-60 epochs (with appropriate changes in the configuration, e.g. as sketched below) and then try a downstream task. In the early epochs the model doesn't learn semantic representations of the data, even though the loss seems to go down (I've tested this personally).
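
Something like this, assuming the reference YAML layout where the schedule sits under an 'optimization' block with 'epochs' and 'warmup' keys (check the key names in your own config file):

```python
import yaml

# Load the config, shorten the run, and write it back out for a quick experiment.
with open('configs/my_config.yaml') as f:          # placeholder path
    cfg = yaml.safe_load(f)

cfg['optimization']['epochs'] = 60                 # e.g. the 50-60 epochs suggested above
cfg['optimization']['warmup'] = 10                 # scale the warmup down accordingly

with open('configs/my_config_short.yaml', 'w') as f:
    yaml.safe_dump(cfg, f)
```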

Once I get to test the ViT-tiny and ViT-small models, I will get back with the differences.

@bdytx5

bdytx5 commented Jul 4, 2024

Ah, I overlooked this. Good catch

@akshayneema

@lazarosgogos Did you get to test ViT-small models? It'd be really helpful if you could share the working configuration for those. Thanks.

@lazarosgogos


@akshayneema It heavily depends on what resources (e.g. GPUs) you have at hand. The more VRAM you have, the bigger the model you can load. The bigger the images you use, the more VRAM you'll need.

For example, to train on ImageNet images with a ViT-small model on 16 GB of VRAM, I was able to load at most a batch of 60 images per iteration (the rest of the config was untouched).

@akshayneema

Thanks for the reply @lazarosgogos

Can you also share what the results were like for you using ViT-small? Were they competitive with ViT-H or ViT-G? Did you also change the predictor model architecture to suit the ViT-small architecture?

I am currently training on a single GeForce RTX 3090 GPU with a batch size of 32. I am using UMAP to visualise the embeddings generated by the target-encoder, and they do not look that great.
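
In case it helps spot a mistake, this is roughly what I'm doing for the visualisation (all names are placeholders for my own objects; the average pooling over patch tokens is an assumption on my side):

```python
import torch
import umap                      # umap-learn
import matplotlib.pyplot as plt

@torch.no_grad()
def embed_dataset(target_encoder, dataloader, device='cuda'):
    # Average-pool the target-encoder patch tokens into one vector per image.
    target_encoder.eval().to(device)
    feats, labels = [], []
    for images, y in dataloader:
        tokens = target_encoder(images.to(device))   # (B, N, D) patch embeddings
        feats.append(tokens.mean(dim=1).cpu())       # (B, D) pooled representation
        labels.append(y)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

def plot_umap(feats, labels):
    # Project the pooled embeddings to 2D and colour the points by class label.
    coords = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(feats)
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=3, cmap='tab20')
    plt.title('Target-encoder embeddings (UMAP)')
    plt.show()
```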

@lazarosgogos

@akshayneema The results using ViT-small were not competitive with ViT-huge or ViT-Giant, not even close. The difference in some linear probing tasks was immense (>30%).

The point of using ViT-small or ViT-base is mostly, in my opinion, to run tests and see how the model performs, in order to then train a ViT-Huge for final results.

I did not touch the architecture of the predictor when testing how ViT-small behaves. Batch size plays a role in training as well, so keep that in mind.
