This repository contains the code for the TMLR paper *Demonstrating and Reducing Shortcuts in Vision-Language Representation Learning* by Maurits Bleeker, Mariya Hendriksen, Andrew Yates, and Maarten de Rijke (University of Amsterdam, The Netherlands).
The implementation builds upon the codebase of Latent Target Decoding.
- Jul 2024: The paper has been accepted by TMLR
- Feb 2024: Initial arXiv release
To set up the environment, install the requirements using the provided YAML file:
```bash
conda env create --file src/environment.yaml
```

This command creates a conda environment named `contrastive-shortcuts`. Activate the created environment:

```bash
source activate contrastive-shortcuts
```
For local development, execute the following command:
```bash
python src/trainer.py --yaml_file src/configs/{f30k, coco}/development_local.yaml
```
To train a model, run `python src/trainer.py` and provide a base config in YAML format using `--yaml_file <config path.yaml>`.
Hyperparameters can be overridden using command line flags. For example:
```bash
python src/trainer.py --yaml_file src/configs/f30k/development_local.yaml --experiment.wandb_project <your project name>
```
The recommended approach is to have a fixed base config for each experiment and only modify specific hyperparameters for different training/evaluation settings.
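As an illustration of this override mechanism, a dotted flag such as `--experiment.wandb_project` can be merged into a base YAML config along the lines of the sketch below. This is a hypothetical example, not the repository's actual parsing code; `apply_override` and `load_config` are illustrative names.

```python
# Hypothetical sketch of merging dotted CLI overrides into a base YAML config;
# not the repository's actual implementation.
import sys
import yaml  # requires PyYAML


def apply_override(config: dict, dotted_key: str, value: str) -> None:
    """Set config['a']['b'] = value for a dotted key 'a.b'."""
    keys = dotted_key.split(".")
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value


def load_config(yaml_file: str, overrides: list) -> dict:
    """Load a base YAML config and apply --key.subkey value overrides."""
    with open(yaml_file) as f:
        config = yaml.safe_load(f)
    for flag, value in zip(overrides[::2], overrides[1::2]):
        apply_override(config, flag.lstrip("-"), value)
    return config


if __name__ == "__main__":
    # e.g. python merge_config.py base.yaml --experiment.wandb_project my-project
    print(load_config(sys.argv[1], sys.argv[2:]))
```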
All training and evaluation were conducted using a SLURM-based scheduling system.
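The `.job` files under `src/jobs/` are submitted with `sbatch`; a minimal sketch of such a job file, with placeholder resource values and paths rather than the repository's actual settings, looks like this:

```bash
#!/bin/bash
#SBATCH --job-name=shortcuts-train    # placeholder job name
#SBATCH --partition=gpu               # adjust to your cluster's partition
#SBATCH --gres=gpu:1                  # placeholder resource request
#SBATCH --time=24:00:00               # placeholder time limit

source activate contrastive-shortcuts
python src/trainer.py --yaml_file src/configs/f30k/development_local.yaml
```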
We implemented a PyTorch `Dataloader` class that loads the images from the memory of the compute node on which training runs. The captions are loaded from either the Flickr30k or MS-COCO annotation file.
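A minimal sketch of what such an in-memory image-caption dataset can look like is given below; the class name, annotation format, and field names are hypothetical and do not mirror the repository's implementation.

```python
# Hypothetical sketch of an in-memory image-caption dataset; class name,
# annotation format, and field names are illustrative only.
import json

from PIL import Image
from torch.utils.data import Dataset


class InMemoryCaptionDataset(Dataset):
    def __init__(self, img_path, annotation_file, transform=None):
        # Assumes a JSON list of {"image": <filename>, "caption": <text>} entries.
        with open(annotation_file) as f:
            self.annotations = json.load(f)
        self.transform = transform
        # Decode every image into RAM once, so __getitem__ never touches disk.
        self.images = {
            ann["image"]: Image.open(f"{img_path}/{ann['image']}").convert("RGB")
            for ann in self.annotations
        }

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, idx):
        ann = self.annotations[idx]
        image = self.images[ann["image"]]
        if self.transform is not None:
            image = self.transform(image)
        return image, ann["caption"]
```

Such a dataset can then be wrapped in a regular `torch.utils.data.DataLoader` for training.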
Update the `*.yaml` config with the right file paths:

```yaml
img_path:
annotation_file:
annotation_path:
```
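For example, the entries could be filled in as follows; the paths below are placeholders and should point to your own data locations.

```yaml
# Placeholder paths; replace with your own data locations.
img_path: /scratch/data/f30k/images
annotation_file: /scratch/data/f30k/annotations.json
annotation_path: /scratch/data/f30k/annotations
```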
To create the vocabulary class, run the following script with the appropriate input flags:

```bash
python utils/vocab.py
```
Job and hyperparameter files to reproduce the experiments can be found in `src/jobs/{coco, f30k}/`.
The shortcut experiments (Section 4) are in the `shortcuts` folder, the LTD experiments in the `LTD` folder, and the IFM experiments (Section 6) in the `IFM` folder.
To reproduce the results from Section 3, run the following evaluation script (ensure the file paths are correct):

```bash
sbatch src/jobs/{coco, f30k}/snellius/shortcuts/{clip, vse}/{clip, vse}_{coco, f30k}_shortcut_experiments_eval.job
```

Next, copy all the RSUM values into `notebooks/visualizations/visualization.ipynb` to generate the plot.
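RSUM commonly refers, in the image-caption retrieval literature, to the sum of Recall@1, Recall@5, and Recall@10 for both image-to-text and text-to-image retrieval. A minimal sketch of the computation, with arbitrary example numbers, is shown below.

```python
def rsum(i2t_recalls, t2i_recalls):
    """Sum of Recall@{1, 5, 10} for image-to-text and text-to-image retrieval."""
    return sum(i2t_recalls) + sum(t2i_recalls)


# Arbitrary example values (R@1, R@5, R@10), given as percentages.
print(rsum((58.4, 81.5, 88.1), (41.3, 71.1, 81.2)))  # 421.6
```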
The results from Section 6 are generated using `notebooks/Evaluation.ipynb`.
If you find this repository helpful, feel free to cite our paper "Demonstrating and Reducing Shortcuts in Vision-Language Representation Learning":
```bibtex
@article{bleeker-2024-demonstrating,
  title   = {Demonstrating and Reducing Shortcuts in Vision-Language Representation Learning},
  author  = {Bleeker, Maurits and Hendriksen, Mariya and Yates, Andrew and de Rijke, Maarten},
  journal = {Transactions on Machine Learning Research},
  url     = {https://openreview.net/forum?id=gfANevPraH},
  year    = {2024}
}
```