[mieb] Fix siglip bug & add retrieval datasets #1424
Merged
Fixed the SigLIP text tokenizer bug: `padding=True` -> `padding="max_length"`. The old setting caused a significant performance drop in zero-shot classification and in retrieval tasks that involve text (comparing the results here with https://github.com/embeddings-benchmark/tmp, e.g., MNISTZeroShot accuracy goes from ~10 to 80+). Reference: https://github.com/huggingface/transformers/blob/v4.46.2/src/transformers/pipelines/zero_shot_image_classification.py#L147
Removed normalization in SigLIP single-modality encoding, which consistently hurts linear probing performance (comparing CIFAR and MNIST results here with https://github.com/embeddings-benchmark/tmp).
Fix [mieb] google/siglip-large-patch16-384 fails on ImageNet10Clustering #1417
Added EDIS and GLDv2 I2I retrieval.
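
For readers unfamiliar with the padding fix described above, here is a toy sketch of how the two settings differ. It does not use the real tokenizer: `pad_batch`, `PAD_ID`, and `MAX_LENGTH` are hypothetical names for illustration, and the fixed length of 64 is an assumption about the SigLIP text tower rather than something stated in this PR.

```python
from typing import List, Union

PAD_ID = 1        # hypothetical pad token id, for illustration only
MAX_LENGTH = 64   # assumed fixed text length for SigLIP-style models


def pad_batch(
    batch: List[List[int]],
    padding: Union[bool, str],
    max_length: int = MAX_LENGTH,
    pad_id: int = PAD_ID,
) -> List[List[int]]:
    """Toy stand-in for a tokenizer's padding step (not the real API)."""
    if padding is True:
        # Dynamic padding: pad only up to the longest sequence in the batch.
        target = max(len(ids) for ids in batch)
    elif padding == "max_length":
        # Fixed padding: every sequence is padded (or cut) to max_length.
        target = max_length
    else:
        raise ValueError(f"unsupported padding mode: {padding!r}")
    padded = []
    for ids in batch:
        ids = ids[:target]
        padded.append(ids + [pad_id] * (target - len(ids)))
    return padded


batch = [[5, 6, 7], [8, 9]]
dynamic = pad_batch(batch, padding=True)
fixed = pad_batch(batch, padding="max_length")

print([len(row) for row in dynamic])  # lengths follow the longest input
print([len(row) for row in fixed])    # lengths are always MAX_LENGTH
```

With dynamic padding the sequence length depends on the batch contents, so short class-name prompts reach the text encoder at a different length than a fixed-length setup would produce. With the Hugging Face tokenizer, the corresponding call would look roughly like `tokenizer(texts, padding="max_length", return_tensors="pt")`.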
cc @Muennighoff
Checklist
make test
make lint