[mieb] Fix siglip bug & add retrieval datasets #1424

gowitheflow-1998 · 2024-11-10T03:03:26Z

1. Fixed a SigLIP text tokenizer bug: changed padding=True to padding="max_length". The old setting caused a significant performance drop in zero-shot classification and in retrieval tasks that involve text (compare the results here with those in https://github.com/embeddings-benchmark/tmp; e.g., MNISTZeroShot goes from ~10 accuracy to 80+). Reference: https://github.com/huggingface/transformers/blob/v4.46.2/src/transformers/pipelines/zero_shot_image_classification.py#L147
2. Removed normalization in SigLIP single-modality encoding, which consistently hurt linear probing performance (compare the CIFAR and MNIST results here with those in https://github.com/embeddings-benchmark/tmp).
3. Fixes [mieb] google/siglip-large-patch16-384 fails on ImageNet10Clustering #1417
4. Added EDIS and GLD-v2 I2I retrieval.
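
For readers skimming the tokenizer change, here is a minimal sketch of the padding rule it follows. It is not the code in this PR: `padding_strategy` and `tokenize_texts` are hypothetical helper names, and the sketch only assumes that `tokenizer` accepts the `padding` and `return_tensors` keyword arguments the way a Hugging Face tokenizer does. The rule itself (pad SigLIP text to the maximum length, pad everything else to the longest sequence in the batch) is my reading of the referenced zero-shot pipeline.

```python
from typing import Any, Callable, Sequence, Union


def padding_strategy(model_type: str) -> Union[bool, str]:
    """Pick the tokenizer `padding` argument for a model type.

    Hypothetical helper: SigLIP gets "max_length" (fixed-length padding);
    other models pad to the longest sequence in the batch (padding=True).
    """
    return "max_length" if model_type == "siglip" else True


def tokenize_texts(
    tokenizer: Callable[..., Any], texts: Sequence[str], model_type: str
) -> Any:
    """Tokenize a batch of texts with the padding chosen above.

    `tokenizer` is assumed to behave like a Hugging Face tokenizer,
    i.e. callable with `padding=` and `return_tensors=` keyword arguments.
    """
    return tokenizer(
        list(texts),
        padding=padding_strategy(model_type),
        return_tensors="pt",
    )
```

With a real checkpoint, the `tokenizer` argument would typically come from something like `AutoTokenizer.from_pretrained("google/siglip-large-patch16-384")`. The accuracy gap reported above suggests SigLIP is sensitive to the text padding scheme, so padding to the batch maximum alone is not enough for this model.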

cc @Muennighoff

Checklist

Run tests locally to make sure nothing is broken using make test.
Run the formatter to format the code using make lint.

isaac-chung

Nice! I do wonder if points 1 and 2 should be applied to any other models 🤔

gowitheflow-1998 · 2024-11-10T11:38:10Z

Nice! I do wonder if points 1 and 2 should be applied to any other models 🤔

Looks like the tokenizer issue is SigLIP-only! https://github.com/huggingface/transformers/blob/v4.46.2/src/transformers/pipelines/zero_shot_image_classification.py#L147

Re normalization: it looks like all other models by default do not normalize embeddings in single-modality encoding in our implementation, although I think we could investigate normalizing them before fusion!
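
As an illustration of the normalization point, here is a small sketch in plain Python. It is not the MIEB implementation: `l2_normalize` and `cosine_similarity` are hypothetical helpers, and the two-dimensional vectors are made-up toy values. The idea it shows is that an encoder can return raw embeddings (keeping magnitude information available to a downstream linear probe) while a similarity function normalizes at comparison time.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def l2_normalize(v: Vector, eps: float = 1e-12) -> List[float]:
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity that normalizes internally, so callers can pass raw embeddings."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


# Toy embeddings: same direction, different magnitudes (norms 5 and 10).
raw_a = [3.0, 4.0]
raw_b = [6.0, 8.0]

# After normalization the two vectors are indistinguishable, so any signal
# carried by the magnitude is no longer available to a linear classifier.
unit_a = l2_normalize(raw_a)
unit_b = l2_normalize(raw_b)

# Similarity-based tasks are unaffected by when normalization happens.
similarity = cosine_similarity(raw_a, raw_b)
```

Whether magnitude actually carries useful signal for a given model is an empirical question; the CIFAR and MNIST linear probing results mentioned in the PR description are the evidence offered here for SigLIP.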

gowitheflow-1998 added 5 commits November 10, 2024 02:40

fix siglip

9783cf1

add edis&gld-v2 i2i

19f435c

results

ee336ff

siglip updated results

f8f9a0d

fix siglip non-dataloader tasks

3a78bee

isaac-chung approved these changes Nov 10, 2024

gowitheflow-1998 merged commit f60465a into mieb Nov 10, 2024
10 checks passed

gowitheflow-1998 deleted the fix-siglip-add-datasets branch November 10, 2024 12:07

This was referenced Nov 10, 2024

[mieb] google/siglip-large-patch16-384 fails on ImageNet10Clustering #1417

Open

[mieb] mieb scripts (siglip rerun & linear probing ablation & params count) #1429

Merged
