
Investigate performance discrepancies in gte-Qwen and NV-embed models #1600

Open
isaac-chung opened this issue Dec 16, 2024 · 5 comments

@isaac-chung
Collaborator

From #1436

@AlexeyVatolin
Contributor

AlexeyVatolin commented Dec 17, 2024

Hello,

I conducted a comparison of the models using the examples provided in the readme.md file for each model. Here's a summary of my findings:

  • Alibaba-NLP/gte-Qwen2-7B-instruct
  • Alibaba-NLP/gte-Qwen1.5-7B-instruct
  • Alibaba-NLP/gte-Qwen2-1.5B-instruct
  • Linq-AI-Research/Linq-Embed-Mistral

    For these models, all three implementations (Transformers AutoModel, sentence_transformers, and mteb) produce identical embeddings. This consistency is great to see.

  • nvidia/NV-Embed-v2
  • nvidia/NV-Embed-v1

    For these models, the official Transformers AutoModel implementation differs from the official sentence_transformers implementation, which is unexpected. The mteb implementation matches sentence_transformers exactly.

I also wanted to share the code I used for this comparison: View the Gist

Please note that the correctness of prompt usage was outside the scope of this comparison. It does, however, show that the models added to mteb are implemented correctly.
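
As a rough illustration of the check (the full code is in the linked gist), a minimal sketch for one of the gte-Qwen models could look like the following. It assumes the last-token pooling shown in the model card; the exact prompts, pooling, and settings used in the gist may differ.

```python
# Minimal sketch (not the exact gist): compare the Transformers AutoModel path
# with the sentence_transformers path for one gte-Qwen model.
# Assumption: last-token pooling + L2 normalization, as in the model card.
import numpy as np
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
from sentence_transformers import SentenceTransformer

model_name = "Alibaba-NLP/gte-Qwen2-1.5B-instruct"
sentences = ["What is the capital of China?", "Explain gravity."]

def last_token_pool(last_hidden_states, attention_mask):
    # Take the hidden state of the last non-padding token of each sequence
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    sequence_lengths = attention_mask.sum(dim=1) - 1
    return last_hidden_states[torch.arange(last_hidden_states.shape[0]), sequence_lengths]

# Transformers AutoModel path
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    out = model(**batch)
hf_emb = F.normalize(last_token_pool(out.last_hidden_state, batch["attention_mask"]), p=2, dim=1).numpy()

# sentence_transformers path
st_model = SentenceTransformer(model_name, trust_remote_code=True)
st_emb = st_model.encode(sentences, normalize_embeddings=True)

# If the two implementations are consistent, the embeddings match up to float tolerance
print(np.allclose(hf_emb, st_emb, atol=1e-4))
```

For the gte-Qwen and Linq models a check along these lines agrees; for the NV-Embed models, the analogous AutoModel vs. sentence_transformers comparison is where the outputs diverge.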

P.S.
I created a discussion in the NVIDIA repository about this problem.

@AlexeyVatolin
Contributor

AlexeyVatolin commented Dec 17, 2024

The Qwen model repository includes a script to calculate scores for their models on the MTEB benchmark. I ran this script on the same tasks covered in my pull request.

In most cases, the results from the original script are worse than those reported on the leaderboard, and they also fall short of the results obtained with the mteb model implementations.
Here is the command I used to run the script:

```
OPENBLAS_NUM_THREADS=8 python scripts/eval_mteb.py -m Alibaba-NLP/gte-Qwen2-1.5B-instruct --output_dir results_qwen_2_1.5b_eval_mteb --task mteb
```

Additionally, there is an open discussion about this on the Qwen model repository.
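
For reference, the "Pull request" rows in the tables below come from mteb's own model implementations rather than the Qwen script. A minimal sketch of that setup (assuming a recent mteb release that provides mteb.get_model and mteb.get_tasks; the task subset and output folder here are only illustrative):

```python
# Rough sketch of the mteb-side evaluation (not the exact code from the pull request).
# Assumes a recent mteb release with mteb.get_model / mteb.get_tasks.
import mteb

model = mteb.get_model("Alibaba-NLP/gte-Qwen2-1.5B-instruct")
tasks = mteb.get_tasks(tasks=[
    "AmazonCounterfactualClassification",
    "ArxivClusteringS2S",
    "SprintDuplicateQuestions",
    "SciDocsRR",
    "SCIDOCS",
    "STS16",
    "SummEval",
])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results_qwen_2_1.5b_mteb")
```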

Classification

| Model | Source | AmazonCounterfactualClassification | EmotionClassification | ToxicConversationsClassification |
| --- | --- | --- | --- | --- |
| gte-Qwen1.5-7B-instruct | Leaderboard | 83.16 | 54.53 | 78.75 |
| gte-Qwen1.5-7B-instruct | Pull request | 81.78 | 54.91 | 77.25 |
| gte-Qwen1.5-7B-instruct | Original script | 67.87 | 46.08 | 59.06 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 83.99 | 61.37 | 82.66 |
| gte-Qwen2-1.5B-instruct | Pull request | 82.51 | 65.66 | 84.54 |
| gte-Qwen2-1.5B-instruct | Original script | 71.81 | 54.56 | 65.1 |

Clustering

| Model | Source | ArxivClusteringS2S | RedditClustering |
| --- | --- | --- | --- |
| gte-Qwen1.5-7B-instruct | Leaderboard | 51.45 | 73.37 |
| gte-Qwen1.5-7B-instruct | Pull request | 53.57 | 80.12 |
| gte-Qwen1.5-7B-instruct | Original script | 47.88 | 64.43 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 45.01 | 55.82 |
| gte-Qwen2-1.5B-instruct | Pull request | 44.61 | 51.36 |
| gte-Qwen2-1.5B-instruct | Original script | 41.1 | 52.53 |

PairClassification

| Model | Source | SprintDuplicateQuestions | TwitterSemEval2015 |
| --- | --- | --- | --- |
| gte-Qwen1.5-7B-instruct | Leaderboard | 96.07 | 79.36 |
| gte-Qwen1.5-7B-instruct | Pull request | 94.51 | 80.72 |
| gte-Qwen1.5-7B-instruct | Original script | 91.44 | 61.92 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 95.32 | 79.64 |
| gte-Qwen2-1.5B-instruct | Pull request | 91.19 | 75.93 |
| gte-Qwen2-1.5B-instruct | Original script | 93.87 | 74.59 |

Reranking

| Model | Source | SciDocsRR | AskUbuntuDupQuestions |
| --- | --- | --- | --- |
| gte-Qwen1.5-7B-instruct | Leaderboard | 87.89 | 66 |
| gte-Qwen1.5-7B-instruct | Pull request | 88.26 | 64.03 |
| gte-Qwen1.5-7B-instruct | Original script | 85.2 | 57.32 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 86.52 | 64.55 |
| gte-Qwen2-1.5B-instruct | Pull request | 85.67 | 62.33 |
| gte-Qwen2-1.5B-instruct | Original script | 83.51 | 60.47 |

Retrieval

| Model | Source | SCIDOCS | SciFact |
| --- | --- | --- | --- |
| gte-Qwen1.5-7B-instruct | Leaderboard | 27.69 | 75.31 |
| gte-Qwen1.5-7B-instruct | Pull request | 26.34 | 75.8 |
| gte-Qwen1.5-7B-instruct | Original script | 22.38 | 74.34 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 24.98 | 78.44 |
| gte-Qwen2-1.5B-instruct | Pull request | 23.4 | 77.47 |
| gte-Qwen2-1.5B-instruct | Original script | 21.92 | 75.81 |

STS

| Model | Source | STS16 | STSBenchmark |
| --- | --- | --- | --- |
| gte-Qwen1.5-7B-instruct | Leaderboard | 86.39 | 87.35 |
| gte-Qwen1.5-7B-instruct | Pull request | 85.98 | 86.86 |
| gte-Qwen1.5-7B-instruct | Original script | 81.33 | 83.65 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 85.45 | 86.38 |
| gte-Qwen2-1.5B-instruct | Pull request | 84.71 | 84.71 |
| gte-Qwen2-1.5B-instruct | Original script | 85.35 | 86.04 |

Summarization

| Model | Source | SummEval |
| --- | --- | --- |
| gte-Qwen1.5-7B-instruct | Leaderboard | 31.46 |
| gte-Qwen1.5-7B-instruct | Pull request | 31.22 |
| gte-Qwen1.5-7B-instruct | Original script | 30.07 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 31.17 |
| gte-Qwen2-1.5B-instruct | Pull request | 30.5 |
| gte-Qwen2-1.5B-instruct | Original script | 28.99 |

@KennethEnevoldsen
Contributor

From this it seems like we should update the scores on the leaderboard with the new reproducible scores. Since the authors have been made aware (issues on the NVIDIA and Qwen repositories), I believe this is a fair decision to make.

@AlexeyVatolin, have you run the models? Otherwise I will ask Niklas to rerun them.

@afalf

afalf commented Dec 24, 2024

> From this it seems like we should update the scores on the leaderboard with the new reproducible scores. Since the authors have been made aware (issues on the NVIDIA and Qwen repositories), I believe this is a fair decision to make.
>
> @AlexeyVatolin, have you run the models? Otherwise I will ask Niklas to rerun them.

I'm a member of the gte-Qwen series model team. Sorry, we checked and found some errors in the previous script. It has now been updated and verified to be consistent with the results on the leaderboard. Please try again with the latest script to check the results.

@AlexeyVatolin
Contributor

@afalf, thanks a lot! I've run the gte-Qwen models with the updated script and will post the results as soon as I have them.
