
Abnormally low values for NanoBEIR benchmark #1627

Open
minsik-ai opened this issue Dec 24, 2024 · 4 comments

Comments

@minsik-ai

Continuing from #1588

NanoBEIR performance on Touche2020 and NFCorpus is too low compared to reported values.

You can check out some of the values here: embeddings-benchmark/results#72

@isaac-chung
Collaborator

@minsik-ai could you please specify:

  1. which model you tried, the script and/or commands you used,
  2. the corresponding results file in that PR you linked, and
  3. what values (metrics) you're comparing

Thanks in advance!

@Samoed
Collaborator

Samoed commented Dec 24, 2024

The original blog post only presents results for e5-mistral-based models, and they're hard to reproduce because we don't know which prompts were used during evaluation. I think @ArthurCamara might be able to share some insights on how they evaluated models on NanoBEIR.
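For context on why the prompt matters: instruction-tuned e5 models wrap every query in an instruction template, and the scores shift depending on the task description used. A minimal sketch of the template as published in the intfloat/e5-mistral-7b-instruct model card (the task description below is a hypothetical example — the one actually used for the NanoBEIR numbers is the unknown part):

```python
def get_detailed_instruct(task_description: str, query: str) -> str:
    """Build an e5-mistral-style query prompt.

    Template follows the intfloat/e5-mistral-7b-instruct model card;
    the task_description used for the reported NanoBEIR runs is unknown.
    """
    return f"Instruct: {task_description}\nQuery: {query}"


# Hypothetical task description -- a different choice changes the scores.
task = "Given a web search query, retrieve relevant passages that answer the query"
prompt = get_detailed_instruct(task, "what is the capital of France?")
print(prompt)
```

Documents, by contrast, are encoded without any instruction prefix in that setup, so a mismatch on the query side alone can produce exactly this kind of per-task degradation.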

@Samoed
Collaborator

Samoed commented Dec 24, 2024

I've evaluated multilingual-e5-small on NanoBEIR with both MTEB and Sentence Transformers (Code). Scores are ndcg@10:

| Task | MTEB | Sentence Transformers |
|---|---|---|
| NanoArguAna | 0.44536 | 0.444486 |
| NanoClimateFever | 0.2222 | 0.30642 |
| NanoDBPedia | 0.17534 | 0.6053 |
| NanoFever | 0.80845 | 0.30642 |
| NanoFiQA2018 | 0.34363 | 0.4430 |
| NanoHotpotQA | 0.56911 | 0.81012 |
| NanoMSMARCO | 0.62091 | 0.62091 |
| NanoNFCorpus | 0.05535 | 0.2885 |
| NanoNQ | 0.67664 | 0.68618 |
| NanoQuora | 0.90621 | 0.97279 |
| NanoSCIDOCS | 0.20826 | 0.34377 |
| NanoSciFact | 0.71129 | 0.72457 |
| NanoTouche2020 | 0.19598 | 0.49540 |
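The gap can be summarized mechanically. A minimal sketch that flags tasks whose two ndcg@10 values (from the table above) differ by more than ~0.01 — the 0.015 tolerance is an arbitrary cutoff, not anything from either library:

```python
# ndcg@10 scores from the table above: task -> (MTEB, Sentence Transformers).
scores = {
    "NanoArguAna":      (0.44536, 0.444486),
    "NanoClimateFever": (0.2222,  0.30642),
    "NanoDBPedia":      (0.17534, 0.6053),
    "NanoFever":        (0.80845, 0.30642),
    "NanoFiQA2018":     (0.34363, 0.4430),
    "NanoHotpotQA":     (0.56911, 0.81012),
    "NanoMSMARCO":      (0.62091, 0.62091),
    "NanoNFCorpus":     (0.05535, 0.2885),
    "NanoNQ":           (0.67664, 0.68618),
    "NanoQuora":        (0.90621, 0.97279),
    "NanoSCIDOCS":      (0.20826, 0.34377),
    "NanoSciFact":      (0.71129, 0.72457),
    "NanoTouche2020":   (0.19598, 0.49540),
}

TOL = 0.015  # arbitrary: treat differences of ~0.01 or less as agreement
mismatched = sorted(t for t, (mteb, st) in scores.items() if abs(mteb - st) > TOL)
matched = sorted(t for t in scores if t not in mismatched)
print("mismatched:", mismatched)  # 9 tasks
print("matched:", matched)       # 4 tasks
```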

Not matching results:

  • NanoClimateFever
  • NanoDBPedia
  • NanoFever
  • NanoFiQA2018
  • NanoHotpotQA
  • NanoNFCorpus
  • NanoQuora
  • NanoSCIDOCS
  • NanoTouche2020

Matching results:

  • NanoArguAna
  • NanoMSMARCO
  • NanoNQ
  • NanoSciFact (diff 0.01)

@minsik-ai
Author

@Samoed's findings show the main difference I've seen!
NFCorpus is in the 0.05 range for MTEB, compared to the 0.2 range for Sentence Transformers.
I've also run additional experiments with intfloat/e5-mistral-7b-instruct and seen similar performance degradation.
