
Update COIR default.jsonl #27

Closed

Conversation

archersama

I have updated the CoIR leaderboard scores on the MTEB benchmark. Additionally, I have executed the apps.py script to ensure the content is displayed correctly.

@KennethEnevoldsen
Contributor

KennethEnevoldsen commented Sep 5, 2024

Hmm I am not too familiar with the leaderboard, but this seems like a manual input of the results (@Muennighoff). Shouldn't this be derived from the results repository?

@archersama
Author

Hmm I am not too familiar with the leaderboard, but this seems like a manual input of the results (@Muennighoff). Shouldn't this be derived from the results repository?

I noticed that CoIR was incorrectly labeled with only two datasets, so I changed the label to include all ten datasets and it now shows the correct ones. I also uploaded all the results manually. Should I upload the model performance to the results repository instead?

@Samoed
Contributor

Samoed commented Sep 5, 2024

Yes. After that, you should delete EXTERNAL_MODEL_RESULTS.json if the model is present there, and then run refresh.py.
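
(For reference, a rough sketch of that refresh step in Python, assuming the file and script live in the leaderboard repo root; the paths are an assumption, not the actual tooling:)

from pathlib import Path
import subprocess

# Assumed layout: run from the leaderboard repo root.
external_results = Path("EXTERNAL_MODEL_RESULTS.json")
if external_results.exists():
    external_results.unlink()  # drop the cached external results so they get rebuilt
subprocess.run(["python", "refresh.py"], check=True)  # regenerate the leaderboard data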

@KennethEnevoldsen
Contributor

But shouldn't it be specified in the config and then be auto-updated? (@orionw, adding you here as well since you have more experience with the leaderboard.)

Following the instructions for adding a leaderboard tab, this doesn't seem like the intended approach. It might be that the docs are outdated.

@orionw
Collaborator

orionw commented Sep 5, 2024

Thanks @KennethEnevoldsen! Yes, this file will get overwritten when the nightly CI runs.

It is definitely possible that some docs are outdated, sorry about that!

I noticed that CoIR was incorrectly labeled with only two datasets, so I changed the label to include all ten datasets and it now shows the correct ones.

@archersama this should be done in the config that Kenneth referenced, let me know if you have any troubles with it!

I also uploaded all the results manually. Should I upload the model performance to the results repository instead?

Yes, uploading them to the results repository would be the way to go! Or, if the results are attached to models you control on HF, you can add them to their README and don't need to add them to the results repo.

Again let me know if you run into any issues and we should probably add some docs on this. Thanks for raising the issue!

@Samoed
Contributor

Samoed commented Sep 5, 2024

All tasks are already included in the config, but I don't understand why the tab only contains 2 datasets, because there are some results as of embeddings-benchmark/results@319e81d. I'll try to update the leaderboard.

leaderboard/config.yaml

Lines 509 to 529 in fab8bb0

coir:
  title: CoIR
  language_long: "Code"
  has_overall: false
  acronym: null
  icon: "💻"
  special_icons: null
  credits: "[Samoed](https://github.com/Samoed) and [monikernemo](https://github.com/monikernemo) and [CoIR (Xiangyang Li, Kuicai Dong, Yi Quan Lee et al.)](https://arxiv.org/abs/2407.02883)"
  metric: nDCG@10
  tasks:
    Retrieval:
      - AppsRetrieval
      - CodeFeedbackMT
      - CodeFeedbackST
      - CodeSearchNetCCRetrieval
      - CodeSearchNetRetrieval
      - CodeTransOceanContest
      - CodeTransOceanDL
      - CosQA
      - StackOverflowQA
      - SyntheticText2SQL

@Samoed
Contributor

Samoed commented Sep 5, 2024

Found an issue with the results repo: embeddings-benchmark/results#23

@archersama
Author

Found an issue with the results repo: embeddings-benchmark/results#23

So has the label error been fixed? When will we see the new dataset labels?

@KennethEnevoldsen
Contributor

There was an update 6 hours ago, @orionw any idea where the bug could be here?

@Samoed
Contributor

Samoed commented Sep 6, 2024

I don't think this is a bug, because the results can't be updated correctly: embeddings-benchmark/results#25

@KennethEnevoldsen
Contributor

Thanks @Samoed for taking the time for this

@archersama
Author

archersama commented Sep 9, 2024

I don't think this is a bug, because the results can't be updated correctly: embeddings-benchmark/results#25

The CoIR leaderboard error hasn't been fixed yet; is something wrong?

@Samoed
Contributor

Samoed commented Sep 9, 2024

PR still under review

@archersama
Author

PR still under review

Thanks @Samoed for taking the time for this. I see this PR, embeddings-benchmark/results#25, has been merged. So can the CoIR leaderboard be displayed normally now?

@Samoed
Contributor

Samoed commented Sep 11, 2024

Yes, I'm working on fixing some bugs in another PR and will update the leaderboard there. It will have all the results from the results repo.

@Muennighoff
Contributor

Maybe to prevent confusion in the future, we should add some documentation/instruction on how the results & leaderboard work and interoperate. It seems like @Samoed understands it really well - if you want to, it'd be amazing if you added some explanations to the READMEs to help others contribute more easily (https://github.com/embeddings-benchmark/leaderboard/blob/main/README.md & https://github.com/embeddings-benchmark/results/blob/main/README.md)

@archersama
Author

Maybe to prevent confusion in the future, we should add some documentation/instruction on how the results & leaderboard work and interoperate. It seems like @Samoed understands it really well - if you want to, it'd be amazing if you added some explanations to the READMEs to help others contribute more easily (https://github.com/embeddings-benchmark/leaderboard/blob/main/README.md & https://github.com/embeddings-benchmark/results/blob/main/README.md)

I think so too. At the moment, it seems that some documents are outdated and cannot provide the appropriate guidance.

@yeliusf

yeliusf commented Sep 17, 2024

Hi @archersama, I saw that CoIR has been integrated into MTEB. Thanks for the awesome work! I wonder why the results on MTEB are quite different from those reported in the paper or on the leaderboard.
For example:

  • E5-mistral-7b-instruct
    • Apps: Paper: 21.33 → MTEB: 23.46
    • CodeFeedbackMT: Paper: 33.65 → MTEB: 36.4
    • CodeFeedbackST: Paper: 72.71 → MTEB: 76.41
    • CodeTransOceanContest: Paper: 82.55 → MTEB: 88.58
    • and so on.

@Samoed
Contributor

Samoed commented Sep 17, 2024

I think this is mostly because mteb includes titles in the prompt, but the paper doesn't include them.

@yeliusf

yeliusf commented Sep 17, 2024

Thanks, @Samoed. However, I noticed that most datasets in CoIR don't have titles. For example, in the 'apps' dataset, the title field is empty: https://huggingface.co/datasets/CoIR-Retrieval/apps-queries-corpus. Despite this, the performance improved from 21.33 in the paper to 23.46 in MTEB.
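
(A quick way to check this locally; the config and split names below are guesses about how the dataset is laid out on the Hub, not confirmed values:)

from datasets import load_dataset

# Inspect the corpus records of the CoIR apps dataset; "corpus" as config and
# split name is an assumption and may need adjusting.
corpus = load_dataset("CoIR-Retrieval/apps-queries-corpus", "corpus", split="corpus")
print(corpus[0])  # check whether the 'title' field is empty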

@KennethEnevoldsen
Contributor

Just adding @monikernemo here as well.

Despite this, the performance improved from 21.33 in the paper to 23.46 in MTEB.

This seems odd. @Samoed any ideas about why this might happen?

@Samoed
Contributor

Samoed commented Sep 18, 2024

Maybe the default task prompt affects the results; I don't know whether a prompt was used in the original paper.

@monikernemo

monikernemo commented Sep 18, 2024

Besides model prompts, due to resource constraints, we may have run the CoIR benchmarks at lowered model precision.

@yeliusf

yeliusf commented Sep 18, 2024

Thanks @Samoed @monikernemo for the explanation.

Could you please share the prompt that you used?

I tried my own prompts and evaluated CoIR with MTEB using the E5-mistral model:
CoIR-Text-to-Code: "Given a text description or question, retrieve the complete code program that solves it",
CoIR-Code-to-Text: "Given a code snippet, retrieve the text that describes it",
CoIR-Code-to-Code: "Given a code snippet, retrieve relevant code to complete or enhance it",
CoIR-Hybrid: "Given a code snippet and a related text description or question, retrieve the relevant code or text that completes, explains, or enhances it"

Results Comparison of E5-Mistral:

Dataset                         APPS   CosQA  CSN-Go  CSN-Java  CSN-Javascript  CSN-PHP  CSN-Python  CSN-Ruby
MTEB Report                     23.46  33.1   93.06   84.08     80.93           83.14    91.75       85.37
Reproduced using prompt above   26.11  30.59  69.82   62.42     52.22           57.33    72.75       53.48

The main discrepancy is with CoIR CodeSearchNet, where I observe more than a 20% decrease in performance. I don't understand why this is happening. Could you please share the prompt and settings that you used?

@Samoed
Contributor

Samoed commented Sep 18, 2024

@Muennighoff added the results for e5-mistral, but I think he just ran mteb with the default prompt: https://github.com/embeddings-benchmark/mteb/blob/main/mteb/models/instructions.py#L17. You can add these prompts to mteb.
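
(For anyone reproducing this, a minimal sketch of running a single CoIR task through mteb; the model ID and output folder are assumptions, and the prompt-handling details may differ between mteb versions:)

import mteb
from sentence_transformers import SentenceTransformer

# Load the model and one CoIR task, then run the standard mteb evaluation.
model = SentenceTransformer("intfloat/e5-mistral-7b-instruct")
tasks = mteb.get_tasks(tasks=["AppsRetrieval"])
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results/e5-mistral-7b-instruct")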

@archersama
Author

@Samoed Thank you for your efforts. I would like to ask whether there is a way to average the evaluation metrics of the CodeSearchNet and CodeSearchNet-CCR subsets and merge them into single CodeSearchNet and CodeSearchNet-CCR entries, similar to the CoIR leaderboard.

@Samoed
Contributor

Samoed commented Sep 23, 2024

There is one possible way, but it is a bit hacky and I don't know how it would work with the leaderboard. Currently, a special aggregation task is created for the CQADupstack tasks when results are exported, but as far as I know this is not exported to the results repo directly. It might be possible to do something on the leaderboard side only, but I'm not sure.
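
(A toy sketch of the kind of aggregation being discussed, i.e. collapsing per-language CodeSearchNet scores into one CoIR-style number; the scores below are placeholders, not real results, and this is not the actual leaderboard code:)

from statistics import mean

# Placeholder per-language nDCG@10 scores for one model; replace with real values.
per_language_ndcg10 = {
    "go": 0.0, "java": 0.0, "javascript": 0.0,
    "php": 0.0, "python": 0.0, "ruby": 0.0,
}
aggregate = mean(per_language_ndcg10.values())
print(f"CodeSearchNetRetrieval (aggregated nDCG@10): {aggregate:.2f}")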

@archersama
Author

There is one possible way, but it is a bit hacky and I don't know how it would work with the leaderboard. Currently, a special aggregation task is created for the CQADupstack tasks when results are exported, but as far as I know this is not exported to the results repo directly. It might be possible to do something on the leaderboard side only, but I'm not sure.

Perhaps we could consult with professionals, simply to ensure that the results are correctly displayed on the leaderboard for the two datasets. @Muennighoff @KennethEnevoldsen

@KennethEnevoldsen
Contributor

Re scores of E5 Mistral:

E5-Mistral is run using its implementation within MTEB, and for the prompt it uses the default prompts (as @Samoed says). So everything should be reproducible.

A good way to debug it is probably to test whether the non-prompt models obtain the same performance (if that is not the case, it is probably the task implementation). After that, we can examine the smaller e5-large-instruct and see if the effect is due to the prompt.

Perhaps we could consult with professionals, simply to ensure that the results are correctly displayed on the leaderboard for the two datasets. @Muennighoff @KennethEnevoldsen

We do have aggregation for CQADupstack, but that is for backward compatibility, and I would prefer not to add more datasets using the same format.

I have suggested a potential approach to solve this issue here: embeddings-benchmark/mteb#1231

Perhaps we could consult with professionals

@Samoed is def. a professional, but none of us have the full overview of everything that goes on in the code.

@Samoed
Contributor

Samoed commented Sep 24, 2024

I ran the COIR benchmark with e5-base-v2 and bge-m3 for some tasks (since others didn't fit in Kaggle's memory with batch size 8). The results for e5-base-v2 match the benchmark, but the results for bge-m3 are different.

results.zip

@yeliusf

yeliusf commented Sep 24, 2024

Hi @Samoed,

Thanks for running the experiments for e5-base-v2 and bge-m3. Could you also evaluate the E5-mistral model?

I tried the new prompt you mentioned here, but the results I got are still different from the MTEB leaderboard:

Dataset                          CSN-Go  CSN-Java  CSN-Javascript  CSN-PHP  CSN-Python  CSN-Ruby
MTEB Report                      93.06   84.08     80.93           83.14    91.75       85.37
Reproduced using the new prompt  64.81   56.03     49.07           54.05    62.11       43.69

@Samoed
Contributor

Samoed commented Sep 24, 2024

AppsRetrieval.json
CodeSearchNetRetrieval.json
COIRCodeSearchNetRetrieval.json
@yeliusf Here are my results for the Mistral model on Go. I think you ran COIRCodeSearchNetRetrieval, while the leaderboard reports CodeSearchNetRetrieval.
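
(A quick check that both task names exist in mteb and are easy to confuse; a sketch assuming the current task registry and metadata layout:)

import mteb

# Both tasks are registered under similar names; print their descriptions to
# see which one the leaderboard column actually corresponds to.
for name in ["CodeSearchNetRetrieval", "COIRCodeSearchNetRetrieval"]:
    task = mteb.get_tasks(tasks=[name])[0]
    print(name, "-", task.metadata.description)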

@yeliusf

yeliusf commented Sep 24, 2024

Hi @Samoed,

I see, that makes sense. However, the CoIR paper mentions that their version of CodeSearchNet is focused on code summary retrieval, which differs from the original CodeSearchNet. Should we follow the CoIR paper, rename the task on the MTEB CoIR leaderboard to COIRCodeSearchNetRetrieval rather than CodeSearchNetRetrieval, and report the COIRCodeSearchNetRetrieval score?

Could you also share your evaluation code? I noticed a gap between my results and your COIRCodeSearchNet score.

@archersama
Author

@Samoed Can you help with aggregating the subtasks of CodeSearchNet and CodeSearchNet-CCR into one each, using the CoIR format? That should help avoid confusion for everyone!

@Samoed
Contributor

Samoed commented Sep 25, 2024

@archersama I found a way to implement this using the results folder, but it won't work with self-reported results. For a more general solution to your problem, I think you can comment in this issue: embeddings-benchmark/mteb#1231

archersama closed this Sep 29, 2024