Update COIR default.jsonl #27
Conversation
Hmm, I am not too familiar with the leaderboard, but this seems like a manual input of the results (@Muennighoff). Shouldn't this be derived from the results repository?
I noticed that CoIR was incorrectly labeled with only two datasets, so I changed the label to include all ten datasets, and it now shows the correct ones. I also uploaded all the results manually. So should I upload the model performance to the results repository?
Yes. And after that you should delete
But shouldn't it be specified in the config and then be auto-updated? (@orionw, adding you here as well since you have more experience with the leaderboard.) Following the instructions for a leaderboard tab, this doesn't seem like the intended approach. It might be that the docs are outdated.
Thanks @KennethEnevoldsen! Yes, this file will get overwritten when the nightly CI runs. It is definitely possible that some docs are outdated, sorry about that!
@archersama, this should be done in the config that Kenneth referenced; let me know if you have any trouble with it!
Yes, uploading them to the results repository would be the way to go! Or, if the results are attached to models you control on HF, you can add them to their README and don't need to add them to the results repo. Again, let me know if you run into any issues, and we should probably add some docs on this. Thanks for raising the issue!
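For reference, a minimal sketch of the usual flow for producing the per-task result files that get contributed to the results repository. The model name, task choice, and output path below are only examples, not a prescription, and the exact file layout depends on the mteb version:

```python
# Minimal sketch (illustrative only): run an evaluation with mteb and collect the
# per-task JSON result files it writes; these files are what gets contributed to
# the results repository (or attached to a model card you control).
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-base-v2")       # example model
tasks = mteb.get_tasks(tasks=["AppsRetrieval"])          # example CoIR task
evaluation = mteb.MTEB(tasks=tasks)

# Result JSON files are written under `output_folder`.
evaluation.run(model, output_folder="results/e5-base-v2")
```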
All tasks are already included in the config, but I don't understand why the tab only contains 2 datasets, because there are some results in embeddings-benchmark/results@319e81d. I'll try to update the leaderboard (lines 509 to 529 in fab8bb0).
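For context, the referenced config block is presumably just the list of task names backing the CoIR tab. A hypothetical illustration only; the variable name is made up, and the task names follow my understanding of mteb's CoIR naming and should be checked against the task registry:

```python
# Hypothetical illustration (not the actual leaderboard source): a tab is
# presumably driven by a plain list of task names, so all ten CoIR datasets
# need to be listed here for them to show up.
TASK_LIST_COIR = [
    "AppsRetrieval",
    "CodeFeedbackMT",
    "CodeFeedbackST",
    "CodeSearchNetCCRetrieval",
    "COIRCodeSearchNetRetrieval",  # see the later discussion on CodeSearchNet naming
    "CodeTransOceanContest",
    "CodeTransOceanDL",
    "CosQA",
    "StackOverflowQA",
    "SyntheticText2SQL",
]
```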
Found an issue with the results repo: embeddings-benchmark/results#23
So the label error has been fixed? When will we see the new dataset labels?
There was an update 6 hours ago. @orionw, any idea where the bug could be here?
I don't think this is a bug, because the results can't be updated correctly: embeddings-benchmark/results#25
Thanks @Samoed for taking the time for this
The CoIR leaderboard error hasn't been fixed yet; is something wrong?
PR still under review
Thanks @Samoed for taking the time for this. I see this PR, embeddings-benchmark/results#25, has been merged. So the CoIR leaderboard can now be displayed normally?
Yes, I'm working on fixing some bugs in another PR and will update the leaderboard there. It will have all results from the results repo.
Maybe to prevent confusion in the future, we should add some documentation/instruction on how the results & leaderboard work and interoperate. It seems like @Samoed understands it really well - if you want to, it'd be amazing if you added some explanations to the READMEs to help others contribute more easily (https://github.com/embeddings-benchmark/leaderboard/blob/main/README.md & https://github.com/embeddings-benchmark/results/blob/main/README.md)
I think so too. At the moment, it seems that some documents are outdated and cannot provide the appropriate guidance.
Hi @archersama, I saw that CoIR has been integrated into MTEB. Thanks for the awesome work! I wonder why the results on MTEB are quite different from those reported in the paper or on the leaderboard.
I think this is mostly because mteb adds titles to the prompt, but in the paper they don't include them.
Thanks, @Samoed. However, I noticed that most datasets in CoIR don't have titles. For example, in the 'apps' dataset, the title field is empty: https://huggingface.co/datasets/CoIR-Retrieval/apps-queries-corpus. Despite this, the performance improved from 21.33 in the paper to 23.46 in MTEB.
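For what it's worth, BEIR-style retrieval evaluation (which, as far as I know, mteb follows) usually concatenates the title and text fields when encoding corpus documents, so an empty title field should be a no-op. A minimal sketch of that convention, not mteb's exact code:

```python
# Minimal sketch of the usual BEIR-style document construction (illustrative,
# not mteb's exact implementation): title and text are joined, so an empty
# title leaves the document text unchanged.
def build_document(doc: dict) -> str:
    title = (doc.get("title") or "").strip()
    text = (doc.get("text") or "").strip()
    return f"{title} {text}".strip()

assert build_document({"title": "", "text": "def add(a, b): return a + b"}) == \
    "def add(a, b): return a + b"
```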
Just adding @monikernemo here as well.
This seems odd. @Samoed, any ideas about why this might happen?
Maybe the default task prompt affects the results; I don't know whether a prompt was used in the original paper.
Besides model prompts: due to resource constraints, we may have lowered the model precision when running the benchmarks for CoIR.
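The exact setup used for those runs isn't stated here. Purely as an illustration of how lowered precision might enter the picture, assuming a recent sentence-transformers version that forwards model_kwargs to transformers:

```python
# Illustrative only: one way reduced precision could be configured. Whether
# (and how) precision was actually lowered for the original CoIR runs is not
# stated in this thread.
import torch
from sentence_transformers import SentenceTransformer

model_fp16 = SentenceTransformer(
    "intfloat/e5-mistral-7b-instruct",
    model_kwargs={"torch_dtype": torch.float16},  # half precision to save memory
)
```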
Thanks @Samoed and @monikernemo for the explanation. Could you please share the prompt that you used? I tried my own prompt and evaluated COIR with MTEB using the E5-Mistral model. Results comparison for E5-Mistral:
The main discrepancy is with CoIR CodeSearchNet, where I observe more than a 20% decrease in performance. I don't understand why this is happening. Could you please share the prompt and settings that you used?
@Muennighoff added the results for e5-mistral, but I think he just ran mteb with the default prompt: https://github.com/embeddings-benchmark/mteb/blob/main/mteb/models/instructions.py#L17. You can add these prompts to mteb.
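As a point of reference, E5-Mistral's documented query format prepends an instruction in an "Instruct: ... \nQuery: ..." template. A small illustrative helper; the task description below is a made-up example, not the default mteb prompt linked above:

```python
# Illustrative helper for the E5-Mistral-style instruction format. The task
# description here is an assumed example; the actual default prompt used by
# mteb lives in the instructions.py file linked above.
def format_e5_query(task_description: str, query: str) -> str:
    return f"Instruct: {task_description}\nQuery: {query}"

print(format_e5_query(
    "Given a question about code, retrieve relevant code snippets",  # assumed wording
    "how to reverse a linked list in python",
))
```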
@Samoed Thank you for your efforts. I would like to ask whether there is a way to average the evaluation metrics of the subsets of CodeSearchNet and CodeSearchNet-CCR and then merge them into single CodeSearchNet and CodeSearchNet-CCR entries, similar to the COIR leaderboard.
There is one possible way, but it is a bit hacky and I don't know how it would work with the leaderboard. Currently, for CQADupstack, a special aggregation task is created when results are exported, but as far as I know this is not exported to the results repo directly. It might be possible to do something only on the leaderboard side, but I'm not sure.
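To make the idea concrete, here is a rough sketch of the kind of post-hoc averaging being discussed: read the per-subset scores from a result file and report their mean. The file path, key names, and layout are assumptions and depend on the mteb version:

```python
# Rough sketch of post-hoc averaging over language subsets. Assumptions: the
# result JSON has a "scores" mapping from split name to a list of per-subset
# entries, each carrying "ndcg_at_10"; real mteb output layouts vary by version.
import json
from statistics import mean

def average_ndcg(path: str, split: str = "test") -> float:
    with open(path) as f:
        result = json.load(f)
    subset_scores = [entry["ndcg_at_10"] for entry in result["scores"][split]]
    return mean(subset_scores)

# e.g. collapse the per-language subsets into one CodeSearchNet-style number
print(average_ndcg("results/e5-base-v2/CodeSearchNetCCRetrieval.json"))
```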
Perhaps we could consult with the professionals, simply ensuring that they are correctly displayed on the leaderboard as results for the two datasets. @Muennighoff @KennethEnevoldsen
Re: scores of E5-Mistral: E5-Mistral is run using its implementation within MTEB, and for the prompt it uses the default prompts (as @Samoed says), so everything should be reproducible. A good way to debug this is probably to test whether non-prompt models obtain the same performance (if not, it is probably the task implementation); see the sketch after this message. After that, we can examine the smaller e5-large-instruct and see whether the effect is due to the prompt.
We do have aggregation for CQADupstack, but this is for backward compatibility, and I would prefer not to add more datasets using the same format. I have suggested a potential approach to solving this issue here: embeddings-benchmark/mteb#1231
@Samoed is definitely a professional, but none of us has the full overview of everything that goes on in the code.
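A minimal sketch of that debugging step, assuming the standard mteb API; the models and task chosen below are examples only, not the exact configuration used for the leaderboard:

```python
# Minimal debugging sketch (models and task are examples): run a non-prompt
# model and an instruction model on the same CoIR task, then compare the
# resulting scores against the paper / leaderboard numbers.
import mteb
from sentence_transformers import SentenceTransformer

tasks = mteb.get_tasks(tasks=["AppsRetrieval"])
evaluation = mteb.MTEB(tasks=tasks)

for name in ["intfloat/e5-base-v2", "intfloat/e5-mistral-7b-instruct"]:
    model = SentenceTransformer(name)
    # Scores land in per-task JSON files under output_folder for comparison.
    evaluation.run(model, output_folder=f"results/{name.split('/')[-1]}")
```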
I ran the COIR benchmark with
Hi @Samoed, thanks for running the experiments. I tried the new prompt you mentioned here, but the results I got are still different from the MTEB leaderboard:
AppsRetrieval.json
Hi @Samoed, I see, that makes sense. However, the COIR paper mentions that their version of CodeSearchNet is focused on code-summary retrieval, which differs from the original CodeSearchNet. Should we follow the COIR paper and have the MTEB COIR leaderboard report the COIRCodeSearchNetRetrieval score rather than CodeSearchNetRetrieval? Could you also share your evaluation code? I noticed a gap between my results and your COIRCodeSearchNet score.
@Samoed Can you help with aggregating the subtasks of CodeSearchNet as well as CodeSearchNet-CCR into one, using the COIR format? That should help avoid confusion for everyone!
@archersama I found a way to implement this using the results folder, but it won't work with self-reported results. For a more general solution to your problem, I think you can comment in this issue: embeddings-benchmark/mteb#1231
I have updated the CoIR leaderboard scores on the MTEB benchmark. Additionally, I have executed the apps.py script to ensure the content is displayed correctly.