Update COIR default.jsonl #27
Conversation
Hmm, I am not too familiar with the leaderboard, but this seems like a manual input of the results (@Muennighoff). Shouldn't this be derived from the results repository?
I noticed that CoIR was incorrectly labeled with only two datasets, so I changed the label to include all ten datasets, and it now shows the correct ones. I also uploaded all the results manually. So should I upload the model performance to the results repository?
Yes. And after that you should delete
But shouldn't it be specified in the config and then be auto-updated? (@orionw, adding you here as well since you have more experience with the leaderboard.) Following the instructions for a leaderboard tab, this doesn't seem like the intended approach. It might be that the docs are outdated.
Thanks @KennethEnevoldsen! Yes, this file will get overwritten when the nightly CI runs. It is definitely possible that some docs are outdated, sorry about that!
@archersama, this should be done in the config that Kenneth referenced; let me know if you have any trouble with it!
Yes, uploading them to the results repository would be the way to go! Or, if the results are attached to models you control on HF, you can add them to their README and don't need to add them to the results repo. Again, let me know if you run into any issues, and we should probably add some docs on this. Thanks for raising the issue!
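For reference, a minimal sketch of the usual flow for producing the per-task result files that get contributed to the results repository. The model name, task choice, and output path below are only examples, not a prescription, and the exact file layout depends on the mteb version:

```python
# Minimal sketch (illustrative only): run an evaluation with mteb and collect the
# per-task JSON result files it writes; these files are what gets contributed to
# the results repository (or attached to a model card you control).
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-base-v2")       # example model
tasks = mteb.get_tasks(tasks=["AppsRetrieval"])          # example CoIR task
evaluation = mteb.MTEB(tasks=tasks)

# Result JSON files are written under `output_folder`.
evaluation.run(model, output_folder="results/e5-base-v2")
```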
All tasks are already included in the config, but I don't understand why the tab only contains 2 datasets, because there are some results in embeddings-benchmark/results@319e81d. I'll try to update the leaderboard (lines 509 to 529 in fab8bb0).
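For context, the referenced config block is presumably just the list of task names backing the CoIR tab. A hypothetical illustration only; the variable name is made up, and the task names follow my understanding of mteb's CoIR naming and should be checked against the task registry:

```python
# Hypothetical illustration (not the actual leaderboard source): a tab is
# presumably driven by a plain list of task names, so all ten CoIR datasets
# need to be listed here for them to show up.
TASK_LIST_COIR = [
    "AppsRetrieval",
    "CodeFeedbackMT",
    "CodeFeedbackST",
    "CodeSearchNetCCRetrieval",
    "COIRCodeSearchNetRetrieval",  # see the later discussion on CodeSearchNet naming
    "CodeTransOceanContest",
    "CodeTransOceanDL",
    "CosQA",
    "StackOverflowQA",
    "SyntheticText2SQL",
]
```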
Found an issue with the results repo: embeddings-benchmark/results#23
So the label error has been fixed? When will we see the new dataset labels?
There was an update 6 hours ago. @orionw, any idea where the bug could be here?
I don't think this is a bug, because the results can't be updated correctly: embeddings-benchmark/results#25
Thanks @Samoed for taking the time for this
The CoIR leaderboard error hasn't been fixed yet; is something wrong?
PR still under review
Thanks @Samoed for taking the time for this. I see this PR, embeddings-benchmark/results#25, has been merged. So the CoIR leaderboard can now be displayed normally?
Yes, I'm working on fixing some bugs in another PR and will update the leaderboard there. It will have all results from the results repo.
Maybe to prevent confusion in the future, we should add some documentation/instruction on how the results & leaderboard work and interoperate. It seems like @Samoed understands it really well - if you want to, it'd be amazing if you added some explanations to the READMEs to help others contribute more easily (https://github.com/embeddings-benchmark/leaderboard/blob/main/README.md & https://github.com/embeddings-benchmark/results/blob/main/README.md)
I think so too. At the moment, it seems that some documents are outdated and cannot provide the appropriate guidance.
Hi @archersama, I saw that CoIR has been integrated into MTEB. Thanks for the awesome work! I wonder why the results on MTEB are quite different from those reported in the paper or on the leaderboard.
I think this is mostly because mteb adds titles to the prompt, but in the paper they don't include them.
Thanks, @Samoed. However, I noticed that most datasets in CoIR don't have titles. For example, in the 'apps' dataset, the title field is empty: https://huggingface.co/datasets/CoIR-Retrieval/apps-queries-corpus. Despite this, the performance improved from 21.33 in the paper to 23.46 in MTEB.
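For what it's worth, BEIR-style retrieval evaluation (which, as far as I know, mteb follows) usually concatenates the title and text fields when encoding corpus documents, so an empty title field should be a no-op. A minimal sketch of that convention, not mteb's exact code:

```python
# Minimal sketch of the usual BEIR-style document construction (illustrative,
# not mteb's exact implementation): title and text are joined, so an empty
# title leaves the document text unchanged.
def build_document(doc: dict) -> str:
    title = (doc.get("title") or "").strip()
    text = (doc.get("text") or "").strip()
    return f"{title} {text}".strip()

assert build_document({"title": "", "text": "def add(a, b): return a + b"}) == \
    "def add(a, b): return a + b"
```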
Just adding @monikernemo here as well.
This seems odd. @Samoed, any ideas about why this might happen?
Maybe the default task prompt affects the results; I don't know whether a prompt was used in the original paper.
Besides model prompts: due to resource constraints, we may have lowered the model precision when running the benchmarks for CoIR.
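The exact setup used for those runs isn't stated here. Purely as an illustration of how lowered precision might enter the picture, assuming a recent sentence-transformers version that forwards model_kwargs to transformers:

```python
# Illustrative only: one way reduced precision could be configured. Whether
# (and how) precision was actually lowered for the original CoIR runs is not
# stated in this thread.
import torch
from sentence_transformers import SentenceTransformer

model_fp16 = SentenceTransformer(
    "intfloat/e5-mistral-7b-instruct",
    model_kwargs={"torch_dtype": torch.float16},  # half precision to save memory
)
```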
Thanks @Samoed and @monikernemo for the explanation. Could you please share the prompt that you used? I tried my own prompt and evaluated COIR with MTEB using the E5-Mistral model. Results comparison for E5-Mistral:
The main discrepancy is with CoIR CodeSearchNet, where I observe more than a 20% decrease in performance. I don't understand why this is happening. Could you please share the prompt and settings that you used?
@Muennighoff added the results for e5-mistral, but I think he just ran mteb with the default prompt: https://github.com/embeddings-benchmark/mteb/blob/main/mteb/models/instructions.py#L17. You can add these prompts to mteb.
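As a point of reference, E5-Mistral's documented query format prepends an instruction in an "Instruct: ... \nQuery: ..." template. A small illustrative helper; the task description below is a made-up example, not the default mteb prompt linked above:

```python
# Illustrative helper for the E5-Mistral-style instruction format. The task
# description here is an assumed example; the actual default prompt used by
# mteb lives in the instructions.py file linked above.
def format_e5_query(task_description: str, query: str) -> str:
    return f"Instruct: {task_description}\nQuery: {query}"

print(format_e5_query(
    "Given a question about code, retrieve relevant code snippets",  # assumed wording
    "how to reverse a linked list in python",
))
```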
@Samoed Thank you for your efforts. I would like to ask whether there is a way to average the evaluation metrics of the subsets of CodeSearchNet and CodeSearchNet-CCR and then merge them into single CodeSearchNet and CodeSearchNet-CCR entries, similar to the COIR leaderboard.
There is one possible way, but it is a bit hacky and I don't know how it would work with the leaderboard. Currently, for CQADupstack, a special aggregation task is created when results are exported, but as far as I know this is not exported to the results repo directly. It might be possible to do something only on the leaderboard side, but I'm not sure.
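To make the idea concrete, here is a rough sketch of the kind of post-hoc averaging being discussed: read the per-subset scores from a result file and report their mean. The file path, key names, and layout are assumptions and depend on the mteb version:

```python
# Rough sketch of post-hoc averaging over language subsets. Assumptions: the
# result JSON has a "scores" mapping from split name to a list of per-subset
# entries, each carrying "ndcg_at_10"; real mteb output layouts vary by version.
import json
from statistics import mean

def average_ndcg(path: str, split: str = "test") -> float:
    with open(path) as f:
        result = json.load(f)
    subset_scores = [entry["ndcg_at_10"] for entry in result["scores"][split]]
    return mean(subset_scores)

# e.g. collapse the per-language subsets into one CodeSearchNet-style number
print(average_ndcg("results/e5-base-v2/CodeSearchNetCCRetrieval.json"))
```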
Perhaps we could consult with the professionals, simply ensuring that they are correctly displayed on the leaderboard as results for the two datasets. @Muennighoff @KennethEnevoldsen
Re: scores of E5-Mistral: E5-Mistral is run using its implementation within MTEB, and for the prompt it uses the default prompts (as @Samoed says), so everything should be reproducible. A good way to debug this is probably to test whether non-prompt models obtain the same performance (if not, it is probably the task implementation); see the sketch after this message. After that, we can examine the smaller e5-large-instruct and see whether the effect is due to the prompt.
We do have aggregation for CQADupstack, but this is for backward compatibility, and I would prefer not to add more datasets using the same format. I have suggested a potential approach to solving this issue here: embeddings-benchmark/mteb#1231
@Samoed is definitely a professional, but none of us has the full overview of everything that goes on in the code.
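A minimal sketch of that debugging step, assuming the standard mteb API; the models and task chosen below are examples only, not the exact configuration used for the leaderboard:

```python
# Minimal debugging sketch (models and task are examples): run a non-prompt
# model and an instruction model on the same CoIR task, then compare the
# resulting scores against the paper / leaderboard numbers.
import mteb
from sentence_transformers import SentenceTransformer

tasks = mteb.get_tasks(tasks=["AppsRetrieval"])
evaluation = mteb.MTEB(tasks=tasks)

for name in ["intfloat/e5-base-v2", "intfloat/e5-mistral-7b-instruct"]:
    model = SentenceTransformer(name)
    # Scores land in per-task JSON files under output_folder for comparison.
    evaluation.run(model, output_folder=f"results/{name.split('/')[-1]}")
```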
I ran the COIR benchmark with
Hi @Samoed, thanks for running the experiments. I tried the new prompt you mentioned here, but the results I got are still different from the MTEB leaderboard:
AppsRetrieval.json
Hi @Samoed, I see, that makes sense. However, the COIR paper mentions that their version of CodeSearchNet is focused on code-summary retrieval, which differs from the original CodeSearchNet. Should we follow the COIR paper and have the MTEB COIR leaderboard report the COIRCodeSearchNetRetrieval score rather than CodeSearchNetRetrieval? Could you also share your evaluation code? I noticed a gap between my results and your COIRCodeSearchNet score.
@Samoed Can you help with aggregating the subtasks of CodeSearchNet as well as CodeSearchNet-CCR into one, using the COIR format? That should help avoid confusion for everyone!
@archersama I found a way to implement this using the results folder, but it won't work with self-reported results. For a more general solution to your problem, I think you can comment in this issue: embeddings-benchmark/mteb#1231
I have updated the CoIR leaderboard scores on the MTEB benchmark. Additionally, I have executed the apps.py script to ensure the content is displayed correctly.