Leaderboard 2.0: Missing results #1571

x-tabdeveloping · 2024-12-09T11:19:58Z

Some pretty essential results seem to be missing from the new leaderboard.
Here's a list of things that we should probably fix before releasing the leaderboard:

MTEB(Multilingual)

I have only looked into models we promised to run in the review response, the problems might be more widespread

MTEB(eng, classic)

Problematic tasks:

CQADupstackProgrammersRetrieval (almost all models have nans)

Problematic models:

x-tabdeveloping · 2024-12-09T11:36:16Z

@Muennighoff Anything pops out to you as something that has been run before but is missing? Also, can you run some or all of these?

x-tabdeveloping · 2024-12-09T11:36:38Z

There are also some problematic tasks in MTEB(deu) and MTEB(fra), I will look into those too

x-tabdeveloping · 2024-12-09T12:41:09Z

MTEB(fra)

Quite a few individual results are missing for a lot of models, maybe something wrong with the scraping or data loading.

Problematic tasks:

MLSUMClusteringP2P almost all models missing

MTEB(deu)

Problematic tasks:

TenKGnadClusteringP2P **almost all models missing **

In both benchmarks we're missing some tasks from 7B models.

x-tabdeveloping · 2024-12-09T12:46:55Z

MTEB(eng, beta)

Problematic tasks:

HotpotQAHardNegatives majority of models missing
MedrxivClusteringS2S.v2 most notably: stella, gte, SFR, GritLM8x7b, bge, UAE
StackExchangeClusteringP2P.v2 almost all missing
SummEvalSummarization.v2 almost all missing
TwentyNewsgroupsClustering.v2 almost all missing

x-tabdeveloping · 2024-12-09T12:47:23Z

@Samoed @Muennighoff @KennethEnevoldsen I would really appreciate your help investigating and fixing these issues

Muennighoff · 2024-12-09T22:15:21Z

Great overview! Ofc can run anything that is missing if the model is loadable via mteb and ideally it's a python list of task & model pairs like here

mteb/scripts/running_model/create_slurm_jobs.py

Line 76 in e605c7b

model_names = [

x-tabdeveloping · 2024-12-10T10:29:51Z

I'll compile you a list

KennethEnevoldsen · 2024-12-10T19:00:27Z

we might also include this in the list: embeddings-benchmark/results#65 (review)

As a side note, we would like to run these models [list in issue] on the remaining tasks in the MTEB(Medical) benchmark. However, we initially held off due to API cost constraints. Do you have access to credits with these providers that we could use for this purpose? Alternatively, would it be possible for you to run them on your side?

(won't have too much time to look into this as I am at the neurips conference, but I will take a closer look once I get back)

x-tabdeveloping · 2024-12-16T10:02:36Z

Well, I have compiled a list of all task results that are missing for all benchmarks in the new leaderboard + all models that already show up in the leaderboard (have metadata and don't miss all results on a benchmark).
The list is very long, but luckily the most important models usually only miss one or two things. There also seem to be patterns, I'm wondering if it's got something to do with our versioning scheme for tasks.

I used the following script to get the missing results:

import json
from pathlib import Path

import pandas as pd
from tqdm import tqdm

import mteb
from mteb.leaderboard.table import scores_to_tables

benchmarks = mteb.get_benchmarks()

all_results = mteb.load_results()

results = {
    benchmark.name: benchmark.load_results(base_results=all_results)
    .join_revisions()
    .filter_models()
    for benchmark in tqdm(benchmarks, desc="Loading all benchmark results")
}


def to_pandas(gr_df) -> pd.DataFrame:
    cols = gr_df.value["headers"]
    data = gr_df.value["data"]
    return pd.DataFrame(data, columns=cols)


all_task_tables = {
    name: to_pandas(scores_to_tables(res.get_scores(format="long"))[1]).set_index(
        "Model"
    )
    for (name, res) in results.items()
}

missing_results = {}
for bench_name, table in all_task_tables.items():
    missing_results[bench_name] = {}
    for model_name, model_res in table.iterrows():
        nas = model_res.loc[model_res.isna()].index.to_list()
        if nas:
            missing_results[bench_name][model_name] = nas

And got this file:
missing_results.json

In the following format:

{
    "<benchmark_name>": {"<model_name>": ["task_name1", "task_name2", ...]}
}

We'll probably have to prioritize some models over others

x-tabdeveloping · 2024-12-16T13:24:33Z

Here it is for only the top 50 models, this should be a bit more reasonable to run:
missing_important.json

And here it is as a list of model-task pairs, as requested @Muennighoff :
missing_model_task_list_important.json

KennethEnevoldsen · 2024-12-22T22:15:39Z

(see embeddings-benchmark/results#80)

dbuades mentioned this issue Dec 10, 2024

feat: evaluate openai, cohere and google models in CUREv1 embeddings-benchmark/results#65

Merged

2 tasks

isaac-chung added the leaderboard issues related to the leaderboard label Dec 18, 2024

Samoed mentioned this issue Dec 21, 2024

fix: external download embeddings-benchmark/results#80

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Leaderboard 2.0: Missing results #1571

Leaderboard 2.0: Missing results #1571

x-tabdeveloping commented Dec 9, 2024

x-tabdeveloping commented Dec 9, 2024

x-tabdeveloping commented Dec 9, 2024

x-tabdeveloping commented Dec 9, 2024

x-tabdeveloping commented Dec 9, 2024

x-tabdeveloping commented Dec 9, 2024

Muennighoff commented Dec 9, 2024

x-tabdeveloping commented Dec 10, 2024

KennethEnevoldsen commented Dec 10, 2024

x-tabdeveloping commented Dec 16, 2024

x-tabdeveloping commented Dec 16, 2024

KennethEnevoldsen commented Dec 22, 2024

Leaderboard 2.0: Missing results #1571

Leaderboard 2.0: Missing results #1571

Comments

x-tabdeveloping commented Dec 9, 2024

MTEB(Multilingual)

MTEB(eng, classic)

Problematic tasks:

Problematic models:

x-tabdeveloping commented Dec 9, 2024

x-tabdeveloping commented Dec 9, 2024

x-tabdeveloping commented Dec 9, 2024

MTEB(fra)

Problematic tasks:

MTEB(deu)

Problematic tasks:

x-tabdeveloping commented Dec 9, 2024

MTEB(eng, beta)

Problematic tasks:

x-tabdeveloping commented Dec 9, 2024

Muennighoff commented Dec 9, 2024

x-tabdeveloping commented Dec 10, 2024

KennethEnevoldsen commented Dec 10, 2024

x-tabdeveloping commented Dec 16, 2024

x-tabdeveloping commented Dec 16, 2024

KennethEnevoldsen commented Dec 22, 2024