In this work, we find that while rankers have seen impressive performance improvements in recent years, there is still a significant number of queries that cannot be addressed by any of the state-of-the-art neural rankers. We refer to these queries as obstinate queries because of their difficulty. This means that, regardless of the neural ranker, these queries see no performance improvement, and the increases in overall performance reported by rankers are due to improvements on other subsets of queries. We believe that careful treatment of these queries will lead to more stable and consistent performance of neural rankers across all queries.
Please find more details in our paper: MSMarco Chameleons: Challenging the MSMarco Leaderboard with Extremely Obstinate Queries (CIKM 2021).
We investigate the performance of state-of-the-art (SOTA) rankers on the MSMARCO small dev set, which contains 6,980 queries. We noticed that no matter which baseline method is considered, whether a traditional BM25 ranker or a complex neural ranker, there is a noticeable number of queries for which the rankers are unable to return any reasonable ranking. Furthermore, a noticeable number of these poorly performing queries are shared across the rankers. Table 1 reports the performance on the 'difficult' queries, i.e., those that fall within the bottom 50% of per-query performance for each baseline and are common to 4, 5, and 6 of the SOTA rankers.
Table 1: MAP performance of the rankers on the 50% hardest queries of the Chameleon datasets.

| Variations | Dataset Name | Number of Queries | BM25 | DeepCT | DocT5Query | RepBert | ANCE | TCT-ColBert |
|---|---|---|---|---|---|---|---|---|
| Common in 6 rankers | Lesser Chameleon | 1693 | 0.0066 (Run) | 0.0122 (Run) | 0.0185 (Run) | 0.0212 (Run) | 0.0286 (Run) | 0.0267 (Run) |
| Common in 5 rankers | Pygmy Chameleon | 2473 | 0.0215 (Run) | 0.0240 (Run) | 0.0403 (Run) | 0.0398 (Run) | 0.0546 (Run) | 0.0462 (Run) |
| Common in 4 rankers | Veiled Chameleon | 3119 | 0.0392 (Run) | 0.0400 (Run) | 0.0660 (Run) | 0.0560 (Run) | 0.0847 (Run) | 0.0780 (Run) |
We have made all the runs available on the Chameleons Google Drive.
You can find the implementation details of each method here.
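For reference, the following is a minimal sketch of how the common hardest queries could be recovered from the released runs. It assumes TREC-format run files and the MS MARCO dev qrels and uses the pytrec_eval library; the file names are placeholders rather than the exact names in the Google Drive, and the paper's exact selection procedure may differ in detail.

```python
# Sketch: identify the hardest 50% of queries per ranker and intersect the sets.
# File names below are placeholders, not the exact names used in the release.
import pytrec_eval

def load_run(path):
    """Load a TREC-format run file: qid Q0 docid rank score tag."""
    run = {}
    with open(path) as f:
        for line in f:
            qid, _, docid, _, score, _ = line.split()
            run.setdefault(qid, {})[docid] = float(score)
    return run

def load_qrels(path):
    """Load qrels in the MS MARCO format: qid 0 docid rel."""
    qrels = {}
    with open(path) as f:
        for line in f:
            qid, _, docid, rel = line.split()
            qrels.setdefault(qid, {})[docid] = int(rel)
    return qrels

qrels = load_qrels('qrels.dev.small.tsv')  # placeholder path
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {'map'})

hard_sets = []
for run_path in ['bm25.run', 'deepct.run', 'doct5query.run',
                 'repbert.run', 'ance.run', 'tct_colbert.run']:  # placeholder names
    per_query = evaluator.evaluate(load_run(run_path))
    ranked = sorted(per_query, key=lambda q: per_query[q]['map'])
    hard_sets.append(set(ranked[:len(ranked) // 2]))  # bottom 50% by MAP

# Queries in the bottom half for all six rankers (the "Lesser Chameleon" subset)
common_in_all = set.intersection(*hard_sets)
print(len(common_in_all))
```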
Furthermore, given that the literature has reported that hard queries are often due to issues such as vocabulary mismatch, and hence can be improved through query reformulation, we report the performance of several strong query reformulation techniques on the MSMarco Chameleons dataset and show that such queries remain stubborn and do not show noticeable performance improvements even after systematic reformulation.
The expanded queries, which were generated using the ReQue toolkit, can be found here.
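As an illustration, the sketch below shows how one might retrieve with the expanded queries using BM25 over the prebuilt MS MARCO passage index in Pyserini and write a TREC-format run for later evaluation. The expanded-query TSV layout and the file names are assumptions for illustration, not necessarily the exact ReQue output format.

```python
# Sketch: run BM25 over expanded queries and write a TREC-format run.
# Assumes a TSV of "qid<TAB>expanded query text"; paths are placeholders.
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher.from_prebuilt_index('msmarco-v1-passage')

with open('expanded_queries.tsv') as f_in, open('bm25_expanded.run', 'w') as f_out:
    for line in f_in:
        qid, expanded_query = line.rstrip('\n').split('\t')
        hits = searcher.search(expanded_query, k=1000)
        for rank, hit in enumerate(hits, start=1):
            f_out.write(f'{qid} Q0 {hit.docid} {rank} {hit.score} bm25-expanded\n')
```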
Table 2: MAP of the query reformulation methods on the 50% hardest queries of the Chameleon datasets.

| Category | Method | BM25 | DeepCT | DocT5 | RepBert | ANCE | TCT-ColBert |
|---|---|---|---|---|---|---|---|
| Pseudo-Relevance Feedback | Relevance feedback | 0.0477 (query) | 0.0574 (query) | 0.0566 (query) | 0.0513 (query) | 0.0277 (query) | 0.0693 (query) |
| | RM3 | 0.0407 (query) | 0.0375 (query) | 0.0603 (query) | 0.0459 (query) | 0.0374 (query) | 0.0610 (query) |
| | Document clustering | 0.0392 (query) | 0.0393 (query) | 0.0593 (query) | 0.0550 (query) | 0.0609 (query) | 0.0765 (query) |
| | Term clustering | 0.0412 (query) | 0.0424 (query) | 0.0567 (query) | 0.0557 (query) | 0.0693 (query) | 0.0724 (query) |
| External Sources | Neural Embeddings (query) | 0.0218 | 0.0248 | 0.0285 | 0.0409 | 0.0468 | 0.0462 |
| | Wikipedia (query) | 0.0277 | 0.0313 | 0.0341 | 0.0368 | 0.0466 | 0.0396 |
| | Thesaurus (query) | 0.0277 | 0.0313 | 0.0341 | 0.0368 | 0.0466 | 0.0396 |
| | Entity Linking (query) | 0.0399 | 0.0450 | 0.0543 | 0.0507 | 0.0533 | 0.0649 |
| | Sense Disambiguation (query) | 0.0359 | 0.0360 | 0.0521 | 0.0512 | 0.0653 | 0.0633 |
| | ConceptNet (query) | 0.0269 | 0.0278 | 0.0342 | 0.0369 | 0.0488 | 0.0442 |
| | WordNet (query) | 0.0271 | 0.0569 | 0.0346 | 0.0359 | 0.0399 | 0.0406 |
| Supervised Approaches | ANMT (Seq2Seq) (query) | 0.0002 | 0.0007 | 0.0010 | 0.0020 | 0.0046 | 0.0066 |
| | ACG (Seq2Seq + Attention) (query) | 0.0240 | 0.0307 | 0.0359 | 0.0433 | 0.0450 | 0.0470 |
| | HRED-qs (query) | 0.0060 | 0.0020 | 0.0030 | 0.0060 | 0.0082 | 0.0110 |
It should be noted that the pseudo-relevance feedback-based query expansion methods produce different expansions for each run, since they depend on the initial round of retrieval. For the other methods, the expanded queries are the same across rankers.
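To make this distinction concrete, here is a minimal sketch of the pseudo-relevance feedback case using BM25 + RM3 in Pyserini: the expansion is derived from the first-round retrieval, so a different underlying ranker would yield different expanded queries and hence different runs. RM3 parameters are left at the library defaults, which may vary across Pyserini versions; the example query is only illustrative.

```python
# Sketch: BM25 with RM3 pseudo-relevance feedback over the MS MARCO passage index.
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher.from_prebuilt_index('msmarco-v1-passage')
searcher.set_rm3()  # enable RM3 expansion derived from the initial BM25 pass

hits = searcher.search('what is the definition of obstinate', k=10)
for rank, hit in enumerate(hits, start=1):
    print(rank, hit.docid, round(hit.score, 4))
```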
Please cite our work as:
@inproceedings{arabzadehcikm2021-3,
author = {Negar Arabzadeh and Bhaskar Mitra and Ebrahim Bagheri},
title = {MSMarco Chameleons: Challenging the MSMarco Leaderboard with Extremely Obstinate Queries},
booktitle = {The 30th ACM Conference on Information and Knowledge Management (CIKM 2021)},
year = {2021}
}
Negar Arabzadeh, Bhaskar Mitra and Ebrahim Bagheri
Laboratory for Systems, Software and Semantics (LS3), Ryerson University, ON, Canada.