A curated list of awesome papers related to pre-trained models for information retrieval (a.k.a. pretraining for IR). If there are any papers I missed, please let me know! Any feedback and contributions are welcome!
We also include recent multimodal pre-training works whose pre-trained models are fine-tuned on cross-modal retrieval tasks, such as text-image retrieval, in their experiments.
For readers who want to acquire basic and advanced knowledge about neural models for information retrieval and to try some neural models by hand, we recommend the awesome NeuIR surveys below and the text-matching toolkit MatchZoo-py:
- A Deep Look into Neural Ranking Models for Information Retrieval. Jiafeng Guo et.al.
- Pretrained Transformers for Text Ranking: BERT and Beyond. Jimmy Lin et.al.
- Semantic Models for the First-stage Retrieval: A Comprehensive Review. Yinqiong Cai et.al.
- Context-Aware Term Weighting For First Stage Passage Retrieval. Zhuyun Dai et.al. SIGIR 2020 short. [code] (DeepCT)
- Context-Aware Document Term Weighting for Ad-Hoc Search. Zhuyun Dai et.al. WWW 2020. [code] (HDCT)
- COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List. Luyu Gao et.al. NAACL 2021. [code] (COIL)
- Learning Passage Impacts for Inverted Indexes. Antonio Mallia et.al. SIGIR 2021 short. [code] (DeepImpact; see the term-weight scoring sketch below)
- Document Expansion by Query Prediction. Rodrigo Nogueira et.al. Arxiv 2019. [doc2query code, docTTTTTquery code] (doc2query, docTTTTTquery)
- SparTerm: Learning Term-based Sparse Representation for Fast Text Retrieval. Yang Bai, Xiaoguang Li et.al. Arxiv 2020. (SparTerm: Term importance distribution from MLM+Binary Term Gating)
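The DeepCT/HDCT/DeepImpact line above keeps an ordinary inverted index but replaces frequency-based term weights with weights predicted by a contextual encoder. Below is a minimal sketch of the scoring side of that idea, assuming the per-term impact weights have already been predicted offline; the `index_document`/`score` helpers and the toy weights are illustrative, not any paper's actual API.

```python
from collections import defaultdict

# Toy "inverted index": term -> list of (doc_id, learned impact weight).
# In DeepCT/DeepImpact-style systems these weights come from a contextual
# encoder run offline over each document, then stored in the index.
inverted_index = defaultdict(list)

def index_document(doc_id, doc_term_weights):
    """doc_term_weights: dict mapping term -> model-predicted impact."""
    for term, weight in doc_term_weights.items():
        inverted_index[term].append((doc_id, weight))

def score(query_terms):
    """Sum learned impacts over the query-document term overlap (bag of words)."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in inverted_index[term]:
            scores[doc_id] += weight
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

index_document("d1", {"neural": 2.3, "retrieval": 1.7})
index_document("d2", {"sparse": 1.1, "retrieval": 0.4})
print(score(["neural", "retrieval"]))  # d1 should rank first
```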
- ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. Omar Khattab et.al. SIGIR 2020. [code] (ColBERT; see the MaxSim sketch below)
- Efficient Document Re-Ranking for Transformers by Precomputing Term Representations. Sean MacAvaney et.al. SIGIR 2020. [code] (PreTTR)
- Poly-encoders: Architectures and pre-training strategies for fast and accurate multi-sentence scoring. Samuel Humeau, Kurt Shuster et.al. ICLR 2020. [code] (Poly-encoders)
- Modularized Transformer-based Ranking Framework. Luyu Gao et.al. EMNLP 2020. [code] (MORES, similar to Poly-encoders)
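ColBERT's late interaction keeps one embedding per token on both the query and document side and scores a pair with the MaxSim operator: for each query token, take its maximum similarity over all document tokens, then sum. Here is a minimal numpy sketch of that operator; the random vectors stand in for real BERT token embeddings and this is not the official implementation.

```python
import numpy as np

def maxsim_score(q_emb, d_emb):
    """q_emb: (num_query_tokens, dim), d_emb: (num_doc_tokens, dim).
    Late interaction: for each query token take its best-matching
    document token, then sum those maxima."""
    # Normalize so the dot product is cosine similarity.
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    d = d_emb / np.linalg.norm(d_emb, axis=1, keepdims=True)
    sim = q @ d.T                 # (num_query_tokens, num_doc_tokens)
    return sim.max(axis=1).sum()  # MaxSim per query token, then sum

rng = np.random.default_rng(0)
query_tokens = rng.normal(size=(8, 128))   # stand-ins for BERT outputs
doc_tokens = rng.normal(size=(180, 128))
print(maxsim_score(query_tokens, doc_tokens))
```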
- RepBERT: Contextualized Text Embeddings for First-Stage Retrieval. Jingtao Zhan et.al. Arxiv 2020. [code] (RepBERT, in-batch negatives; see the loss sketch below)
- Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. Lee Xiong, Chenyan Xiong et.al. ICLR 2021. [code] (ANCE, refresh the ANN index during training)
- RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering. Yingqi Qu et.al. Arxiv 2020. (RocketQA: cross-batch negatives, denoised hard negatives and data augmentation)
- Optimizing Dense Retrieval Model Training with Hard Negatives. Jingtao Zhan et.al. SIGIR 2021. [code] (ADORE & STAR, query-side fine-tuning built on pre-trained document encoders)
- Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling. Sebastian Hofstätter et.al. SIGIR 2021. [code] (TAS-Balanced, sample from query clusters and distill from a BERT ensemble)
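Most of the dense retrievers above are bi-encoders trained with a softmax contrastive loss over in-batch negatives: each query's positive passage is contrasted against every other passage in the batch (RocketQA extends this to cross-batch negatives; ANCE and STAR swap in harder negatives). A minimal numpy sketch of that loss follows, with random vectors standing in for encoder outputs.

```python
import numpy as np

def in_batch_negative_loss(q_emb, p_emb):
    """q_emb, p_emb: (batch, dim); p_emb[i] is the positive for q_emb[i].
    Every other passage in the batch acts as a negative."""
    scores = q_emb @ p_emb.T                      # (batch, batch)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    log_softmax = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # Cross-entropy with the diagonal (the true positives) as targets.
    return -np.mean(np.diag(log_softmax))

rng = np.random.default_rng(0)
queries = rng.normal(size=(16, 768))    # placeholder query embeddings
passages = rng.normal(size=(16, 768))   # placeholder positive passages
print(in_batch_negative_loss(queries, passages))
```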
- Distilling Knowledge from Reader to Retriever for Question Answering. Gautier Izacard, Edouard Grave. ICLR 2021. [unofficial code] (Distill the reader's cross-attention scores into the retriever)
- Distilling Knowledge for Fast Retrieval-based Chat-bots. Amir Vakili Tahami et.al. SIGIR 2020. [code] (Distill from cross-encoders to bi-encoders)
- Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation. Sebastian Hofstätter et.al. Arxiv 2020. [code] (Margin-MSE distillation from a BERT cross-encoder ensemble; see the sketch below)
- Distilling Dense Representations for Ranking using Tightly-Coupled Teachers. Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin. Arxiv 2020. [code] (TCTColBERT: distill from ColBERT)
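The distillation entries above mostly train a cheap bi-encoder student to mimic an expensive cross-encoder teacher. The cross-architecture recipe of Hofstätter et al. uses a Margin-MSE objective: match the teacher's score margin between a positive and a negative passage rather than its raw scores, which live on a different scale. Below is a minimal sketch of that objective, assuming the per-pair scores have already been computed; the random scores are placeholders.

```python
import numpy as np

def margin_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    """Margin-MSE distillation: the student should reproduce the teacher's
    score gap between the positive and the negative passage, not the raw
    scores themselves (which differ in scale across architectures)."""
    student_margin = student_pos - student_neg
    teacher_margin = teacher_pos - teacher_neg
    return np.mean((student_margin - teacher_margin) ** 2)

# Placeholder scores for a batch of (query, positive, negative) triples.
rng = np.random.default_rng(0)
print(margin_mse(rng.normal(size=8), rng.normal(size=8),
                 rng.normal(size=8), rng.normal(size=8)))
```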
- Latent Retrieval for Weakly Supervised Open Domain Question Answering. Kenton Lee et.al. ACL 2019. [code] (ORQA, ICT)
- Pre-training tasks for embedding-based large scale retrieval. Wei-Cheng Chang et.al. ICLR 2020. (ICT, BFS and WLP)
- REALM: Retrieval-Augmented Language Model Pre-Training. Kelvin Guu, Kenton Lee et.al. ICML 2020. [code] (REALM)
- Joint Learning of Deep Retrieval Model and Product Quantization based Embedding Index. Han Zhang et.al. SIGIR 2021 short. [code] (Poeem)
- Jointly Optimizing Query Encoder and Product Quantization to Improve Retrieval Performance. Jingtao Zhan et.al. CIKM 2021. [code] (JPQ)
- Efficient Passage Retrieval with Hashing for Open-domain Question Answering. Ikuya Yamada et.al. ACL 2021. [code] (BPR, convert embedding vector to binary codes)
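BPR above compresses dense passage embeddings into binary codes so that candidate generation can run on Hamming distance instead of float inner products. Below is a minimal sketch of the inference-time idea using simple sign-based binarization; the learned hashing layer and the two-stage rescoring from the paper are omitted.

```python
import numpy as np

def binarize(emb):
    """Map float embeddings to {0, 1} codes via the sign of each dimension."""
    return (emb > 0).astype(np.uint8)

def hamming_rank(query_code, passage_codes):
    """Rank passages by Hamming distance to the query code (smaller = closer)."""
    distances = (query_code ^ passage_codes).sum(axis=1)
    return np.argsort(distances)

rng = np.random.default_rng(0)
passage_embs = rng.normal(size=(1000, 768))                 # placeholder dense embeddings
query_emb = passage_embs[42] + 0.1 * rng.normal(size=768)   # query near passage 42
ranking = hamming_rank(binarize(query_emb), binarize(passage_embs))
print(ranking[:5])  # passage 42 should appear near the top
```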
- Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index. Minjoon Seo, Jinhyuk Lee et.al. ACL 2019. [code] (DENSPI)
- Dense Passage Retrieval for Open-Domain Question Answering. Vladimir Karpukhin, Barlas Oguz et.al. EMNLP 2020. [code] (DPR)
- Contextualized Sparse Representations for Real-Time Open-Domain Question Answering. Jinhyuk Lee, Minjoon Seo et.al. ACL 2020. [code] (SPARC, sparse vectors)
- DC-BERT: Decoupling Question and Document for Efficient Contextual Encoding. Yuyu Zhang, Ping Nie et.al. SIGIR 2020 short. (DC-BERT)
- Learning Dense Representations of Phrases at Scale. Jinhyuk Lee, Danqi Chen et.al. ACL 2021. [code] (DensePhrases)
- Passage Re-ranking with BERT. Rodrigo Nogueira et.al. Arxiv 2019. [code] (monoBERT: arguably the first work applying BERT to IR)
- Multi-Stage Document Ranking with BERT, The Expando-Mono-Duo Design Pattern for Text Ranking with Pretrained Sequence-to-Sequence Models. Rodrigo Nogueira et.al. Arxiv 2020. (Expando-Mono-Duo: doc2query+pointwise+pairwise)
- CEDR: Contextualized Embeddings for Document Ranking. Sean MacAvaney et.al. SIGIR 2019 short. [code] (CEDR: BERT + NeuIR models)
- Deeper Text Understanding for IR with Contextual Neural Language Modeling. Zhuyun Dai et.al. SIGIR 2019 short. [code] (BERT-MaxP, BERT-FirstP, BERT-SumP: passage-level scoring; see the aggregation sketch below)
- Simple Applications of BERT for Ad Hoc Document Retrieval; Applying BERT to Document Retrieval with Birch; Cross-Domain Modeling of Sentence-Level Evidence for Document Retrieval. Wei Yang, Haotian Zhang et.al. Arxiv 2019; Zeynep Akkalyoncu Yilmaz et.al. EMNLP 2019 short. [code] (Birch: sentence-level evidence)
- Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching. Liu Yang et.al. CIKM 2020. [code] (SMITH for doc2doc matching)
- Leveraging Passage-level Cumulative Gain for Document Ranking. Zhijing Wu et.al. WWW 2020. (PCGM)
- PARADE: Passage Representation Aggregation for Document Reranking. Canjia Li et.al. Arxiv 2020. [code] (An extensive comparison of various Passage Representation Aggregation methods)
- Intra-Document Cascading: Learning to Select Passages for Neural Document Ranking. Sebastian Hofstätter et.al. SIGIR 2021. [code] (IDCM: distill the ranking model into a lightweight conv-KNRM that selects the top-k passages)
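Because BERT-style rerankers cap input length, the long-document papers above typically score a document through its passages and then aggregate. Here is a minimal sketch of the simple aggregation strategies from BERT-MaxP/FirstP/SumP, later revisited by PARADE and IDCM; the sliding-window parameters and the per-passage scorer are illustrative placeholders, not any paper's real model.

```python
def split_into_passages(doc_tokens, passage_len=150, stride=75):
    """Slide a window over the document, as is common for BERT rerankers."""
    return [doc_tokens[i:i + passage_len]
            for i in range(0, max(len(doc_tokens) - stride, 1), stride)]

def score_document(query, doc_tokens, passage_scorer, agg="max"):
    """Aggregate per-passage relevance scores into one document score."""
    scores = [passage_scorer(query, p) for p in split_into_passages(doc_tokens)]
    if agg == "max":       # BERT-MaxP
        return max(scores)
    if agg == "first":     # BERT-FirstP
        return scores[0]
    return sum(scores)     # BERT-SumP

# Placeholder scorer: term overlap between query and passage.
toy_scorer = lambda q, p: len(set(q) & set(p))
doc = ("neural ranking models for ad hoc retrieval " * 50).split()
print(score_document(["neural", "retrieval"], doc, toy_scorer, agg="max"))
```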
- Beyond [CLS] through Ranking by Generation. Cicero Nogueira dos Santos et.al. EMNLP 2020 short. (query likelihood computed by GPT)
- Document Ranking with a Pretrained Sequence-to-Sequence Model. Rodrigo Nogueira, Zhiying Jiang et.al. EMNLP 2020 Findings. [code] (monoT5: casts relevance as generating "true"/"false" with T5)
- Generalizing Discriminative Retrieval Models using Generative Tasks. Bingsheng Liu, Hamed Zamani et.al. WWW 2021. (GDMTL, joint discriminative and generative model with multitask learning)
- Training Curricula for Open Domain Answer Re-Ranking. Sean MacAvaney et.al. SIGIR 2020. [code] (curriculum learning based on BM25)
- BERT-QE: Contextualized Query Expansion for Document Re-ranking. Zhi Zheng et.al. EMNLP 2020 Findings. [code] (BERT-QE)
- Not All Relevance Scores are Equal: Efficient Uncertainty and Calibration Modeling for Deep Retrieval Models. Daniel Cohen et.al. SIGIR 2021.
- MarkedBERT: Integrating Traditional IR Cues in Pre-trained Language Models for Passage Retrieval. Lila Boualili et.al. SIGIR 2020 short. [code] (MarkedBERT)
- Selective Weak Supervision for Neural Information Retrieval. Kaitao Zhang et.al. WWW 2020. [code] (ReInfoSelect)
- PROP: Pre-training with Representative Words Prediction for Ad-hoc Retrieval. Xinyu Ma et.al. WSDM 2021. [code] (PROP)
- Cross-lingual Language Model Pretraining for Retrieval. Puxuan Yu et.al. WWW 2021.
- B-PROP: Bootstrapped Pre-training with Representative Words Prediction for Ad-hoc Retrieval. Xinyu Ma et.al. SIGIR 2021. [code] (B-PROP)
- Pre-training for Ad-hoc Retrieval: Hyperlink is Also You Need. Zhengyi Ma et.al. CIKM 2021. [code] (HARP)
- Contrastive Learning of User Behavior Sequence for Context-Aware Document Ranking. Yutao Zhu et.al. CIKM 2021. [code] (COCA)
- Pre-trained Language Model based Ranking in Baidu Search. Lixin Zou et.al. KDD 2021.
- A Unified Pretraining Framework for Passage Ranking and Expansion. Ming Yan et.al. AAAI 2021. (UED, jointly training ranking and query generation)
- Local Self-Attention over Long Text for Efficient Document Retrieval. Sebastian Hofstätter et.al. SIGIR 2020 short. [code] (TKL:Transformer-Kernel for long text)
- The Cascade Transformer: an Application for Efficient Answer Sentence Selection. Luca Soldaini et.al. ACL 2020. [code] (Cascade Transformer: prune candidates by layer)
- Early Exiting BERT for Efficient Document Ranking. Ji Xin et.al. EMNLP 2020 SustaiNLP Workshop. [code] (Early exit)
- Understanding BERT Rankers Under Distillation. Luyu Gao et.al. ICTIR 2020. (LM Distill + Ranker Distill)
- Simplified TinyBERT: Knowledge Distillation for Document Retrieval. Xuanang Chen et.al. ECIR 2021. [code] (TinyBERT+knowledge distillation)
- TILDE: Term Independent Likelihood moDEl for Passage Re-ranking. Shengyao Zhuang, Guido Zuccon. SIGIR 2021. [code] (TILDE)
- Cross-lingual Retrieval for Iterative Self-Supervised Training. Chau Tran et.al. NeurIPS 2020. [code] (CRISS)
- CLIRMatrix: A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval. Shuo Sun et.al. EMNLP 2020. [code] (Multilingual dataset-CLIRMatrix and multilingual BERT)
- Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training. Gen Li, Nan Duan et.al. AAAI 2020. [code] (Unicoder-VL)
- XGPT: Cross-modal Generative Pre-Training for Image Captioning. Qiaolin Xia, Haoyang Huang, Nan Duan et.al. Arxiv 2020. [code] (XGPT)
- UNITER: UNiversal Image-TExt Representation Learning. Yen-Chun Chen, Linjie Li et.al. ECCV 2020. [code] (UNITER)
- Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. Xiujun Li, Xi Yin et.al. ECCV 2020. [code] (Oscar)
- VinVL: Making Visual Representations Matter in Vision-Language Models. Pengchuan Zhang, Xiujun Li et.al. CVPR 2021. [code] (VinVL)
- Dynamic Modality Interaction Modeling for Image-Text Retrieval. Leigang Qu et.al. SIGIR 2021 Best student paper. [code] (DIME)
- ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. Jiasen Lu, Dhruv Batra et.al. NeurIPS 2019. [code] (ViLBERT)
- 12-in-1: Multi-Task Vision and Language Representation Learning. Jiasen Lu, Dhruv Batra et.al. CVPR 2020. [code] (a multi-task model based on ViLBERT)
- Learning Transferable Visual Models From Natural Language Supervision. Alec Radford et.al. ICML 2021. [code] (CLIP, from OpenAI)
- ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph. Fei Yu, Jiji Tang et.al. Arxiv 2020. [code] (ERNIE-ViL,1st place on the VCR leaderboard)
- M6-v0: Vision-and-Language Interaction for Multi-modal Pretraining. Junyang Lin, An Yang et.al. KDD 2020. (M6-v0/InterBERT)
- M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training. Haoyang Huang, Lin Su et.al. CVPR 2021. [code] (M3P, MILD dataset)
- Faiss: a library for efficient similarity search and clustering of dense vectors
- Pyserini: a Python Toolkit to Support Sparse and Dense Representations
- MatchZoo: a library consisting of many popular neural text matching models
- Pre-trained Models for Natural Language Processing: A Survey. Xipeng Qiu et.al.
- BERT-related-papers
- Pre-trained Language Model Papers from THU-NLP
- Efficient Transformers: A Survey. Yi Tay, Mostafa Dehghani et.al. Arxiv 2020.