A curated list of awesome papers related to pre-trained models for information retrieval (a.k.a. pre-training for IR). If I missed any papers, feel free to open a PR to include them. Any feedback and contributions are welcome!
- Pre-training Methods in Information Retrieval. Yixing Fan, Xiaohui Xie et.al. FnTIR 2022
- Dense Text Retrieval based on Pretrained Language Models: A Survey. Wayne Xin Zhao, Jing Liu et.al. Arxiv 2022
- Pretrained Transformers for Text Ranking: BERT and Beyond. Jimmy Lin et.al. M&C 2021
- Semantic Models for the First-stage Retrieval: A Comprehensive Review. Jiafeng Guo et.al. TOIS 2021
- A Deep Look into Neural Ranking Models for Information Retrieval. Jiafeng Guo et.al. IPM 2020
- Learning to Reweight Terms with Distributed Representations. Guoqing Zheng, Jamie Callan. SIGIR 2015. (DeepTR)
- Context-Aware Term Weighting For First Stage Passage Retrieval. Zhuyun Dai et.al. SIGIR 2020 short. [code] (DeepCT)
- Context-Aware Document Term Weighting for Ad-Hoc Search. Zhuyun Dai et.al. WWW 2020. [code] (HDCT)
- Learning Term Discrimination. Jibril Frej et.al. SIGIR 2020. (IDF-reweighting)
- COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List. Luyu Gao et.al. NAACL 2021. [code] (COIL)
- Learning Passage Impacts for Inverted Indexes. Antonio Mallia et.al. SIGIR 2021 short. [code] (DeepImpact)
- Document Expansion by Query Prediction. Rodrigo Nogueira et.al. [doc2query code, docTTTTTquery code] (doc2query, docTTTTTquery)
- Generation-Augmented Retrieval for Open-Domain Question Answering. Yuning Mao et.al. ACL 2021. [code] (query expansion with BART)
- Unsupervised Document Expansion for Information Retrieval with Stochastic Text Generation. Jeong et.al. arXiv 2021. [code] (unsupervised document expansion)
- SparTerm: Learning Term-based Sparse Representation for Fast Text Retrieval. Yang Bai, Xiaoguang Li et.al. Arxiv 2020. (SparTerm: Term importance distribution from MLM+Binary Term Gating)
- Contextualized Sparse Representations for Real-Time Open-Domain Question Answering. Jinhyuk Lee, Minjoon Seo et.al. ACL 2020. [code] (SPARC, sparse vectors)
- SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking (and SPLADE v2). Thibault Formal et.al. SIGIR 2021. [code] (SPLADE; a term-weighting sketch follows this group)
- Ultra-High Dimensional Sparse Representations with Binarization for Efficient Text Retrieval. Kyoung-Rok Jang et.al. EMNLP 2021. (UHD)
- Efficient Passage Retrieval with Hashing for Open-domain Question Answering. Ikuya Yamada et.al. ACL 2021. [code] (BPR, convert embedding vector to binary codes)
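
A minimal sketch (not the official SPLADE code) of how entries such as SparTerm and SPLADE above derive a learned sparse representation from an MLM head: vocabulary logits are saturated with log(1 + ReLU(·)) and pooled over the sequence. The `bert-base-uncased` checkpoint is only a stand-in; the actual models fine-tune the MLM with a ranking loss plus a sparsity (FLOPS) regularizer.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")  # stand-in; SPLADE fine-tunes this

def sparse_rep(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits          # (1, seq_len, vocab_size)
    # saturate and max-pool over the sequence -> one weight per vocabulary term
    weights, _ = torch.log1p(torch.relu(logits)).max(dim=1)
    return weights.squeeze(0)                    # (vocab_size,)

vec = sparse_rep("what causes the aurora borealis")
top = torch.topk(vec, 10)
print(tokenizer.convert_ids_to_tokens(top.indices.tolist()))
```
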
- Dense Passage Retrieval for Open-Domain Question Answering. Vladimir Karpukhin, Barlas Oguz et.al. EMNLP 2020. [code] (DPR, in-batch negatives; a loss sketch follows this group)
- RepBERT: Contextualized Text Embeddings for First-Stage Retrieval. Jingtao Zhan et.al. Arxiv 2020. [code] (RepBERT)
- Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. Lee Xiong, Chenyan Xiong et.al. ICLR 2021. [code] (ANCE, refresh index during training)
- RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering. Yingqi Qu et.al. NAACL 2021. (RocketQA: cross-batch negatives, denoised hard negatives, and data augmentation)
- Optimizing Dense Retrieval Model Training with Hard Negatives. Jingtao Zhan et.al. SIGIR 2021. [code] (ADORE&STAR, query-side finetuning built on pretrained document encoders)
- Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling. Sebastian Hofstätter et.al. SIGIR 2021. [code] (TAS-Balanced, sample from query cluster and distill from BERT ensemble)
- PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval. Ruiyang Ren et.al. ACL Findings 2021. [code] (PAIR)
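
A minimal sketch of the in-batch negative trick used by DPR and its successors above: with a batch of B (query, positive passage) pairs, every other query's positive acts as a negative, turning training into a B-way classification over the similarity matrix.

```python
import torch
import torch.nn.functional as F

def in_batch_negative_loss(q_emb: torch.Tensor, p_emb: torch.Tensor) -> torch.Tensor:
    """q_emb, p_emb: (B, dim) outputs of the query and passage encoders."""
    scores = q_emb @ p_emb.t()               # (B, B) similarity matrix
    labels = torch.arange(q_emb.size(0))     # the diagonal holds the true pairs
    return F.cross_entropy(scores, labels)

q, p = torch.randn(8, 768), torch.randn(8, 768)
print(in_batch_negative_loss(q, p))
```
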
- ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. Omar Khattab et.al. SIGIR 2020. [code] (ColBERT; a late-interaction scoring sketch follows this group)
- Poly-encoders: Architectures and pre-training strategies for fast and accurate multi-sentence scoring. Samuel Humeau, Kurt Shuster et.al. ICLR 2020. [code] (Poly-encoders)
- Sparse, Dense, and Attentional Representations for Text Retrieval. Yi Luan, Jacob Eisenstein et.al. TACL 2021. (ME-BERT, multi-vectors)
- Improving Document Representations by Generating Pseudo Query Embeddings for Dense Retrieval. Hongyin Tang, Xingwu Sun et.al. ACL 2021.
- Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index. Minjoon Seo, Jinhyuk Lee et.al. ACL 2019. [code] (DENSPI)
- Learning Dense Representations of Phrases at Scale. Jinhyuk Lee, Danqi Chen et.al. ACL 2021. [code] (DensePhrases)
- Multi-View Document Representation Learning for Open-Domain Dense Retrieval. Shunyu Zhang et.al. ACL 2022. (MVR)
- Multivariate Representation Learning for Information Retrieval. Hamed Zamani et.al. SIGIR 2023. (Learn multivariate distributions)
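
A minimal sketch of ColBERT-style late interaction (MaxSim) referenced above: the query-passage score is the sum over query tokens of each token's maximum similarity to any passage token.

```python
import torch
import torch.nn.functional as F

def maxsim_score(q_tok: torch.Tensor, d_tok: torch.Tensor) -> torch.Tensor:
    """q_tok: (Lq, dim), d_tok: (Ld, dim); both are typically L2-normalized."""
    sim = q_tok @ d_tok.t()                  # (Lq, Ld) token-level similarities
    return sim.max(dim=1).values.sum()       # MaxSim, then sum over query tokens

q = F.normalize(torch.randn(32, 128), dim=-1)
d = F.normalize(torch.randn(180, 128), dim=-1)
print(maxsim_score(q, d))
```
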
- Distilling Knowledge from Reader to Retriever for Question Answering. Gautier Izacard, Edouard Grave. ICLR 2021. [unofficial code] (Distill cross-attention of reader to retriever)
- Distilling Knowledge for Fast Retrieval-based Chat-bots. Amir Vakili Tahami et.al. SIGIR 2020. [code] (Distill from cross-encoders to bi-encoders)
- Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation. Sebastian Hofstätter et.al. Arxiv 2020. [code] (Margin-MSE distillation from a BERT ensemble; a loss sketch follows this group)
- Distilling Dense Representations for Ranking using Tightly-Coupled Teachers. Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin. Arxiv 2020. [code] (TCTColBERT: distill from ColBERT)
- RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking. Ruiyang Ren, Yingqi Qu et.al. EMNLP 2021. [code] (RocketQAv2, joint learning by distillation)
- Curriculum Contrastive Context Denoising for Few-shot Conversational Dense Retrieval. Kelong Mao et.al. SIGIR 2022.
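
A minimal sketch of the Margin-MSE objective used in the cross-architecture distillation entry above: the student (e.g. a bi-encoder) learns to reproduce the teacher cross-encoder's score margin between a positive and a negative passage, rather than its absolute scores.

```python
import torch
import torch.nn.functional as F

def margin_mse_loss(s_pos, s_neg, t_pos, t_neg):
    """s_*: student scores, t_*: teacher scores, all shaped (B,)."""
    return F.mse_loss(s_pos - s_neg, t_pos - t_neg)

student_pos, student_neg = torch.randn(16), torch.randn(16)
teacher_pos, teacher_neg = torch.randn(16), torch.randn(16)
print(margin_mse_loss(student_pos, student_neg, teacher_pos, teacher_neg))
```
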
- Latent Retrieval for Weakly Supervised Open Domain Question Answering. Kenton Lee et.al. ACL 2019. [code] (ORQA, ICT)
- Pre-training tasks for embedding-based large scale retrieval. Wei-Cheng Chang et.al. ICLR 2020. (ICT, BFS and WLP; an ICT data-construction sketch follows this group)
- REALM: Retrieval-Augmented Language Model Pre-Training. Kelvin Guu, Kenton Lee et.al. ICML 2020. [code] (REALM)
- Less is More: Pre-train a Strong Text Encoder for Dense Retrieval Using a Weak Decoder. Shuqi Lu, Di He, Chenyan Xiong et.al. EMNLP 2021. [code] (Seed)
- Condenser: a Pre-training Architecture for Dense Retrieval. Luyu Gao et.al. EMNLP 2021. [code] (Condenser)
- Unsupervised Context Aware Sentence Representation Pretraining for Multi-lingual Dense Retrieval. Ning Wu et.al. IJCAI 2022. [code] (CCP, cross-lingual pre-training)
- Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval. Luyu Gao et.al. ACL 2022. [code] (coCondenser)
- LaPraDoR: Unsupervised Pretrained Dense Retriever for Zero-Shot Text Retrieval. Canwen Xu, Daya Guo et.al. ACL 2022. [code] (LaPraDoR, ICT+dropout)
- A Contrastive Pre-training Approach to Learn Discriminative Autoencoder for Dense Retrieval. Xinyu Ma et.al. CIKM 2022. (CPADE, document term distribution-based contrastive pretraining)
- Pre-train a Discriminative Text Encoder for Dense Retrieval via Contrastive Span Prediction. Xinyu Ma et.al. SIGIR 2022. [code] (COSTA, group-wise contrastive learning)
- H-ERNIE: A Multi-Granularity Pre-Trained Language Model for Web Search. Xiaokai Chu et.al. SIGIR 2022. (H-ERNIE)
- Structure and Semantics Preserving Document Representations. Natraj Raman et.al. SIGIR 2022.
- Contriever: Unsupervised Dense Information Retrieval with Contrastive Learning. Gautier Izacard et.al. TMLR 2022. [code] (Contriever)
- Augmenting Document Representations for Dense Retrieval with Interpolation and Perturbation. Jeong et.al. ACL 2022. [code] (Augmentation for Dense Retrieval)
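
A minimal sketch of Inverse Cloze Task (ICT) data construction used by ORQA and several pre-training papers above: a randomly chosen sentence serves as a pseudo-query and the rest of the passage as its pseudo-relevant context, yielding training pairs without any labels. (The original formulation keeps the sentence in the context with some probability; that detail is omitted here.)

```python
import random

def ict_example(passage_sentences: list[str]) -> tuple[str, str]:
    idx = random.randrange(len(passage_sentences))
    pseudo_query = passage_sentences[idx]
    context = " ".join(s for i, s in enumerate(passage_sentences) if i != idx)
    return pseudo_query, context

sentences = [
    "The aurora borealis appears in high-latitude regions.",
    "It is caused by charged solar particles hitting the atmosphere.",
    "Displays are strongest around the equinoxes.",
]
print(ict_example(sentences))
```
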
- Joint Learning of Deep Retrieval Model and Product Quantization based Embedding Index. Han Zhang et.al. SIGIR 2021 short. [code] (Poeem; a Faiss product-quantization example follows this group)
- Jointly Optimizing Query Encoder and Product Quantization to Improve Retrieval Performance. Jingtao Zhan et.al. CIKM 2021. [code] (JPQ)
- Learning Discrete Representations via Constrained Clustering for Effective and Efficient Dense Retrieval. Jingtao Zhan et.al. WSDM 2022. [code] (RepCONC)
- Matching-oriented Embedding Quantization For Ad-hoc Retrieval. Shitao Xiao et.al. EMNLP 2021. [code] (MoPQ)
- Distill-VQ: Learning Retrieval Oriented Vector Quantization By Distilling Knowledge from Dense Embeddings. Shitao Xiao et.al. SIGIR 2022. [code]
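
The quantization papers above learn the encoder and the quantizer jointly; as a baseline point of reference, here is a minimal sketch of post-hoc product quantization with Faiss (listed in the resources at the end). The dimensions and parameters are illustrative.

```python
import faiss
import numpy as np

d, nlist, m, nbits = 768, 128, 64, 8               # dim, IVF cells, PQ sub-vectors, bits per code
xb = np.random.rand(10000, d).astype("float32")    # document embeddings
xq = np.random.rand(5, d).astype("float32")        # query embeddings

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
index.train(xb)
index.add(xb)
index.nprobe = 16                                   # IVF cells to visit at search time
distances, ids = index.search(xq, 10)
print(ids[0])
```
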
- Answering Complex Open-Domain Questions with Multi-Hop Dense Retrieval. Wenhan Xiong, Xiang Lorraine Li et.al. ICLR 2021. [code] (Iteratively encode the question and previously retrieved documents as query vectors)
- Multi-Task Retrieval for Knowledge-Intensive Tasks. Jean Maillard, Vladimir Karpukhin et.al. ACL 2021. (Multi-task learning)
- Evaluating Extrapolation Performance of Dense Retrieval. Jingtao Zhan et.al. CIKM 2022. [code]
- Pseudo-Relevance Feedback for Multiple Representation Dense Retrieval. Xiao Wang et.al. ICTIR 2021. (ColBERT-PRF)
- Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback. HongChien Yu et.al. CIKM 2021. [code] (ANCE-PRF)
- LoL: A Comparative Regularization Loss over Query Reformulation Losses for Pseudo-Relevance Feedback. Yunchang Zhu et.al. SIGIR 2022. [code] (LoL, Pseudo-relevance feedback)
- Implicit Feedback for Dense Passage Retrieval: A Counterfactual Approach. Shengyao Zhuang et.al. SIGIR 2022. [code] (CoRocchio, counterfactual Rocchio algorithm; a Rocchio sketch follows below)
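
A minimal sketch of Rocchio-style pseudo-relevance feedback in embedding space, the classic formulation behind dense PRF entries such as ANCE-PRF and CoRocchio above: the query vector is pulled toward the centroid of the top-ranked documents (the weights alpha and beta are illustrative).

```python
import numpy as np

def rocchio_update(q: np.ndarray, feedback_docs: np.ndarray,
                   alpha: float = 0.8, beta: float = 0.2) -> np.ndarray:
    """q: (dim,) query embedding; feedback_docs: (k, dim) top-k document embeddings."""
    return alpha * q + beta * feedback_docs.mean(axis=0)

q_new = rocchio_update(np.random.rand(768), np.random.rand(5, 768))
print(q_new.shape)
```
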
- Hard Negatives or False Negatives: Correcting Pooling Bias in Training Neural Ranking Models. Yinqiong Cai et.al. CIKM 2022.
- Complement Lexical Retrieval Model with Semantic Residual Embeddings. Luyu Gao et.al. ECIR 2021. (CLEAR)
- BERT-based Dense Retrievers Require Interpolation with BM25 for Effective Passage Retrieval. Shuai Wang et.al. ICTIR 2021. (a score-interpolation sketch follows this group)
- Progressively Optimized Bi-Granular Document Representation for Scalable Embedding Based Retrieval. Shitao Xiao et.al. WWW 2022. [code]
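
A minimal sketch of interpolating BM25 and dense-retriever scores, as studied in the ICTIR 2021 entry above; min-max normalization and the weight lambda are illustrative choices, not any paper's exact settings.

```python
def minmax(scores: dict[str, float]) -> dict[str, float]:
    lo, hi = min(scores.values()), max(scores.values())
    return {d: (s - lo) / (hi - lo + 1e-9) for d, s in scores.items()}

def interpolate(bm25: dict[str, float], dense: dict[str, float], lam: float = 0.5):
    bm25, dense = minmax(bm25), minmax(dense)
    docs = set(bm25) | set(dense)
    fused = {d: lam * bm25.get(d, 0.0) + (1 - lam) * dense.get(d, 0.0) for d in docs}
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)

print(interpolate({"d1": 12.3, "d2": 8.1}, {"d1": 0.62, "d3": 0.71}))
```
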
- Understanding the Behaviors of BERT in Ranking. Yifan Qiao et.al. Arxiv 2019. (Representation-focused and Interaction-focused)
- Passage Re-ranking with BERT. Rodrigo Nogueira et.al. Arxiv 2019. [code] (monoBERT: arguably the first work applying BERT to IR)
- Multi-Stage Document Ranking with BERT, The Expando-Mono-Duo Design Pattern for Text Ranking with Pretrained Sequence-to-Sequence Models. Rodrigo Nogueira et.al. Arxiv 2020. (Expando-Mono-Duo: doc2query+pointwise+pairwise)
- CEDR: Contextualized Embeddings for Document Ranking. Sean MacAvaney et.al. SIGIR 2019 short. [code] (CEDR: BERT+neuIR model)
- Beyond [CLS] through Ranking by Generation. Cicero Nogueira dos Santos et.al. EMNLP 2020 short. (Query generation using GPT and BART)
- Document Ranking with a Pretrained Sequence-to-Sequence Model. Rodrigo Nogueira, Zhiying Jiang et.al. EMNLP Findings 2020. [code] (monoT5: relevance token generation using T5; a scoring sketch follows this group)
- RankT5: Fine-Tuning T5 for Text Ranking with Ranking Losses. Honglei Zhuang et.al. Arxiv 2022.
- Generalizing Discriminative Retrieval Models using Generative Tasks. Bingsheng Liu, Hamed Zamani et.al. WWW 2021. (GDMTL,joint discriminative and generative model with multitask learning)
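
A minimal sketch of the relevance-token scoring described in the monoT5 entry above: the model reads "Query: ... Document: ... Relevant:" and the relevance score is the probability it assigns to generating "true" as the first decoded token. The checkpoint name is one of the publicly released monoT5 models; treat the prompt template and token choice as illustrative.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

name = "castorini/monot5-base-msmarco"
tokenizer = T5Tokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name).eval()

def relevance_score(query: str, doc: str) -> float:
    text = f"Query: {query} Document: {doc} Relevant:"
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    decoder_start = torch.full((1, 1), model.config.decoder_start_token_id)
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_start).logits[0, -1]
    true_id, false_id = tokenizer.encode("true")[0], tokenizer.encode("false")[0]
    probs = torch.softmax(logits[[false_id, true_id]], dim=0)
    return probs[1].item()   # probability of "true"

print(relevance_score("what causes aurora", "Auroras are caused by solar wind particles."))
```
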
- Deeper Text Understanding for IR with Contextual Neural Language Modeling. Zhuyun Dai et.al. SIGIR 2019 short. [code] (BERT-MaxP, BERT-firstP, BERT-sumP: Passage-level)
- Simple Applications of BERT for Ad Hoc Document Retrieval; Applying BERT to Document Retrieval with Birch; Cross-Domain Modeling of Sentence-Level Evidence for Document Retrieval. Wei Yang, Haotian Zhang et.al. Arxiv 2019; Zeynep Akkalyoncu Yilmaz et.al. EMNLP 2019 short. [code] (Birch: Sentence-level)
- Intra-Document Cascading: Learning to Select Passages for Neural Document Ranking. Sebastian Hofstätter et.al. SIGIR 2021. [code] (Distill a ranking model to conv-knrm to select top-k passages)
- PARADE: Passage Representation Aggregation for Document Reranking. Canjia Li et.al. Arxiv 2020. [code] (An extensive comparison of various Passage Representation Aggregation methods)
- Leveraging Passage-level Cumulative Gain for Document Ranking. Zhijing Wu et.al. WWW 2020. (PCGM)
- Local Self-Attention over Long Text for Efficient Document Retrieval. Sebastian Hofstätter et.al. SIGIR 2020 short. [code] (TKL: Transformer-Kernel for long text)
- Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching. Liu Yang et.al. CIKM 2020. [code] (SMITH for doc2doc matching)
- Socialformer: Social Network Inspired Long Document Modeling for Document Ranking. Yujia Zhou et.al. WWW 2022. (Socialformer)
- DC-BERT: Decoupling Question and Document for Efficient Contextual Encoding. Yuyu Zhang, Ping Nie et.al. SIGIR 2020 short. (DC-BERT)
- Efficient Document Re-Ranking for Transformers by Precomputing Term Representations. Sean MacAvaney et.al. SIGIR 2020. [code] (PreTTR)
- Modularized Transformer-based Ranking Framework. Luyu Gao et.al. EMNLP 2020. [code] (MORES, similar to PreTTR)
- TILDE: Term Independent Likelihood moDEl for Passage Re-ranking. Shengyao Zhuang, Guido Zuccon. SIGIR 2021. [code] (TILDE)
- Fast Forward Indexes for Efficient Document Ranking. Jurek Leonhardt et.al. WWW 2022. (Fast forward index)
- Understanding BERT Rankers Under Distillation. Luyu Gao et.al. ICTIR 2020. (LM Distill + Ranker Distill)
- Simplified TinyBERT: Knowledge Distillation for Document Retrieval. Xuanang Chen et.al. ECIR 2021. [code] (TinyBERT+knowledge distillation)
- Semi-Siamese Bi-encoder Neural Ranking Model Using Lightweight Fine-Tuning. Euna Jung, Jaekeol Choi et.al. WWW 2022. [code] (Lightweight Fine-Tuning)
- Scattered or Connected? An Optimized Parameter-efficient Tuning Approach for Information Retrieval. Xinyu Ma et.al. CIKM 2022. (IAA, introduce the aside module to stabilize training)
- The Cascade Transformer: an Application for Efficient Answer Sentence Selection. Luca Soldaini et.al. ACL 2020. [code] (Cascade Transformer: prune candidates by layer)
- Early Exiting BERT for Efficient Document Ranking. Ji Xin et.al. EMNLP 2020 SustaiNLP Workshop. [code] (Early exit)
- BERT-QE: Contextualized Query Expansion for Document Re-ranking. Zhi Zheng et.al. EMNLP 2020 Findings. [code] (BERT-QE)
- Training Curricula for Open Domain Answer Re-Ranking. Sean MacAvaney et.al. SIGIR 2020. [code] (curriculum learning based on BM25)
- Not All Relevance Scores are Equal: Efficient Uncertainty and Calibration Modeling for Deep Retrieval Models. Daniel Cohen et.al. SIGIR 2021.
- MarkedBERT: Integrating Traditional IR Cues in Pre-trained Language Models for Passage Retrieval. Lila Boualili et.al. SIGIR 2020 short. [code] (MarkedBERT)
- Selective Weak Supervision for Neural Information Retrieval. Kaitao Zhang et.al. WWW 2020. [code] (ReInfoSelect)
- PROP: Pre-training with Representative Words Prediction for Ad-hoc Retrieval. Xinyu Ma et.al. WSDM 2021. [code] (PROP)
- Cross-lingual Language Model Pretraining for Retrieval. Puxuan Yu et.al. WWW 2021.
- B-PROP: Bootstrapped Pre-training with Representative Words Prediction for Ad-hoc Retrieval. Xinyu Ma et.al. SIGIR 2021. [code] (B-PROP)
- Pre-training for Ad-hoc Retrieval: Hyperlink is Also You Need. Zhengyi Ma et.al. CIKM 2021. [code] (HARP)
- Contrastive Learning of User Behavior Sequence for Context-Aware Document Ranking. Yutao Zhu et.al. CIKM 2021. [code] (COCA)
- Pre-trained Language Model based Ranking in Baidu Search. Lixin Zou et.al. KDD 2021.
- A Unified Pretraining Framework for Passage Ranking and Expansion. Ming Yan et.al. AAAI 2021. (UED, jointly training ranking and query generation)
- Axiomatically Regularized Pre-training for Ad hoc Search. Jia Chen et.al. SIGIR 2022. [code] (ARES)
- Webformer: Pre-training with Web Pages for Information Retrieval. Yu Guo et.al. SIGIR 2022. (Webformer)
- Competitive Search. Oren Kurland et.al. SIGIR 2022.
- PRADA: Practical Black-Box Adversarial Attacks against Neural Ranking Models. Chen Wu et.al. Arxiv 2022.
- Order-Disorder: Imitation Adversarial Attacks for Black-box Neural Ranking Models. Jiawei Liu et.al. CCS 2022.
- Are Neural Ranking Models Robust? Chen Wu et.al. TOIS.
- Certified Robustness to Word Substitution Ranking Attack for Neural Ranking Models. Chen Wu et.al. CIKM 2022.
- Topic-oriented Adversarial Attacks against Black-box Neural Ranking Models. Yu-An Liu et.al. SIGIR 2023.
- Cross-lingual Retrieval for Iterative Self-Supervised Training. Chau Tran et.al. NeurIPS 2020. [code] (CRISS)
- CLIRMatrix: A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval. Shuo Sun et.al. EMNLP 2020. [code] (Multilingual dataset-CLIRMatrix and multilingual BERT)
- Adversarial Retriever-Ranker for dense text retrieval. Hang Zhang et.al. ICLR 2022. [code] (AR2)
- RankFlow: Joint Optimization of Multi-Stage Cascade Ranking Systems as Flows. Jiarui Qin et.al. SIGIR 2022. (RankFlow)
- Rethinking Search: Making Domain Experts out of Dilettantes. Donald Metzler et.al. SIGIR Forum 2021. (Envisioned the model-based IR system)
- Transformer Memory as a Differentiable Search Index. Yi Tay et.al. Arxiv 2022. (DSI)
- DynamicRetriever: A Pre-training Model-based IR System with Neither Sparse nor Dense Index. Yujia Zhou et.al. Arxiv 2022. (DynamicRetriever)
- A Neural Corpus Indexer for Document Retrieval. Yujing Wang et.al. Arxiv 2022. (NCI)
- Autoregressive Search Engines: Generating Substrings as Document Identifiers. Michele Bevilacqua et.al. Arxiv 2022. [code] (SEAL)
- CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks. Jiangui Chen et.al. CIKM 2022. [code] (CorpusBrain)
- A Unified Generative Retriever for Knowledge-Intensive Language Tasks via Prompt Learning. Jiangui Chen et.al. SIGIR 2023. [code] (UGR)
- TOME: A Two-stage Approach for Model-based Retrieval. Ruiyang Ren et.al. ACL 2023. (TOME: Passage generation then URL generation)
- How Does Generative Retrieval Scale to Millions of Passages? Ronak Pradeep, Kai Hui et.al. Arxiv 2023. (Comprehensive study on proposed methods, using synthetic queries as document ids)
- Semantic-Enhanced Differentiable Search Index Inspired by Learning Strategies. Yubao Tang et.al. KDD 2023. (Semantic-Enhanced DSI)
- Information Retrieval meets Large Language Models: A strategic report from Chinese IR community. Qingyao Ai et.al. The CCIR community. AI Open 2023.
- Large Language Models for Information Retrieval: A Survey. Yutao Zhu et.al. Renmin University of China. Arxiv 2023.
- Navigating Complex Search Tasks with AI Copilots. Ryen W. White (Microsoft Research). Arxiv 2023.
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Patrick Lewis, Ethan Perez et.al. NeurIPS 2020. (RAG, 440M BART)
- Improving Language Models by Retrieving from Trillions of Tokens. Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann et.al. ICML 2022. [code] (RETRO, enc-dec 7.5B)
- Atlas: Few-shot Learning with Retrieval Augmented Language Models. Gautier Izacard, Patrick Lewis et.al. Arxiv 2022. [code] (Atlas, T5, 11B)
- Internet-augmented language models through few-shot prompting for open-domain question answering. Angeliki Lazaridou et.al. Arxiv 2022. (Gopher 280B, Conditioning on Google search results)
- Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy. Zhihong Shao et.al. Arxiv 2023.
- Instruction Tuning post Retrieval-Augmented Pretraining. Boxin Wang et.al. Arxiv 2023.
- Improving Passage Retrieval with Zero-Shot Question Generation. Devendra Singh Sachan et.al. EMNLP 2022. [code] (UPR, rerank docs based on the query likelihood of GPT-neo 2.7B / T0 3B, 11B)
- Promptagator: Few-shot Dense Retrieval From 8 Examples. Zhuyun Dai et.al. ICLR 2023. (Generate pseudo queries using in-context learning, FLAN 137B)
- UDAPDR: Unsupervised Domain Adaptation via LLM Prompting and Distillation of Rerankers. Jon Saad-Falcon, Omar Khattab et.al. Arxiv 2023. [code] (Train a reranker on pseudo queries generated with GPT-3)
- InPars: Data Augmentation for Information Retrieval using Large Language Models. Luiz Bonifacio et.al. Arxiv 2022. [code] (Use GPT-3 Curie to generate pseudo queries with in-context learning; query generation probabilities select the top-k q-d pairs)
- InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval. Vitor Jeronymo et.al. Arxiv 2023. [code] (similar to InPars, uses the GPT-J 6B LLM and a finetuned reranker as selector)
- InPars-Light: Cost-Effective Unsupervised Training of Efficient Rankers. Leonid Boytsov et.al. Arxiv 2023. (similar to InPars, uses GPT-J 6B and BLOOM 7B)
- Generative Relevance Feedback with Large Language Models. Iain Mackie et.al. SIGIR 2023 short. (GRF, generates various content with GPT-3 for relevance feedback)
- Query Expansion by Prompting Large Language Models. Rolf Jagerman et.al. Arxiv 2023.
- Exploring the Viability of Synthetic Query Generation for Relevance Prediction. Aditi Chaudhary et.al. Arxiv 2023. (FLAN 137B, label-conditioned generation)
- Large Language Model based Long-tail Query Rewriting in Taobao Search. Wenjun Peng et.al. Arxiv 2023.
- Generate, Filter, and Fuse: Query Expansion via Multi-Step Keyword Generation for Zero-Shot Neural Rankers. Minghan Li et.al. Arxiv 2023. (Use Flan-PaLM2-S for keywords generation)
- Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval. Nandan Thakur et.al. Arxiv 2023.
- Generate rather than Retrieve: Large Language Models are Strong Context Generators. Wenhao Yu et.al. ICLR 2023. [code] (GenRead, generate pseudo docs with InstructGPT for the reader)
- Recitation-Augmented Language Models. Zhiqing Sun et.al. ICLR 2023. [code] (similar to GenRead)
- Precise Zero-Shot Dense Retrieval without Relevance Labels. Luyu Gao, Xueguang Ma et.al. Arxiv 2022. [code] (HyDE: InstructGPT generates a pseudo doc and Contriever retrieves the real one; a prompting sketch follows this group)
- Query2doc: Query Expansion with Large Language Models. Liang Wang et.al. Arxiv 2023. (Generate pseudo docs using in-context learning and then concat with queries, text-davinci-003)
- Large Language Models are Strong Zero-Shot Retriever. Tao Shen et.al. Arxiv 2023. (similar to HyDE, augments the LLM with docs retrieved using BM25)
- Generating Synthetic Documents for Cross-Encoder Re-Rankers: A Comparative Study of ChatGPT and Human Experts. Arian Askari et.al. Arxiv 2023. [code] (Ranking with synthetic data generated by ChatGPT)
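
A minimal sketch of HyDE / query2doc-style expansion from the entries above: prompt an LLM for a hypothetical passage answering the query, then embed it (HyDE) or append it to the query (query2doc). `call_llm` is a hypothetical helper standing in for whatever LLM client is used; the prompt and the repetition count are illustrative.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")  # hypothetical helper

def hypothetical_document(query: str) -> str:
    prompt = (
        "Write a short passage that answers the question.\n"
        f"Question: {query}\n"
        "Passage:"
    )
    return call_llm(prompt)

def expanded_query(query: str, repeat_query: int = 5) -> str:
    # query2doc-style: repeat the short query so it is not drowned out by the
    # long generated passage when fed to a lexical retriever such as BM25
    return " ".join([query] * repeat_query + [hypothetical_document(query)])
```
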
- Task-aware Retrieval with Instructions. Akari Asai, Timo Schick et.al. Arxiv 2022. [code] (TART, BERRI 40 tasks with instructions, 1.5B FLAN-T5)
- One Embedder, Any Task: Instruction-Finetuned Text Embeddings. Hongjin Su, Weijia Shi et.al. [code] (Instructor, 330 diverse tasks, 1.5B model)
- ExaRanker: Explanation-Augmented Neural Ranker. Fernando Ferraretto et.al. Arxiv 2023. [code] (Training monoT5 with both relevance score and explanations generated by GPT-3.5 (text-davinci-002))
- Perspectives on Large Language Models for Relevance Judgment. Guglielmo Faggioli et.al. Arxiv 2023. (Perspective Paper)
- Zero-Shot Listwise Document Reranking with a Large Language Model. Xueguang Ma et.al. Arxiv 2023. (LRL, generate a rank list with GPT-3)
- Large Language Models are Built-in Autoregressive Search Engines. Noah Ziems et.al. Arxiv 2023. (LLM-URL, use GPT-3 text-davinci-003 to generate URL, model-based IR)
- Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents. Weiwei Sun et.al. EMNLP 2023. [code] (RankGPT: zero-shot passage reranking with ChatGPT/GPT-4; a listwise prompting sketch follows this group)
- Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting. Zhen Qin et.al. Arxiv 2023.
- RankVicuna: Zero-Shot Listwise Document Reranking with Open-Source Large Language Models. Ronak Pradeep et.al. Arxiv 2023. [code]
- Found in the Middle: Permutation Self-Consistency Improves Listwise Ranking in Large Language Models. Raphael Tang, Xinyu Zhang et.al. Arxiv 2023. [code]
- Fine-Tuning LLaMA for Multi-Stage Text Retrieval. Xueguang Ma et.al. Arxiv 2023.
- A Setwise Approach for Effective and Highly Efficient Zero-shot Ranking with Large Language Models. Shengyao Zhuang et.al. Arxiv 2023.
- Open-source Large Language Models are Strong Zero-shot Query Likelihood Models for Document Ranking. Shengyao Zhuang et.al. Arxiv 2023. [code]
- PaRaDe: Passage Ranking using Demonstrations with Large Language Models. Andrew Drozdov et.al. Arxiv 2023.
- Beyond Yes and No: Improving Zero-Shot LLM Rankers via Scoring Fine-Grained Relevance Labels. Honglei Zhuang et.al. Arxiv 2023.
- Large Language Models can Accurately Predict Searcher Preferences. Paul Thomas et.al. Arxiv 2023.
- RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze! Ronak Pradeep et.al. Arxiv 2023.
- Rank-without-GPT: Building GPT-Independent Listwise Rerankers on Open-Source Large Language Models. Xinyu Zhang et.al. Arxiv 2023.
- ACID: Abstractive, Content-Based IDs for Document Retrieval with Language Models. Haoxin Li et.al. Arxiv 2023. (Uses GPT-3.5 to generate keyphrases)
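
A minimal sketch of listwise LLM reranking in the spirit of the LRL / RankGPT entries above: the prompt enumerates candidate passages and the model answers with a permutation such as "[2] > [1] > [3]". `call_llm` is again a hypothetical helper, and the prompt wording is illustrative.

```python
import re

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")  # hypothetical helper

def listwise_rerank(query: str, passages: list[str]) -> list[int]:
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Rank the following passages by relevance to the query.\n"
        f"Query: {query}\n{numbered}\n"
        "Answer with the ranking only, e.g. [2] > [1] > [3]."
    )
    reply = call_llm(prompt)
    order, seen = [], set()
    for match in re.findall(r"\[(\d+)\]", reply):
        i = int(match) - 1
        if 0 <= i < len(passages) and i not in seen:
            order.append(i)
            seen.add(i)
    # fall back to the original order for passages the model failed to mention
    return order + [i for i in range(len(passages)) if i not in seen]
```
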
- WebGPT: Browser-assisted question-answering with human feedback. Reiichiro Nakano, Jacob Hilton, Suchir Balaji et.al. Arxiv 2022. (WebGPT, GPT-3)
- Teaching language models to support answers with verified quotes. DeepMind. Arxiv 2022. (GopherCite)
- Evaluating Verifiability in Generative Search Engines. Nelson F. Liu et.al. Arxiv 2023. [code]
- Enabling Large Language Models to Generate Text with Citations. Tianyu Gao et.al. Arxiv 2023. [code] (ALCE benchmark)
- FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation. Tu Vu et.al. Arxiv 2023. [code]
- Retrieve Anything To Augment Large Language Models. Peitian Zhang, Shitao Xiao et.al. Arxiv 2023. [code]
- Leveraging Event Schema to Ask Clarifying Questions for Conversational Legal Case Retrieval. Bulou Liu et.al. CIKM 2023.
- Know Where to Go: Make LLM a Relevant, Responsible, and Trustworthy Searcher. Xiang Shi et.al.
- Evaluating Generative Ad Hoc Information Retrieval. Lukas Gienapp et.al. Arxiv 2023.
- Demonstrate–Search–Predict: Composing retrieval and language models for knowledge-intensive NLP. Omar Khattab et.al. Arxiv 2023. [code] (DSP program, GPT-3.5)
- Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training. Gen Li, Nan Duan et.al. AAAI 2020. [code] (Unicoder-VL)
- XGPT: Cross-modal Generative Pre-Training for Image Captioning. Qiaolin Xia, Haoyang Huang, Nan Duan et.al. Arxiv 2020. [code] (XGPT)
- UNITER: UNiversal Image-TExt Representation Learning. Yen-Chun Chen, Linjie Li et.al. ECCV 2020. [code] (UNITER)
- Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. Xiujun Li, Xi Yin et.al. ECCV 2020. [code] (Oscar)
- VinVL: Making Visual Representations Matter in Vision-Language Models. Pengchuan Zhang, Xiujun Li et.al. CVPR 2021. [code] (VinVL)
- Dynamic Modality Interaction Modeling for Image-Text Retrieval. Leigang Qu et.al. SIGIR 2021 Best student paper. [code] (DIME)
- ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. Jiasen Lu, Dhruv Batra et.al. NeurIPS 2019. [code] (ViLBERT)
- 12-in-1: Multi-Task Vision and Language Representation Learning. Jiasen Lu, Dhruv Batra et.al. CVPR 2020. [code] (A multi-task model based on ViLBERT)
- Learning Transferable Visual Models From Natural Language Supervision. Alec Radford et.al. ICML 2021. [code] (CLIP, from OpenAI)
- ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph. Fei Yu, Jiji Tang et.al. Arxiv 2020. [code] (ERNIE-ViL, 1st place on the VCR leaderboard)
- M6-v0: Vision-and-Language Interaction for Multi-modal Pretraining. Junyang Lin, An Yang et.al. KDD 2020. (M6-v0/InterBERT)
- M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training. Haoyang Huang, Lin Su et.al. CVPR 2021. [code] (M3P, MILD dataset)
- Faiss: a library for efficient similarity search and clustering of dense vectors
- Pyserini: a Python Toolkit to Support Sparse and Dense Representations
- MatchZoo: a library consisting of many popular neural text matching models
- Pre-trained Models for Natural Language Processing: A Survey. Xipeng Qiu et.al.
- BERT-related-papers
- Pre-trained Language Model Papers from THU-NLP
- Efficient Transformers: A Survey. Yi Tay, Mostafa Dehghani et.al. Arxiv 2020.