A curated list of awesome papers related to pre-trained models for information retrieval (a.k.a. pre-training for IR). If I missed any papers, feel free to open a PR to include them. Any feedback and contributions are welcome!
- Pre-training Methods in Information Retrieval. Yixing Fan, Xiaohui Xie et.al. FnTIR 2022
- Dense Text Retrieval based on Pretrained Language Models: A Survey. Wayne Xin Zhao, Jing Liu et.al. Arxiv 2022
- Pretrained Transformers for Text Ranking: BERT and Beyond. Jimmy Lin et.al. M&C 2021
- Semantic Models for the First-stage Retrieval: A Comprehensive Review. Jiafeng Guo et.al. TOIS 2021
- A Deep Look into Neural Ranking Models for Information Retrieval. Jiafeng Guo et.al. IPM 2020
- Learning to Reweight Terms with Distributed Representations. Guoqing Zheng, Jamie Callan. SIGIR 2015. (DeepTR)
- Context-Aware Term Weighting For First Stage Passage Retrieval. Zhuyun Dai et.al. SIGIR 2020 short. [code] (DeepCT)
- Context-Aware Document Term Weighting for Ad-Hoc Search. Zhuyun Dai et.al. WWW 2020. [code] (HDCT)
- Learning Term Discrimination. Jibril Frej et.al. SIGIR 2020. (IDF-reweighting)
- COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List. Luyu Gao et.al. NAACL 2021. [code] (COIL)
- Learning Passage Impacts for Inverted Indexes. Antonio Mallia et.al. SIGIR 2021 short. [code] (DeepImpact)
- Document Expansion by Query Prediction. Rodrigo Nogueira et.al. [doc2query code, docTTTTTquery code] (doc2query, docTTTTTquery)
- Generation-Augmented Retrieval for Open-Domain Question Answering. Yuning Mao et.al. ACL 2021. [code] (query expansion with BART)
- Unsupervised Document Expansion for Information Retrieval with Stochastic Text Generation. Jeong et.al. arXiv 2021. [code] (unsupervised document expansion)
- SparTerm: Learning Term-based Sparse Representation for Fast Text Retrieval. Yang Bai, Xiaoguang Li et.al. Arxiv 2020. (SparTerm: Term importance distribution from MLM+Binary Term Gating)
- Contextualized Sparse Representations for Real-Time Open-Domain Question Answering. Jinhyuk Lee, Minjoon Seo et.al. ACL 2020. [code] (SPARC, sparse vectors)
- SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking (and SPLADE v2). Thibault Formal et.al. SIGIR 2021. [code] (SPLADE; a term-weighting sketch follows this group)
- Ultra-High Dimensional Sparse Representations with Binarization for Efficient Text Retrieval. Kyoung-Rok Jang et.al. EMNLP 2021. (UHD)
- Efficient Passage Retrieval with Hashing for Open-domain Question Answering. Ikuya Yamada et.al. ACL 2021. [code] (BPR, convert embedding vector to binary codes)
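
A minimal sketch (not the official SPLADE code) of how entries such as SparTerm and SPLADE above derive a learned sparse representation from an MLM head: vocabulary logits are saturated with log(1 + ReLU(·)) and pooled over the sequence. The `bert-base-uncased` checkpoint is only a stand-in; the actual models fine-tune the MLM with a ranking loss plus a sparsity (FLOPS) regularizer.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")  # stand-in; SPLADE fine-tunes this

def sparse_rep(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits          # (1, seq_len, vocab_size)
    # saturate and max-pool over the sequence -> one weight per vocabulary term
    weights, _ = torch.log1p(torch.relu(logits)).max(dim=1)
    return weights.squeeze(0)                    # (vocab_size,)

vec = sparse_rep("what causes the aurora borealis")
top = torch.topk(vec, 10)
print(tokenizer.convert_ids_to_tokens(top.indices.tolist()))
```
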
- Dense Passage Retrieval for Open-Domain Question Answering. Vladimir Karpukhin, Barlas Oguz et.al. EMNLP 2020. [code] (DPR, in-batch negatives; a loss sketch follows this group)
- RepBERT: Contextualized Text Embeddings for First-Stage Retrieval. Jingtao Zhan et.al. Arxiv 2020. [code] (RepBERT)
- Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. Lee Xiong, Chenyan Xiong et.al. ICLR 2021. [code] (ANCE, refresh index during training)
- RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering. Yingqi Qu et.al. NAACL 2021. (RocketQA: cross-batch negatives, denoised hard negatives, and data augmentation)
- Optimizing Dense Retrieval Model Training with Hard Negatives. Jingtao Zhan et.al. SIGIR 2021. [code] (ADORE&STAR, query-side finetuning built on pretrained document encoders)
- Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling. Sebastian Hofstätter et.al. SIGIR 2021. [code] (TAS-Balanced, sample from query cluster and distill from BERT ensemble)
- PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval. Ruiyang Ren et.al. ACL Findings 2021. [code] (PAIR)
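
A minimal sketch of the in-batch negative trick used by DPR and its successors above: with a batch of B (query, positive passage) pairs, every other query's positive acts as a negative, turning training into a B-way classification over the similarity matrix.

```python
import torch
import torch.nn.functional as F

def in_batch_negative_loss(q_emb: torch.Tensor, p_emb: torch.Tensor) -> torch.Tensor:
    """q_emb, p_emb: (B, dim) outputs of the query and passage encoders."""
    scores = q_emb @ p_emb.t()               # (B, B) similarity matrix
    labels = torch.arange(q_emb.size(0))     # the diagonal holds the true pairs
    return F.cross_entropy(scores, labels)

q, p = torch.randn(8, 768), torch.randn(8, 768)
print(in_batch_negative_loss(q, p))
```
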
- ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. Omar Khattab et.al. SIGIR 2020. [code] (ColBERT; a late-interaction scoring sketch follows this group)
- Poly-encoders: Architectures and pre-training strategies for fast and accurate multi-sentence scoring. Samuel Humeau, Kurt Shuster et.al. ICLR 2020. [code] (Poly-encoders)
- Sparse, Dense, and Attentional Representations for Text Retrieval. Yi Luan, Jacob Eisenstein et.al. TACL 2021. (ME-BERT, multi-vectors)
- Improving Document Representations by Generating Pseudo Query Embeddings for Dense Retrieval. Hongyin Tang, Xingwu Sun et.al. ACL 2021.
- Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index. Minjoon Seo, Jinhyuk Lee et.al. ACL 2019. [code] (DENSPI)
- Learning Dense Representations of Phrases at Scale. Jinhyuk Lee, Danqi Chen et.al. ACL 2021. [code] (DensePhrases)
- Multi-View Document Representation Learning for Open-Domain Dense Retrieval. Shunyu Zhang et.al. ACL 2022. (MVR)
- Multivariate Representation Learning for Information Retrieval. Hamed Zamani et.al. SIGIR 2023. (Learn multivariate distributions)
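
A minimal sketch of ColBERT-style late interaction (MaxSim) referenced above: the query-passage score is the sum over query tokens of each token's maximum similarity to any passage token.

```python
import torch
import torch.nn.functional as F

def maxsim_score(q_tok: torch.Tensor, d_tok: torch.Tensor) -> torch.Tensor:
    """q_tok: (Lq, dim), d_tok: (Ld, dim); both are typically L2-normalized."""
    sim = q_tok @ d_tok.t()                  # (Lq, Ld) token-level similarities
    return sim.max(dim=1).values.sum()       # MaxSim, then sum over query tokens

q = F.normalize(torch.randn(32, 128), dim=-1)
d = F.normalize(torch.randn(180, 128), dim=-1)
print(maxsim_score(q, d))
```
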
- Distilling Knowledge from Reader to Retriever for Question Answering. Gautier Izacard, Edouard Grave. ICLR 2021. [unofficial code] (Distill cross-attention of reader to retriever)
- Distilling Knowledge for Fast Retrieval-based Chat-bots. Amir Vakili Tahami et.al. SIGIR 2020. [code] (Distill from cross-encoders to bi-encoders)
- Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation. Sebastian Hofstätter et.al. Arxiv 2020. [code] (Margin-MSE distillation from a BERT ensemble; a loss sketch follows this group)
- Distilling Dense Representations for Ranking using Tightly-Coupled Teachers. Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin. Arxiv 2020. [code] (TCTColBERT: distill from ColBERT)
- RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking. Ruiyang Ren, Yingqi Qu et.al. EMNLP 2021. [code] (RocketQAv2, joint learning by distillation)
- Curriculum Contrastive Context Denoising for Few-shot Conversational Dense Retrieval. Kelong Mao et.al. SIGIR 2022.
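
A minimal sketch of the Margin-MSE objective used in the cross-architecture distillation entry above: the student (e.g. a bi-encoder) learns to reproduce the teacher cross-encoder's score margin between a positive and a negative passage, rather than its absolute scores.

```python
import torch
import torch.nn.functional as F

def margin_mse_loss(s_pos, s_neg, t_pos, t_neg):
    """s_*: student scores, t_*: teacher scores, all shaped (B,)."""
    return F.mse_loss(s_pos - s_neg, t_pos - t_neg)

student_pos, student_neg = torch.randn(16), torch.randn(16)
teacher_pos, teacher_neg = torch.randn(16), torch.randn(16)
print(margin_mse_loss(student_pos, student_neg, teacher_pos, teacher_neg))
```
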
- Latent Retrieval for Weakly Supervised Open Domain Question Answering. Kenton Lee et.al. ACL 2019. [code] (ORQA, ICT)
- Pre-training tasks for embedding-based large scale retrieval. Wei-Cheng Chang et.al. ICLR 2020. (ICT, BFS and WLP; an ICT data-construction sketch follows this group)
- REALM: Retrieval-Augmented Language Model Pre-Training. Kelvin Guu, Kenton Lee et.al. ICML 2020. [code] (REALM)
- Less is More: Pre-train a Strong Text Encoder for Dense Retrieval Using a Weak Decoder. Shuqi Lu, Di He, Chenyan Xiong et.al. EMNLP 2021. [code] (Seed)
- Condenser: a Pre-training Architecture for Dense Retrieval. Luyu Gao et.al. EMNLP 2021. [code] (Condenser)
- Unsupervised Context Aware Sentence Representation Pretraining for Multi-lingual Dense Retrieval. Ning Wu et.al. IJCAI 2022. [code] (CCP, cross-lingual pre-training)
- Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval. Luyu Gao et.al. ACL 2022. [code] (coCondenser)
- LaPraDoR: Unsupervised Pretrained Dense Retriever for Zero-Shot Text Retrieval. Canwen Xu, Daya Guo et.al. ACL 2022. [code] (LaPraDoR, ICT+dropout)
- A Contrastive Pre-training Approach to Learn Discriminative Autoencoder for Dense Retrieval. Xinyu Ma et.al. CIKM 2022. (CPADE, document term distribution-based contrastive pretraining)
- Pre-train a Discriminative Text Encoder for Dense Retrieval via Contrastive Span Prediction. Xinyu Ma et.al. SIGIR 2022. [code] (COSTA, group-wise contrastive learning)
- H-ERNIE: A Multi-Granularity Pre-Trained Language Model for Web Search. Xiaokai Chu et.al. SIGIR 2022. (H-ERNIE)
- Structure and Semantics Preserving Document Representations. Natraj Raman et.al. SIGIR 2022.
- Contriever: Unsupervised Dense Information Retrieval with Contrastive Learning. Gautier Izacard et.al. TMLR 2022. [code] (Contriever)
- Augmenting Document Representations for Dense Retrieval with Interpolation and Perturbation. Jeong et.al. ACL 2022. [code] (Augmentation for Dense Retrieval)
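
A minimal sketch of Inverse Cloze Task (ICT) data construction used by ORQA and several pre-training papers above: a randomly chosen sentence serves as a pseudo-query and the rest of the passage as its pseudo-relevant context, yielding training pairs without any labels. (The original formulation keeps the sentence in the context with some probability; that detail is omitted here.)

```python
import random

def ict_example(passage_sentences: list[str]) -> tuple[str, str]:
    idx = random.randrange(len(passage_sentences))
    pseudo_query = passage_sentences[idx]
    context = " ".join(s for i, s in enumerate(passage_sentences) if i != idx)
    return pseudo_query, context

sentences = [
    "The aurora borealis appears in high-latitude regions.",
    "It is caused by charged solar particles hitting the atmosphere.",
    "Displays are strongest around the equinoxes.",
]
print(ict_example(sentences))
```
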
- Joint Learning of Deep Retrieval Model and Product Quantization based Embedding Index. Han Zhang et.al. SIGIR 2021 short. [code] (Poeem; a Faiss product-quantization example follows this group)
- Jointly Optimizing Query Encoder and Product Quantization to Improve Retrieval Performance. Jingtao Zhan et.al. CIKM 2021. [code] (JPQ)
- Learning Discrete Representations via Constrained Clustering for Effective and Efficient Dense Retrieval. Jingtao Zhan et.al. WSDM 2022. [code] (RepCONC)
- Matching-oriented Embedding Quantization For Ad-hoc Retrieval. Shitao Xiao et.al. EMNLP 2021. [code] (MoPQ)
- Distill-VQ: Learning Retrieval Oriented Vector Quantization By Distilling Knowledge from Dense Embeddings. Shitao Xiao et.al. SIGIR 2022. [code]
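
The quantization papers above learn the encoder and the quantizer jointly; as a baseline point of reference, here is a minimal sketch of post-hoc product quantization with Faiss (listed in the resources at the end). The dimensions and parameters are illustrative.

```python
import faiss
import numpy as np

d, nlist, m, nbits = 768, 128, 64, 8               # dim, IVF cells, PQ sub-vectors, bits per code
xb = np.random.rand(10000, d).astype("float32")    # document embeddings
xq = np.random.rand(5, d).astype("float32")        # query embeddings

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
index.train(xb)
index.add(xb)
index.nprobe = 16                                   # IVF cells to visit at search time
distances, ids = index.search(xq, 10)
print(ids[0])
```
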
- Answering Complex Open-Domain Questions with Multi-Hop Dense Retrieval. Wenhan Xiong, Xiang Lorraine Li et.al. ICLR 2021. [code] (Iteratively encode the question and previously retrieved documents as query vectors)
- Multi-Task Retrieval for Knowledge-Intensive Tasks. Jean Maillard, Vladimir Karpukhin et.al. ACL 2021. (Multi-task learning)
- Evaluating Extrapolation Performance of Dense Retrieval. Jingtao Zhan et.al. CIKM 2022. [code]
- Pseudo-Relevance Feedback for Multiple Representation Dense Retrieval. Xiao Wang et.al. ICTIR 2021. (ColBERT-PRF)
- Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback. HongChien Yu et.al. CIKM 2021. [code] (ANCE-PRF)
- LoL: A Comparative Regularization Loss over Query Reformulation Losses for Pseudo-Relevance Feedback. Yunchang Zhu et.al. SIGIR 2022. [code] (LoL, Pseudo-relevance feedback)
- Implicit Feedback for Dense Passage Retrieval: A Counterfactual Approach. Shengyao Zhuang et.al. SIGIR 2022. [code] (CoRocchio, counterfactual Rocchio algorithm; a Rocchio sketch follows below)
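
A minimal sketch of Rocchio-style pseudo-relevance feedback in embedding space, the classic formulation behind dense PRF entries such as ANCE-PRF and CoRocchio above: the query vector is pulled toward the centroid of the top-ranked documents (the weights alpha and beta are illustrative).

```python
import numpy as np

def rocchio_update(q: np.ndarray, feedback_docs: np.ndarray,
                   alpha: float = 0.8, beta: float = 0.2) -> np.ndarray:
    """q: (dim,) query embedding; feedback_docs: (k, dim) top-k document embeddings."""
    return alpha * q + beta * feedback_docs.mean(axis=0)

q_new = rocchio_update(np.random.rand(768), np.random.rand(5, 768))
print(q_new.shape)
```
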
- Hard Negatives or False Negatives: Correcting Pooling Bias in Training Neural Ranking Models. Yinqiong Cai et.al. CIKM 2022.
- Complement Lexical Retrieval Model with Semantic Residual Embeddings. Luyu Gao et.al. ECIR 2021. (CLEAR)
- BERT-based Dense Retrievers Require Interpolation with BM25 for Effective Passage Retrieval. Shuai Wang et.al. ICTIR 2021. (a score-interpolation sketch follows this group)
- Progressively Optimized Bi-Granular Document Representation for Scalable Embedding Based Retrieval. Shitao Xiao et.al. WWW 2022. [code]
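
A minimal sketch of interpolating BM25 and dense-retriever scores, as studied in the ICTIR 2021 entry above; min-max normalization and the weight lambda are illustrative choices, not any paper's exact settings.

```python
def minmax(scores: dict[str, float]) -> dict[str, float]:
    lo, hi = min(scores.values()), max(scores.values())
    return {d: (s - lo) / (hi - lo + 1e-9) for d, s in scores.items()}

def interpolate(bm25: dict[str, float], dense: dict[str, float], lam: float = 0.5):
    bm25, dense = minmax(bm25), minmax(dense)
    docs = set(bm25) | set(dense)
    fused = {d: lam * bm25.get(d, 0.0) + (1 - lam) * dense.get(d, 0.0) for d in docs}
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)

print(interpolate({"d1": 12.3, "d2": 8.1}, {"d1": 0.62, "d3": 0.71}))
```
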
- Understanding the Behaviors of BERT in Ranking. Yifan Qiao et.al. Arxiv 2019. (Representation-focused and Interaction-focused)
- Passage Re-ranking with BERT. Rodrigo Nogueira et.al. Arxiv 2019. [code] (monoBERT: arguably the first work applying BERT to IR)
- Multi-Stage Document Ranking with BERT, The Expando-Mono-Duo Design Pattern for Text Ranking with Pretrained Sequence-to-Sequence Models. Rodrigo Nogueira et.al. Arxiv 2020. (Expando-Mono-Duo: doc2query+pointwise+pairwise)
- CEDR: Contextualized Embeddings for Document Ranking. Sean MacAvaney et.al. SIGIR 2019 short. [code] (CEDR: BERT+neuIR model)
- Beyond [CLS] through Ranking by Generation. Cicero Nogueira dos Santos et.al. EMNLP 2020 short. (Query generation using GPT and BART)
- Document Ranking with a Pretrained Sequence-to-Sequence Model. Rodrigo Nogueira, Zhiying Jiang et.al. EMNLP Findings 2020. [code] (monoT5: relevance token generation using T5; a scoring sketch follows this group)
- RankT5: Fine-Tuning T5 for Text Ranking with Ranking Losses. Honglei Zhuang et.al. Arxiv 2022.
- Generalizing Discriminative Retrieval Models using Generative Tasks. Bingsheng Liu, Hamed Zamani et.al. WWW 2021. (GDMTL,joint discriminative and generative model with multitask learning)
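
A minimal sketch of the relevance-token scoring described in the monoT5 entry above: the model reads "Query: ... Document: ... Relevant:" and the relevance score is the probability it assigns to generating "true" as the first decoded token. The checkpoint name is one of the publicly released monoT5 models; treat the prompt template and token choice as illustrative.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

name = "castorini/monot5-base-msmarco"
tokenizer = T5Tokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name).eval()

def relevance_score(query: str, doc: str) -> float:
    text = f"Query: {query} Document: {doc} Relevant:"
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    decoder_start = torch.full((1, 1), model.config.decoder_start_token_id)
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_start).logits[0, -1]
    true_id, false_id = tokenizer.encode("true")[0], tokenizer.encode("false")[0]
    probs = torch.softmax(logits[[false_id, true_id]], dim=0)
    return probs[1].item()   # probability of "true"

print(relevance_score("what causes aurora", "Auroras are caused by solar wind particles."))
```
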
- Deeper Text Understanding for IR with Contextual Neural Language Modeling. Zhuyun Dai et.al. SIGIR 2019 short. [code] (BERT-MaxP, BERT-firstP, BERT-sumP: Passage-level)
- Simple Applications of BERT for Ad Hoc Document Retrieval; Applying BERT to Document Retrieval with Birch; Cross-Domain Modeling of Sentence-Level Evidence for Document Retrieval. Wei Yang, Haotian Zhang et.al. Arxiv 2019; Zeynep Akkalyoncu Yilmaz et.al. EMNLP 2019 short. [code] (Birch: Sentence-level)
- Intra-Document Cascading: Learning to Select Passages for Neural Document Ranking. Sebastian Hofstätter et.al. SIGIR 2021. [code] (Distill a ranking model to conv-knrm to select top-k passages)
- PARADE: Passage Representation Aggregation for Document Reranking. Canjia Li et.al. Arxiv 2020. [code] (An extensive comparison of various Passage Representation Aggregation methods)
- Leveraging Passage-level Cumulative Gain for Document Ranking. Zhijing Wu et.al. WWW 2020. (PCGM)
- Local Self-Attention over Long Text for Efficient Document Retrieval. Sebastian Hofstätter et.al. SIGIR 2020 short. [code] (TKL: Transformer-Kernel for long text)
- Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching. Liu Yang et.al. CIKM 2020. [code] (SMITH for doc2doc matching)
- Socialformer: Social Network Inspired Long Document Modeling for Document Ranking. Yujia Zhou et.al. WWW 2022. (Socialformer)
- DC-BERT: Decoupling Question and Document for Efficient Contextual Encoding. Yuyu Zhang, Ping Nie et.al. SIGIR 2020 short. (DC-BERT)
- Efficient Document Re-Ranking for Transformers by Precomputing Term Representations. Sean MacAvaney et.al. SIGIR 2020. [code] (PreTTR)
- Modularized Transformer-based Ranking Framework. Luyu Gao et.al. EMNLP 2020. [code] (MORES, similar to PreTTR)
- TILDE: Term Independent Likelihood moDEl for Passage Re-ranking. Shengyao Zhuang, Guido Zuccon. SIGIR 2021. [code] (TILDE)
- Fast Forward Indexes for Efficient Document Ranking. Jurek Leonhardt et.al. WWW 2022. (Fast forward index)
- Understanding BERT Rankers Under Distillation. Luyu Gao et.al. ICTIR 2020. (LM Distill + Ranker Distill)
- Simplified TinyBERT: Knowledge Distillation for Document Retrieval. Xuanang Chen et.al. ECIR 2021. [code] (TinyBERT+knowledge distillation)
- Semi-Siamese Bi-encoder Neural Ranking Model Using Lightweight Fine-Tuning. Euna Jung, Jaekeol Choi et.al. WWW 2022. [code] (Lightweight Fine-Tuning)
- Scattered or Connected? An Optimized Parameter-efficient Tuning Approach for Information Retrieval. Xinyu Ma et.al. CIKM 2022. (IAA, introduce the aside module to stabilize training)
- The Cascade Transformer: an Application for Efficient Answer Sentence Selection. Luca Soldaini et.al. ACL 2020. [code] (Cascade Transformer: prune candidates by layer)
- Early Exiting BERT for Efficient Document Ranking. Ji Xin et.al. EMNLP 2020 SustaiNLP Workshop. [code] (Early exit)
- BERT-QE: Contextualized Query Expansion for Document Re-ranking. Zhi Zheng et.al. EMNLP 2020 Findings. [code] (BERT-QE)
- Training Curricula for Open Domain Answer Re-Ranking. Sean MacAvaney et.al. SIGIR 2020. [code] (curriculum learning based on BM25)
- Not All Relevance Scores are Equal: Efficient Uncertainty and Calibration Modeling for Deep Retrieval Models. Daniel Cohen et.al. SIGIR 2021.
- MarkedBERT: Integrating Traditional IR Cues in Pre-trained Language Models for Passage Retrieval. Lila Boualili et.al. SIGIR 2020 short. [code] (MarkedBERT)
- Selective Weak Supervision for Neural Information Retrieval. Kaitao Zhang et.al. WWW 2020. [code] (ReInfoSelect)
- PROP: Pre-training with Representative Words Prediction for Ad-hoc Retrieval. Xinyu Ma et.al. WSDM 2021. [code] (PROP)
- Cross-lingual Language Model Pretraining for Retrieval. Puxuan Yu et.al. WWW 2021.
- B-PROP: Bootstrapped Pre-training with Representative Words Prediction for Ad-hoc Retrieval. Xinyu Ma et.al. SIGIR 2021. [code] (B-PROP)
- Pre-training for Ad-hoc Retrieval: Hyperlink is Also You Need. Zhengyi Ma et.al. CIKM 2021. [code] (HARP)
- Contrastive Learning of User Behavior Sequence for Context-Aware Document Ranking. Yutao Zhu et.al. CIKM 2021. [code] (COCA)
- Pre-trained Language Model based Ranking in Baidu Search. Lixin Zou et.al. KDD 2021.
- A Unified Pretraining Framework for Passage Ranking and Expansion. Ming Yan et.al. AAAI 2021. (UED, jointly training ranking and query generation)
- Axiomatically Regularized Pre-training for Ad hoc Search. Jia Chen et.al. SIGIR 2022. [code] (ARES)
- Webformer: Pre-training with Web Pages for Information Retrieval. Yu Guo et.al. SIGIR 2022. (Webformer)
- Competitive Search. Oren Kurland et.al. SIGIR 2022.
- PRADA: Practical Black-Box Adversarial Attacks against Neural Ranking Models. Chen Wu et.al. Arxiv 2022.
- Order-Disorder: Imitation Adversarial Attacks for Black-box Neural Ranking Models. Jiawei Liu et.al. CCS 2022.
- Are Neural Ranking Models Robust? Chen Wu et.al. TOIS.
- Certified Robustness to Word Substitution Ranking Attack for Neural Ranking Models. Chen Wu et.al. CIKM 2022.
- Topic-oriented Adversarial Attacks against Black-box Neural Ranking Models. Yu-An Liu et.al. SIGIR 2023.
- Cross-lingual Retrieval for Iterative Self-Supervised Training. Chau Tran et.al. NeurIPS 2020. [code] (CRISS)
- CLIRMatrix: A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval. Shuo Sun et.al. EMNLP 2020. [code] (Multilingual dataset-CLIRMatrix and multilingual BERT)
- Adversarial Retriever-Ranker for dense text retrieval. Hang Zhang et.al. ICLR 2022. [code] (AR2)
- RankFlow: Joint Optimization of Multi-Stage Cascade Ranking Systems as Flows. Jiarui Qin et.al. SIGIR 2022. (RankFlow)
- Rethinking Search: Making Domain Experts out of Dilettantes. Donald Metzler et.al. SIGIR Forum 2021. (Envisioned the model-based IR system)
- Transformer Memory as a Differentiable Search Index. Yi Tay et.al. Arxiv 2022. (DSI)
- DynamicRetriever: A Pre-training Model-based IR System with Neither Sparse nor Dense Index. Yujia Zhou et.al. Arxiv 2022. (DynamicRetriever)
- A Neural Corpus Indexer for Document Retrieval. Yujing Wang et.al. Arxiv 2022. (NCI)
- Autoregressive Search Engines: Generating Substrings as Document Identifiers. Michele Bevilacqua et.al. Arxiv 2022. [code] (SEAL)
- CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks. Jiangui Chen et.al. CIKM 2022. [code] (CorpusBrain)
- A Unified Generative Retriever for Knowledge-Intensive Language Tasks via Prompt Learning. Jiangui Chen et.al. SIGIR 2023. [code] (UGR)
- TOME: A Two-stage Approach for Model-based Retrieval. Ruiyang Ren et.al. ACL 2023. (TOME: Passage generation then URL generation)
- How Does Generative Retrieval Scale to Millions of Passages? Ronak Pradeep, Kai Hui et.al. Arxiv 2023. (Comprehensive study on proposed methods, using synthetic queries as document ids)
- Semantic-Enhanced Differentiable Search Index Inspired by Learning Strategies. Yubao Tang et.al. KDD 2023. (Semantic-Enhanced DSI)
- Information Retrieval meets Large Language Models: A strategic report from Chinese IR community. Qingyao Ai et.al. The CCIR community. AI Open 2023.
- Large Language Models for Information Retrieval: A Survey. Yutao Zhu et.al. Renmin University of China. Arxiv 2023.
- Navigating Complex Search Tasks with AI Copilots. Ryen W. White (Microsoft Research). Arxiv 2023.
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Patrick Lewis, Ethan Perez et.al. NeurIPS 2020. (RAG, 440M BART)
- Improving Language Models by Retrieving from Trillions of Tokens. Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann et.al. ICML 2022. [code] (RETRO, enc-dec 7.5B)
- Atlas: Few-shot Learning with Retrieval Augmented Language Models. Gautier Izacard, Patrick Lewis et.al. Arxiv 2022. [code] (Atlas, T5, 11B)
- Internet-augmented language models through few-shot prompting for open-domain question answering. Angeliki Lazaridou et.al. Arxiv 2022. (Gopher 280B, Conditioning on Google search results)
- Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy. Zhihong Shao et.al. Arxiv 2023.
- Instruction Tuning post Retrieval-Augmented Pretraining. Boxin Wang et.al. Arxiv 2023.
- Improving Passage Retrieval with Zero-Shot Question Generation. Devendra Singh Sachan et.al. EMNLP 2022. [code] (UPR, rerank docs based on the query likelihood of GPT-neo 2.7B / T0 3B, 11B)
- Promptagator: Few-shot Dense Retrieval From 8 Examples. Zhuyun Dai et.al. ICLR 2023. (Generate pseudo queries using in-context learning, FLAN 137B)
- UDAPDR: Unsupervised Domain Adaptation via LLM Prompting and Distillation of Rerankers. Jon Saad-Falcon, Omar Khattab et.al. Arxiv 2023. [code] (Train a reranker on pseudo queries generated with GPT-3)
- InPars: Data Augmentation for Information Retrieval using Large Language Models. Luiz Bonifacio et.al. Arxiv 2022. [code] (Use GPT-3 Curie to generate pseudo queries with in-context learning; query generation probabilities select the top-k q-d pairs)
- InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval. Vitor Jeronymo et.al. Arxiv 2023. [code] (similar to InPars, uses the GPT-J 6B LLM and a finetuned reranker as selector)
- InPars-Light: Cost-Effective Unsupervised Training of Efficient Rankers. Leonid Boytsov et.al. Arxiv 2023. (similar to InPars, uses GPT-J 6B and BLOOM 7B)
- Generative Relevance Feedback with Large Language Models. Iain Mackie et.al. SIGIR 2023 short. (GRF, generates various content with GPT-3 for relevance feedback)
- Query Expansion by Prompting Large Language Models. Rolf Jagerman et.al. Arxiv 2023.
- Exploring the Viability of Synthetic Query Generation for Relevance Prediction. Aditi Chaudhary et.al. Arxiv 2023. (FLAN 137B, label-conditioned generation)
- Large Language Model based Long-tail Query Rewriting in Taobao Search. Wenjun Peng et.al. Arxiv 2023.
- Generate, Filter, and Fuse: Query Expansion via Multi-Step Keyword Generation for Zero-Shot Neural Rankers. Minghan Li et.al. Arxiv 2023. (Use Flan-PaLM2-S for keywords generation)
- Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval. Nandan Thakur et.al. Arxiv 2023.
- Generate rather than Retrieve: Large Language Models are Strong Context Generators. Wenhao Yu et.al. ICLR 2023. [code] (GenRead, generate pseudo docs with InstructGPT for the reader)
- Recitation-Augmented Language Models. Zhiqing Sun et.al. ICLR 2023. [code] (similar to GenRead)
- Precise Zero-Shot Dense Retrieval without Relevance Labels. Luyu Gao, Xueguang Ma et.al. Arxiv 2022. [code] (HyDE: InstructGPT generates a pseudo doc and Contriever retrieves the real one; a prompting sketch follows this group)
- Query2doc: Query Expansion with Large Language Models. Liang Wang et.al. Arxiv 2023. (Generate pseudo docs using in-context learning and then concat with queries, text-davinci-003)
- Large Language Models are Strong Zero-Shot Retriever. Tao Shen et.al. Arxiv 2023. (similar to HyDE, augments the LLM with docs retrieved using BM25)
- Generating Synthetic Documents for Cross-Encoder Re-Rankers: A Comparative Study of ChatGPT and Human Experts. Arian Askari et.al. Arxiv 2023. [code] (Ranking with synthetic data generated by ChatGPT)
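
A minimal sketch of HyDE / query2doc-style expansion from the entries above: prompt an LLM for a hypothetical passage answering the query, then embed it (HyDE) or append it to the query (query2doc). `call_llm` is a hypothetical helper standing in for whatever LLM client is used; the prompt and the repetition count are illustrative.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")  # hypothetical helper

def hypothetical_document(query: str) -> str:
    prompt = (
        "Write a short passage that answers the question.\n"
        f"Question: {query}\n"
        "Passage:"
    )
    return call_llm(prompt)

def expanded_query(query: str, repeat_query: int = 5) -> str:
    # query2doc-style: repeat the short query so it is not drowned out by the
    # long generated passage when fed to a lexical retriever such as BM25
    return " ".join([query] * repeat_query + [hypothetical_document(query)])
```
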
- Task-aware Retrieval with Instructions. Akari Asai, Timo Schick et.al. Arxiv 2022. [code] (TART, BERRI 40 tasks with instructions, 1.5B FLAN-T5)
- One Embedder, Any Task: Instruction-Finetuned Text Embeddings. Hongjin Su, Weijia Shi et.al. [code] (Instructor, 330 diverse tasks, 1.5B model)
- ExaRanker: Explanation-Augmented Neural Ranker. Fernando Ferraretto et.al. Arxiv 2023. [code] (Training monoT5 with both relevance score and explanations generated by GPT-3.5 (text-davinci-002))
- Perspectives on Large Language Models for Relevance Judgment. Guglielmo Faggioli et.al. Arxiv 2023. (Perspective Paper)
- Zero-Shot Listwise Document Reranking with a Large Language Model. Xueguang Ma et.al. Arxiv 2023. (LRL, generate a rank list with GPT-3)
- Large Language Models are Built-in Autoregressive Search Engines. Noah Ziems et.al. Arxiv 2023. (LLM-URL, use GPT-3 text-davinci-003 to generate URL, model-based IR)
- Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents. Weiwei Sun et.al. EMNLP 2023. [code] (RankGPT: zero-shot passage reranking with ChatGPT/GPT-4; a listwise prompting sketch follows this group)
- Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting. Zhen Qin et.al. Arxiv 2023.
- RankVicuna: Zero-Shot Listwise Document Reranking with Open-Source Large Language Models. Ronak Pradeep et.al. Arxiv 2023. [code]
- Found in the Middle: Permutation Self-Consistency Improves Listwise Ranking in Large Language Models. Raphael Tang, Xinyu Zhang et.al. Arxiv 2023. [code]
- Fine-Tuning LLaMA for Multi-Stage Text Retrieval. Xueguang Ma et.al. Arxiv 2023.
- A Setwise Approach for Effective and Highly Efficient Zero-shot Ranking with Large Language Models. Shengyao Zhuang et.al. Arxiv 2023.
- Open-source Large Language Models are Strong Zero-shot Query Likelihood Models for Document Ranking. Shengyao Zhuang et.al. Arxiv 2023. [code]
- PaRaDe: Passage Ranking using Demonstrations with Large Language Models. Andrew Drozdov et.al. Arxiv 2023.
- Beyond Yes and No: Improving Zero-Shot LLM Rankers via Scoring Fine-Grained Relevance Labels. Honglei Zhuang et.al. Arxiv 2023.
- Large Language Models can Accurately Predict Searcher Preferences. Paul Thomas et.al. Arxiv 2023.
- RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze! Ronak Pradeep et.al. Arxiv 2023.
- Rank-without-GPT: Building GPT-Independent Listwise Rerankers on Open-Source Large Language Models. Xinyu Zhang et.al. Arxiv 2023.
- ACID: Abstractive, Content-Based IDs for Document Retrieval with Language Models. Haoxin Li et.al. Arxiv 2023. (Uses GPT-3.5 to generate keyphrases)
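
A minimal sketch of listwise LLM reranking in the spirit of the LRL / RankGPT entries above: the prompt enumerates candidate passages and the model answers with a permutation such as "[2] > [1] > [3]". `call_llm` is again a hypothetical helper, and the prompt wording is illustrative.

```python
import re

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")  # hypothetical helper

def listwise_rerank(query: str, passages: list[str]) -> list[int]:
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Rank the following passages by relevance to the query.\n"
        f"Query: {query}\n{numbered}\n"
        "Answer with the ranking only, e.g. [2] > [1] > [3]."
    )
    reply = call_llm(prompt)
    order, seen = [], set()
    for match in re.findall(r"\[(\d+)\]", reply):
        i = int(match) - 1
        if 0 <= i < len(passages) and i not in seen:
            order.append(i)
            seen.add(i)
    # fall back to the original order for passages the model failed to mention
    return order + [i for i in range(len(passages)) if i not in seen]
```
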
- WebGPT: Browser-assisted question-answering with human feedback. Reiichiro Nakano, Jacob Hilton, Suchir Balaji et.al. Arxiv 2022. (WebGPT, GPT-3)
- Teaching language models to support answers with verified quotes. DeepMind. Arxiv 2022. (GopherCite)
- Evaluating Verifiability in Generative Search Engines. Nelson F. Liu et.al. Arxiv 2023. [code]
- Enabling Large Language Models to Generate Text with Citations. Tianyu Gao et.al. Arxiv 2023. [code] (ALCE benchmark)
- FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation. Tu Vu et.al. Arxiv 2023. [code]
- Retrieve Anything To Augment Large Language Models. Peitian Zhang, Shitao Xiao et.al. Arxiv 2023. [code]
- Leveraging Event Schema to Ask Clarifying Questions for Conversational Legal Case Retrieval. Bulou Liu et.al. CIKM 2023.
- Know Where to Go: Make LLM a Relevant, Responsible, and Trustworthy Searcher. Xiang Shi et.al.
- Evaluating Generative Ad Hoc Information Retrieval. Lukas Gienapp et.al. Arxiv 2023.
- Demonstrate–Search–Predict: Composing retrieval and language models for knowledge-intensive NLP. Omar Khattab et.al. Arxiv 2023. [code] (DSP program, GPT-3.5)
- Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training. Gen Li, Nan Duan et.al. AAAI 2020. [code] (Unicoder-VL)
- XGPT: Cross-modal Generative Pre-Training for Image Captioning. Qiaolin Xia, Haoyang Huang, Nan Duan et.al. Arxiv 2020. [code] (XGPT)
- UNITER: UNiversal Image-TExt Representation Learning. Yen-Chun Chen, Linjie Li et.al. ECCV 2020. [code] (UNITER)
- Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. Xiujun Li, Xi Yin et.al. ECCV 2020. [code] (Oscar)
- VinVL: Making Visual Representations Matter in Vision-Language Models. Pengchuan Zhang, Xiujun Li et.al. CVPR 2021. [code] (VinVL)
- Dynamic Modality Interaction Modeling for Image-Text Retrieval. Leigang Qu et.al. SIGIR 2021 Best student paper. [code] (DIME)
- ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. Jiasen Lu, Dhruv Batra et.al. NeurIPS 2019. [code] (ViLBERT)
- 12-in-1: Multi-Task Vision and Language Representation Learning. Jiasen Lu, Dhruv Batra et.al. CVPR 2020. [code] (A multi-task model based on ViLBERT)
- Learning Transferable Visual Models From Natural Language Supervision. Alec Radford et.al. ICML 2021. [code] (CLIP, from OpenAI)
- ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph. Fei Yu, Jiji Tang et.al. Arxiv 2020. [code] (ERNIE-ViL, 1st place on the VCR leaderboard)
- M6-v0: Vision-and-Language Interaction for Multi-modal Pretraining. Junyang Lin, An Yang et.al. KDD 2020. (M6-v0/InterBERT)
- M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training. Haoyang Huang, Lin Su et.al. CVPR 2021. [code] (M3P, MILD dataset)
- Faiss: a library for efficient similarity search and clustering of dense vectors
- Pyserini: a Python Toolkit to Support Sparse and Dense Representations
- MatchZoo: a library consisting of many popular neural text matching models
- Pre-trained Models for Natural Language Processing: A Survey. Xipeng Qiu et.al.
- BERT-related-papers
- Pre-trained Language Model Papers from THU-NLP
- Efficient Transformers: A Survey. Yi Tay, Mostafa Dehghani et.al. Arxiv 2020.