A curated list of resources (papers, code, data) on video understanding research, sorted by release date.
🚀 This repo will be continuously updated.
⭐️ Please Star it if you find it helpful!
🤝 Feel free to submit a PR or open an issue with suggestions or improvements.
Name | Paper | Task | Note |
---|---|---|---|
TraveLER @Berkeley | TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering 24.04.01 / EMNLP'24 / Project Page | QA | / |
VideoAgent @BIGAI | VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding 24.03.18 / ECCV'24 / Project Page | QA / Temporal Grounding | / |
VideoAgent @Stanford | VideoAgent: Long-form Video Understanding with Large Language Model as Agent 24.03.15 / ECCV'24 / Project Page | QA | / |
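The agent frameworks above share a common control flow: sample a few frames, let an LLM judge whether the gathered evidence answers the question, and otherwise plan which frames to inspect next. Below is a minimal, self-contained sketch of that loop; every function and name here is a hypothetical stub for illustration, not an API from any of the papers.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Decision:
    sufficient: bool                      # does the LLM think it can answer yet?
    next_frame_ids: list = field(default_factory=list)

# Hypothetical stubs standing in for a real captioner / LLM backend.
def caption_frames(video, frame_ids):
    return [f"frame {i}: <caption>" for i in frame_ids]

def llm_decide(question, memory):
    # A real agent prompts the LLM with the question plus the evidence so far
    # and parses either a "ready to answer" signal or new frames to fetch.
    return Decision(len(memory) >= 16, random.sample(range(100), 4))

def llm_answer(question, memory):
    return "<answer grounded in collected evidence>"

def video_qa_agent(video, question, max_rounds=5, init_frames=8):
    n = len(video)
    frame_ids = [i * n // init_frames for i in range(init_frames)]  # coarse uniform pass
    memory = []                                                     # accumulated textual evidence
    for _ in range(max_rounds):
        memory += caption_frames(video, frame_ids)
        decision = llm_decide(question, memory)
        if decision.sufficient:
            break
        frame_ids = decision.next_frame_ids                         # LLM-planned re-sampling
    return llm_answer(question, memory)

print(video_qa_agent(list(range(100)), "What happens after the goal?"))
```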
Name | Paper | Metadata | Note |
---|---|---|---|
HourVideo @Stanford | HourVideo: 1-Hour Video-Language Understanding 24.11.07 / NIPS'24 D&B / Project Page | LLM+Human Annotated / 500 videos / 20~120m / 13K QAs | Long / Egocentric |
TOMATO @Yale | TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models 24.10.31 / ArXiv / Project Page | Human Annotated / 1.4K videos / 0~72s / 1.5K QAs | / |
TemporalBench @UWM | TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models 24.10.14 / ArXiv / Project Page | Human+LLM Annotated / 2K videos / 0~20m / 10K QAs | / |
LongVideoBench @NTU | LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding 24.07.22 / NIPS'24 D&B / Project Page | Human Annotated / 3.8K videos / 0~1h / 6.7K QAs | / |
LVBench @Zhipu | LVBench: An Extreme Long Video Understanding Benchmark 24.06.12 / ArXiv / Project Page | Human Annotated / 500 videos / avg. 1h / 1.5K QAs | / |
VideoMME @VideoMME-Team | Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis 24.05.31 / ArXiv / Project Page | Human Annotated / 900 videos / 0~60m / 2.7K QAs | / |
TempCompass @PKU | TempCompass: Do Video LLMs Really Understand Videos? 24.03.01 / ACL'24 Findings / Project Page | ChatGPT+Human Annotated / 410 videos / 0~35s / 7.5K QAs | / |
Name | Paper | Metadata | Note |
---|---|---|---|
VDC @UW | AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark 24.10.04 / ArXiv / Project Page | GPT-4o Annotated / 1027 videos / 0~60s / 1027 captions | Evaluates captioning using QAs |
Name | Paper | Metadata | Note |
---|---|---|---|
QVHighlights @UNC | QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries 21.07.20 / NIPS'21 / Project Page | Human Annotated / 10K videos / avg. 150s / 10K queries | / |
Charades-STA @USC | TALL: Temporal Activity Localization via Language Query 17.05.05 / ICCV'17 / Project Page | Rule+Human Annotated / 4233 clip-sentence pairs | / |
ActivityNet Captions @Stanford | Dense-Captioning Events in Videos 17.05.05 / ICCV'17 / Project Page | Human Annotated / 20K videos / 0~270s | / |
YouCook2 @Google Brain | Towards Automatic Learning of Procedures from Web Instructional Videos 17.03.28 / AAAI'18 / Project Page | Human Annotated / 2K videos / 0~800s / avg. 7.7 segments per video | / |
Name | Paper | Metadata | Note |
---|---|---|---|
FineGym @CUHK | FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding 20.04.14 / CVPR'20 / Project Page | Human Annotated | / |
Name | Paper | Metadata | Note |
---|---|---|---|
VideoHallucer @BIGAI | VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models 24.06.24 / ArXiv / Project Page | Rule+Human Annotated / 948 videos / 7~187s / 1.8K QAs | / |
Name | Paper | Data | Metadata |
---|---|---|---|
ShareGPTVideo @CMU | Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward 24.04.01 / ArXiv / Project Page | Dataset | GPT-4V Annotated (10 frames) / 900K Videos / 900K Captions |
Name | Paper | Data | Metadata |
---|---|---|---|
LLaVA-Video-178K @ByteDance | Video Instruction Tuning with Synthetic Data 24.10.03 / ArXiv / Project Page | Dataset | GPT-4o Annotated (1 FPS) / 178K videos / 0~3m / 178K Captions / 1.1M QAs |
ShareGPTVideo @CMU | Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward 24.04.01 / ArXiv / Project Page | Dataset | GPT-4V Annotated (10 frames) / 900K Videos / 900K Captions / 900K QAs |
Name | Paper | Data | Metadata |
---|---|---|---|
ShareGPTVideo @CMU | Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward 24.04.01 / ArXiv / Project Page | Dataset | ChatGPT Annotated / 17K videos / 17K preference data |
Name | Paper | Note |
---|---|---|
ElasticTok @Berkeley | ElasticTok: Adaptive Tokenization for Image and Video 24.10.10 / ArXiv / Project Page | Visual Tokenizer |
VideoPrism @Google | VideoPrism: A Foundational Visual Encoder for Video Understanding 24.02.20 / ICML'24 / Project Page | Video Encoder |
MMVP @NYU | Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs 24.01.11 / ArXiv / Project Page | Hybrid Encoder (sketch below) |
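For the hybrid-encoder idea behind MMVP, here is a minimal sketch of mixing a semantic (CLIP-style) and a spatial (DINO-style) feature stream, assuming generic `clip_encoder` / `dino_encoder` modules that return per-patch features; all module and argument names are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class HybridEncoder(nn.Module):
    """Mixture-of-features sketch: project two vision backbones into the LLM
    embedding space and interleave their patch tokens."""

    def __init__(self, clip_encoder, dino_encoder, d_clip, d_dino, d_llm):
        super().__init__()
        self.clip, self.dino = clip_encoder, dino_encoder
        self.proj_clip = nn.Linear(d_clip, d_llm)  # semantic stream -> LLM space
        self.proj_dino = nn.Linear(d_dino, d_llm)  # spatial stream  -> LLM space

    def forward(self, images):
        c = self.proj_clip(self.clip(images))      # (B, N, d_llm) semantic tokens
        d = self.proj_dino(self.dino(images))      # (B, N, d_llm) spatial tokens
        B, N, D = c.shape
        # Interleave token-wise so downstream layers see both views per patch.
        return torch.stack([c, d], dim=2).reshape(B, 2 * N, D)
```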
Name | Paper | Note |
---|---|---|
RLT @CMU | Don't Look Twice: Faster Video Transformers with Run-Length Tokenization 24.11.07 / NIPS'24 / Project Page | Run-Length Tokenization (sketch below) |
InTI @NJU | Dynamic and Compressive Adaptation of Transformers From Images to Videos 24.08.13 / ECCV'24 / Project Page | Dynamic inter-frame token interpolation |
Cambrian-1 @NYU | Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs 24.06.24 / ArXiv / Project Page | Spatial Vision Aggregator |
FastV @PKU | An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models 24.03.11 / ECCV'24 / Project Page | Prunes visual tokens after layer 2 (sketch below) |
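RLT's run-length idea can be illustrated in a few lines: drop a patch token when it barely differs from the same spatial patch in the previous frame, and record how long each surviving token "runs". This sketch compares patch features with a simple L1 threshold (the paper operates on raw patches); the function name and `tau` are assumptions, not the authors' API.

```python
import torch

def run_length_tokenize(patches: torch.Tensor, tau: float = 0.1):
    """patches: (T, N, D) patch features for T frames, N patches per frame.
    Returns the kept tokens and a run length per kept token."""
    T, N, _ = patches.shape
    kept, lengths = [], []
    last_kept = {}                            # spatial slot -> index into `kept`
    for t in range(T):
        for n in range(N):
            if t > 0:
                diff = (patches[t, n] - patches[t - 1, n]).abs().mean()
                if diff < tau:                # static patch: extend previous run
                    lengths[last_kept[n]] += 1
                    continue
            kept.append(patches[t, n])        # changed patch: start a new token
            lengths.append(1)
            last_kept[n] = len(kept) - 1
    return torch.stack(kept), torch.tensor(lengths)

# Example: a fully static 8-frame clip collapses to one frame's worth of tokens.
tokens, runs = run_length_tokenize(torch.randn(1, 196, 768).repeat(8, 1, 1))
print(tokens.shape, runs[:3])                 # torch.Size([196, 768]), runs of 8
```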
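The FastV note can likewise be made concrete: after an early layer (layer 2 in the paper), rank visual tokens by the average attention they receive and keep only the top fraction. A PyTorch sketch assuming a (B, H, L, L) attention map and a contiguous visual-token span; the function name and signature are illustrative, not FastV's actual code.

```python
import torch

def prune_visual_tokens(hidden, attn, vis_start, vis_end, keep_ratio=0.5):
    """hidden: (B, L, D) states entering the next layer.
    attn: (B, H, L, L) attention weights from the layer just computed."""
    B, L, D = hidden.shape
    dev = hidden.device
    # Average attention each visual token receives (over heads and queries).
    scores = attn.mean(dim=1)[:, :, vis_start:vis_end].mean(dim=1)   # (B, n_vis)
    n_keep = max(1, int(keep_ratio * (vis_end - vis_start)))
    keep = scores.topk(n_keep, dim=-1).indices.sort(dim=-1).values + vis_start
    idx = torch.cat([
        torch.arange(vis_start, device=dev).expand(B, -1),           # prompt prefix
        keep,                                                        # surviving visual tokens
        torch.arange(vis_end, L, device=dev).expand(B, -1),          # text suffix
    ], dim=1)
    return torch.gather(hidden, 1, idx.unsqueeze(-1).expand(-1, -1, D))

# Example: keep half of 576 visual tokens sitting at positions 32..608.
h = torch.randn(2, 640, 4096)
a = torch.softmax(torch.randn(2, 32, 640, 640), dim=-1)
print(prune_visual_tokens(h, a, 32, 608).shape)                      # (2, 352, 4096)
```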
Name | Paper | Note |
---|---|---|
Streaming_VDC | Streaming Dense Video Captioning 24.04.01 / CVPR'24 / Project Page | Framework |