# awesome-video-understanding

A curated list of resources (papers, code, data) on video understanding research, sorted by release date.

🚀 This repo will be continuously updated.
⭐️ Please Star it if you find it helpful!
🤝 Feel free to submit a PR or open an issue with suggestions or improvements.


## Table of Contents

- [Models](#models)
  - [Large Multimodal Models](#large-multimodal-models)
  - [Agents](#agents)
- [Benchmarks](#benchmarks)
  - [General QA](#general-qa)
  - [Caption](#caption)
  - [Temporal Grounding](#temporal-grounding)
  - [Action Recognition](#action-recognition)
  - [Hallucination](#hallucination)
- [Datasets](#datasets)
  - [Pre-Training](#pre-training)
  - [Instruction-Tuning](#instruction-tuning)
  - [RLHF](#rlhf)
- [Research Topics](#research-topics)
  - [Visual Encoding](#visual-encoding)
  - [Visual Token Reduction](#visual-token-reduction)
  - [Streaming](#streaming)


## Models

### Large Multimodal Models

| Name | Paper | Task | Note |
|------|-------|------|------|
| LongVU<br>@Meta | LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding<br>24.10.22 / ArXiv / Project Page | General QA | |
| Aria<br>@Rhymes AI | Aria: An Open Multimodal Native Mixture-of-Experts Model<br>24.10.08 / ArXiv / Project Page | General QA / Caption | |
| LLaVA-Video<br>@ByteDance | Video Instruction Tuning with Synthetic Data<br>24.10.03 / ArXiv / Project Page | General QA / Caption | |
| Oryx<br>@THU | Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution<br>24.09.19 / ArXiv / Project Page | General QA / Caption | |
| Qwen2-VL<br>@Qwen | Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution<br>24.09.18 / ArXiv / Project Page | General QA / Caption | |
| LLaVA-OneVision<br>@ByteDance | LLaVA-OneVision: Easy Visual Task Transfer<br>24.08.06 / ArXiv / Project Page | General QA / Caption | |
| InternVL-2<br>@OpenGVLab | InternVL2: Better than the Best—Expanding Performance Boundaries of Open-Source Multimodal Models with the Progressive Scaling Strategy<br>24.07.04 / Blog / Project Page | General QA / Caption | |
| VideoLLaMA 2<br>@Alibaba | VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs<br>24.06.11 / ArXiv / Project Page | General QA / Caption | |
| LWM<br>@Berkeley | World Model on Million-Length Video And Language With Blockwise RingAttention<br>24.02.13 / ArXiv / Project Page | General QA | |
| VILA<br>@Nvidia | VILA: On Pre-training for Visual Language Models<br>23.12.12 / CVPR'24 / Project Page | General QA / Caption | |

### Agents

| Name | Paper | Task | Note |
|------|-------|------|------|
| TraveLER<br>@Berkeley | TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering<br>24.04.01 / EMNLP'24 / Project Page | QA | |
| VideoAgent<br>@BIGAI | VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding<br>24.03.18 / ECCV'24 / Project Page | QA / Temporal Grounding | |
| VideoAgent<br>@Stanford | VideoAgent: Long-form Video Understanding with Large Language Model as Agent<br>24.03.15 / ECCV'24 / Project Page | QA | |

## Benchmarks

### General QA

| Name | Paper | Metadata | Note |
|------|-------|----------|------|
| HourVideo<br>@Stanford | HourVideo: 1-Hour Video-Language Understanding<br>24.11.07 / NIPS'24 D&B / Project Page | LLM+Human Annotated / 500 videos / 20~120m / 13K QAs | Long / Egocentric |
| TOMATO<br>@Yale | TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models<br>24.10.31 / ArXiv / Project Page | Human Annotated / 1.4K videos / 0~72s / 1.5K QAs | |
| TemporalBench<br>@UWM | TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models<br>24.10.14 / ArXiv / Project Page | Human+LLM Annotated / 2K videos / 0~20m / 10K QAs | |
| LongVideoBench<br>@NTU | LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding<br>24.07.22 / NIPS'24 D&B / Project Page | Human Annotated / 3.8K videos / 0~1h / 6.7K QAs | |
| LVBench<br>@Zhipu | LVBench: An Extreme Long Video Understanding Benchmark<br>24.06.12 / ArXiv / Project Page | Human Annotated / 500 videos / avg. 1h / 1.5K QAs | |
| VideoMME<br>@VideoMME-Team | Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis<br>24.05.31 / ArXiv / Project Page | Human Annotated / 900 videos / 0~60m / 2.7K QAs | |
| TempCompass<br>@PKU | TempCompass: Do Video LLMs Really Understand Videos?<br>24.03.01 / ACL'24 Findings / Project Page | ChatGPT+Human Annotated / 410 videos / 0~35s / 7.5K QAs | |

### Caption

| Name | Paper | Metadata | Note |
|------|-------|----------|------|
| VDC<br>@UW | AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark<br>24.10.04 / ArXiv / Project Page | GPT-4o Annotated / 1027 videos / 0~60s / 1027 captions | Evaluates captioning using QAs |

### Temporal Grounding

| Name | Paper | Metadata | Note |
|------|-------|----------|------|
| QVHighlights<br>@UNC | QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries<br>21.07.20 / NIPS'21 / Project Page | Human Annotated / 10K videos / avg. 150s / 10K queries | |
| Charades-STA<br>@USC | TALL: Temporal Activity Localization via Language Query<br>17.05.05 / ICCV'17 / Project Page | Rule+Human Annotated / 4233 clip-sentence pairs | |
| ActivityNet Captions<br>@Stanford | Dense-Captioning Events in Videos<br>17.05.05 / ICCV'17 / Project Page | Human Annotated / 20K videos / 0~270s | |
| YouCook2<br>@Google Brain | Towards Automatic Learning of Procedures from Web Instructional Videos<br>17.03.28 / AAAI'18 / Project Page | Human Annotated / 2K videos / 0~800s / avg. 7.7 segments per video | |

### Action Recognition

| Name | Paper | Metadata | Note |
|------|-------|----------|------|
| FineGym<br>@CUHK | FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding<br>20.04.14 / CVPR'20 / Project Page | Human Annotated | |

### Hallucination

| Name | Paper | Metadata | Note |
|------|-------|----------|------|
| VideoHallucer<br>@BIGAI | VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models<br>24.06.24 / ArXiv / Project Page | Rule+Human Annotated / 948 videos / 7~187s / 1.8K QAs | |

## Datasets

### Pre-Training

| Name | Paper | Data | Metadata |
|------|-------|------|----------|
| ShareGPTVideo<br>@CMU | Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward<br>24.04.01 / ArXiv / Project Page | Dataset | GPT-4V Annotated (10 frames) / 900K Videos / 900K Captions |

### Instruction-Tuning

| Name | Paper | Data | Metadata |
|------|-------|------|----------|
| LLaVA-Video-178K<br>@ByteDance | Video Instruction Tuning with Synthetic Data<br>24.10.03 / ArXiv / Project Page | Dataset | GPT-4o Annotated (1 FPS) / 178K videos / 0~3m / 178K Captions / 1.1M QAs |
| ShareGPTVideo<br>@CMU | Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward<br>24.04.01 / ArXiv / Project Page | Dataset | GPT-4V Annotated (10 frames) / 900K Videos / 900K Captions / 900K QAs |

### RLHF

| Name | Paper | Data | Metadata |
|------|-------|------|----------|
| ShareGPTVideo<br>@CMU | Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward<br>24.04.01 / ArXiv / Project Page | Dataset | ChatGPT Annotated / 17K videos / 17K preference data (example format below) |
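
For orientation, DPO-style preference data of this kind is generally organized as (prompt, chosen, rejected) triples attached to a video. The record below is a minimal hypothetical sketch of such a layout in Python; the field names and values are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical layout of a single video preference record for DPO-style training.
# Field names and values are illustrative only, not the dataset's actual schema.
preference_example = {
    "video": "videos/example_clip.mp4",  # path or ID of the source video
    "prompt": "What is the person in the video doing?",
    "chosen": "The person is chopping vegetables on a cutting board.",  # preferred answer
    "rejected": "The person is playing a guitar in a park.",            # dispreferred answer
}
```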

## Research Topics

### Visual Encoding

| Name | Paper | Note |
|------|-------|------|
| ElasticTok<br>@Berkeley | ElasticTok: Adaptive Tokenization for Image and Video<br>24.10.10 / ArXiv / Project Page | Visual Tokenizer |
| VideoPrism<br>@Google | VideoPrism: A Foundational Visual Encoder for Video Understanding<br>24.02.20 / ICML'24 / Project Page | Video Encoder |
| MMVP<br>@NYU | Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs<br>24.01.11 / ArXiv / Project Page | Hybrid Encoder |

### Visual Token Reduction

| Name | Paper | Note |
|------|-------|------|
| RLT<br>@CMU | Don't Look Twice: Faster Video Transformers with Run-Length Tokenization<br>24.11.07 / NIPS'24 / Project Page | Run-Length Tokenization |
| InTI<br>@NJU | Dynamic and Compressive Adaptation of Transformers From Images to Videos<br>24.08.13 / ECCV'24 / Project Page | Dynamic inter-frame token interpolation |
| Cambrian-1<br>@NYU | Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs<br>24.06.24 / ArXiv / Project Page | Spatial Vision Aggregator |
| FastV<br>@PKU | An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models<br>24.03.11 / ECCV'24 / Project Page | Prune tokens after layer 2 (see sketch below) |
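
As a concrete illustration of the FastV entry above, the snippet below is a minimal PyTorch sketch (not the authors' implementation) of pruning visual tokens after an early decoder layer: visual tokens are ranked by the attention they receive, and only the top fraction is propagated to later layers. The function name, the ranking heuristic (mean received attention rather than FastV's exact criterion), and the default keep ratio are illustrative assumptions.

```python
import torch


def prune_visual_tokens(hidden_states: torch.Tensor,
                        attn_weights: torch.Tensor,
                        visual_start: int,
                        visual_end: int,
                        keep_ratio: float = 0.5) -> torch.Tensor:
    """FastV-style sketch: after an early layer, drop the visual tokens that
    receive the least attention and keep the rest for subsequent layers.

    hidden_states: (batch, seq_len, dim) activations after the chosen layer
    attn_weights:  (batch, heads, seq_len, seq_len) attention maps of that layer
    visual_start/visual_end: slice of the sequence holding the visual tokens
    keep_ratio: fraction of visual tokens to keep (illustrative default)
    """
    # Average, over heads and query positions, the attention each visual token receives.
    received = attn_weights.mean(dim=1)                            # (B, S, S)
    scores = received[:, :, visual_start:visual_end].mean(dim=1)   # (B, n_visual)

    n_visual = visual_end - visual_start
    k = max(1, int(n_visual * keep_ratio))
    keep_idx = scores.topk(k, dim=-1).indices.sort(dim=-1).values  # preserve temporal order

    kept_visual = torch.stack([
        hidden_states[b, visual_start:visual_end][keep_idx[b]]
        for b in range(hidden_states.size(0))
    ])
    # Reassemble the sequence: prefix tokens + pruned visual tokens + remaining text tokens.
    return torch.cat(
        [hidden_states[:, :visual_start], kept_visual, hidden_states[:, visual_end:]],
        dim=1,
    )
```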

### Streaming

| Name | Paper | Note |
|------|-------|------|
| Streaming_VDC<br>@Google | Streaming Dense Video Captioning<br>24.04.01 / CVPR'24 / Project Page | Framework |
