# awesome-video-understanding

A curated list of resources (papers, code, data) on video understanding research, sorted by release date.

🚀 This repo will be continuously updated.
⭐️ Please Star it if you find it helpful!
🤝 Feel free to submit a PR or open an issue with suggestions or improvements.


## Table of Contents

- [Models](#models)
  - [Large Multimodal Models](#large-multimodal-models)
  - [Agents](#agents)
- [Benchmarks](#benchmarks)
  - [General QA](#general-qa)
  - [Caption](#caption)
  - [Temporal Grounding](#temporal-grounding)
  - [Action Recognition](#action-recognition)
  - [Hallucination](#hallucination)
- [Datasets](#datasets)
  - [Pre-Training](#pre-training)
  - [Instruction-Tuning](#instruction-tuning)
  - [RLHF](#rlhf)
- [Research Topics](#research-topics)
  - [Visual Encoding](#visual-encoding)
  - [Visual Token Reduction](#visual-token-reduction)
  - [Streaming](#streaming)


## Models

### Large Multimodal Models

| Name | Paper | Task | Note |
|------|-------|------|------|
| LongVU<br>@Meta | LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding<br>24.10.22 / ArXiv / Project Page | General QA | |
| Aria<br>@Rhymes AI | Aria: An Open Multimodal Native Mixture-of-Experts Model<br>24.10.08 / ArXiv / Project Page | General QA / Caption | |
| LLaVA-Video<br>@ByteDance | Video Instruction Tuning with Synthetic Data<br>24.10.03 / ArXiv / Project Page | General QA / Caption | |
| Oryx<br>@THU | Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution<br>24.09.19 / ArXiv / Project Page | General QA / Caption | |
| Qwen2-VL<br>@Qwen | Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution<br>24.09.18 / ArXiv / Project Page | General QA / Caption | |
| LLaVA-OneVision<br>@ByteDance | LLaVA-OneVision: Easy Visual Task Transfer<br>24.08.06 / ArXiv / Project Page | General QA / Caption | |
| InternVL-2<br>@OpenGVLab | InternVL2: Better than the Best—Expanding Performance Boundaries of Open-Source Multimodal Models with the Progressive Scaling Strategy<br>24.07.04 / Blog / Project Page | General QA / Caption | |
| VideoLLaMA 2<br>@Alibaba | VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs<br>24.06.11 / ArXiv / Project Page | General QA / Caption | |
| LWM<br>@Berkeley | World Model on Million-Length Video And Language With Blockwise RingAttention<br>24.02.13 / ArXiv / Project Page | General QA | |
| VILA<br>@Nvidia | VILA: On Pre-training for Visual Language Models<br>23.12.12 / CVPR'24 / Project Page | General QA / Caption | |

### Agents

| Name | Paper | Task | Note |
|------|-------|------|------|
| TraveLER<br>@Berkeley | TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering<br>24.04.01 / EMNLP'24 / Project Page | QA | |
| VideoAgent<br>@BIGAI | VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding<br>24.03.18 / ECCV'24 / Project Page | QA / Temporal Grounding | |
| VideoAgent<br>@Stanford | VideoAgent: Long-form Video Understanding with Large Language Model as Agent<br>24.03.15 / ECCV'24 / Project Page | QA | |

## Benchmarks

### General QA

| Name | Paper | Metadata | Note |
|------|-------|----------|------|
| HourVideo<br>@Stanford | HourVideo: 1-Hour Video-Language Understanding<br>24.11.07 / NIPS'24 D&B / Project Page | LLM+Human Annotated / 500 videos / 20~120m / 13K QAs | Long / Egocentric |
| TOMATO<br>@Yale | TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models<br>24.10.31 / ArXiv / Project Page | Human Annotated / 1.4K videos / 0~72s / 1.5K QAs | |
| TemporalBench<br>@UWM | TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models<br>24.10.14 / ArXiv / Project Page | Human+LLM Annotated / 2K videos / 0~20m / 10K QAs | |
| LongVideoBench<br>@NTU | LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding<br>24.07.22 / NIPS'24 D&B / Project Page | Human Annotated / 3.8K videos / 0~1h / 6.7K QAs | |
| LVBench<br>@Zhipu | LVBench: An Extreme Long Video Understanding Benchmark<br>24.06.12 / ArXiv / Project Page | Human Annotated / 500 videos / avg. 1h / 1.5K QAs | |
| VideoMME<br>@VideoMME-Team | Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis<br>24.05.31 / ArXiv / Project Page | Human Annotated / 900 videos / 0~60m / 2.7K QAs | |
| TempCompass<br>@PKU | TempCompass: Do Video LLMs Really Understand Videos?<br>24.03.01 / ACL'24 Findings / Project Page | ChatGPT+Human Annotated / 410 videos / 0~35s / 7.5K QAs | |

### Caption

| Name | Paper | Metadata | Note |
|------|-------|----------|------|
| VDC<br>@UW | AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark<br>24.10.04 / ArXiv / Project Page | GPT-4o Annotated / 1027 videos / 0~60s / 1027 captions | Evaluates captioning using QAs |

### Temporal Grounding

| Name | Paper | Metadata | Note |
|------|-------|----------|------|
| QVHighlights<br>@UNC | QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries<br>21.07.20 / NIPS'21 / Project Page | Human Annotated / 10K videos / avg. 150s / 10K queries | |
| Charades-STA<br>@USC | TALL: Temporal Activity Localization via Language Query<br>17.05.05 / ICCV'17 / Project Page | Rule+Human Annotated / 4233 clip-sentence pairs | |
| ActivityNet Captions<br>@Stanford | Dense-Captioning Events in Videos<br>17.05.05 / ICCV'17 / Project Page | Human Annotated / 20K videos / 0~270s | |
| YouCook2<br>@Google Brain | Towards Automatic Learning of Procedures from Web Instructional Videos<br>17.03.28 / AAAI'18 / Project Page | Human Annotated / 2K videos / 0~800s / avg. 7.7 segments per video | |

### Action Recognition

| Name | Paper | Metadata | Note |
|------|-------|----------|------|
| FineGym<br>@CUHK | FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding<br>20.04.14 / CVPR'20 / Project Page | Human Annotated | |

### Hallucination

| Name | Paper | Metadata | Note |
|------|-------|----------|------|
| VideoHallucer<br>@BIGAI | VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models<br>24.06.24 / ArXiv / Project Page | Rule+Human Annotated / 948 videos / 7~187s / 1.8K QAs | |

## Datasets

### Pre-Training

| Name | Paper | Data | Metadata |
|------|-------|------|----------|
| ShareGPTVideo<br>@CMU | Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward<br>24.04.01 / ArXiv / Project Page | Dataset | GPT-4V Annotated (10 frames) / 900K Videos / 900K Captions |

### Instruction-Tuning

| Name | Paper | Data | Metadata |
|------|-------|------|----------|
| LLaVA-Video-178K<br>@ByteDance | Video Instruction Tuning with Synthetic Data<br>24.10.03 / ArXiv / Project Page | Dataset | GPT-4o Annotated (1 FPS) / 178K videos / 0~3m / 178K Captions / 1.1M QAs |
| ShareGPTVideo<br>@CMU | Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward<br>24.04.01 / ArXiv / Project Page | Dataset | GPT-4V Annotated (10 frames) / 900K Videos / 900K Captions / 900K QAs |

### RLHF

| Name | Paper | Data | Metadata |
|------|-------|------|----------|
| ShareGPTVideo<br>@CMU | Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward<br>24.04.01 / ArXiv / Project Page | Dataset | ChatGPT Annotated / 17K videos / 17K preference data (example format below) |
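
For orientation, DPO-style preference data of this kind is generally organized as (prompt, chosen, rejected) triples attached to a video. The record below is a minimal hypothetical sketch of such a layout in Python; the field names and values are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical layout of a single video preference record for DPO-style training.
# Field names and values are illustrative only, not the dataset's actual schema.
preference_example = {
    "video": "videos/example_clip.mp4",  # path or ID of the source video
    "prompt": "What is the person in the video doing?",
    "chosen": "The person is chopping vegetables on a cutting board.",  # preferred answer
    "rejected": "The person is playing a guitar in a park.",            # dispreferred answer
}
```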

## Research Topics

### Visual Encoding

| Name | Paper | Note |
|------|-------|------|
| ElasticTok<br>@Berkeley | ElasticTok: Adaptive Tokenization for Image and Video<br>24.10.10 / ArXiv / Project Page | Visual Tokenizer |
| VideoPrism<br>@Google | VideoPrism: A Foundational Visual Encoder for Video Understanding<br>24.02.20 / ICML'24 / Project Page | Video Encoder |
| MMVP<br>@NYU | Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs<br>24.01.11 / ArXiv / Project Page | Hybrid Encoder |

### Visual Token Reduction

| Name | Paper | Note |
|------|-------|------|
| RLT<br>@CMU | Don't Look Twice: Faster Video Transformers with Run-Length Tokenization<br>24.11.07 / NIPS'24 / Project Page | Run-Length Tokenization |
| InTI<br>@NJU | Dynamic and Compressive Adaptation of Transformers From Images to Videos<br>24.08.13 / ECCV'24 / Project Page | Dynamic inter-frame token interpolation |
| Cambrian-1<br>@NYU | Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs<br>24.06.24 / ArXiv / Project Page | Spatial Vision Aggregator |
| FastV<br>@PKU | An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models<br>24.03.11 / ECCV'24 / Project Page | Prune tokens after layer 2 (see sketch below) |
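
As a concrete illustration of the FastV entry above, the snippet below is a minimal PyTorch sketch (not the authors' implementation) of pruning visual tokens after an early decoder layer: visual tokens are ranked by the attention they receive, and only the top fraction is propagated to later layers. The function name, the ranking heuristic (mean received attention rather than FastV's exact criterion), and the default keep ratio are illustrative assumptions.

```python
import torch


def prune_visual_tokens(hidden_states: torch.Tensor,
                        attn_weights: torch.Tensor,
                        visual_start: int,
                        visual_end: int,
                        keep_ratio: float = 0.5) -> torch.Tensor:
    """FastV-style sketch: after an early layer, drop the visual tokens that
    receive the least attention and keep the rest for subsequent layers.

    hidden_states: (batch, seq_len, dim) activations after the chosen layer
    attn_weights:  (batch, heads, seq_len, seq_len) attention maps of that layer
    visual_start/visual_end: slice of the sequence holding the visual tokens
    keep_ratio: fraction of visual tokens to keep (illustrative default)
    """
    # Average, over heads and query positions, the attention each visual token receives.
    received = attn_weights.mean(dim=1)                            # (B, S, S)
    scores = received[:, :, visual_start:visual_end].mean(dim=1)   # (B, n_visual)

    n_visual = visual_end - visual_start
    k = max(1, int(n_visual * keep_ratio))
    keep_idx = scores.topk(k, dim=-1).indices.sort(dim=-1).values  # preserve temporal order

    kept_visual = torch.stack([
        hidden_states[b, visual_start:visual_end][keep_idx[b]]
        for b in range(hidden_states.size(0))
    ])
    # Reassemble the sequence: prefix tokens + pruned visual tokens + remaining text tokens.
    return torch.cat(
        [hidden_states[:, :visual_start], kept_visual, hidden_states[:, visual_end:]],
        dim=1,
    )
```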

### Streaming

| Name | Paper | Note |
|------|-------|------|
| Streaming_VDC<br>@Google | Streaming Dense Video Captioning<br>24.04.01 / CVPR'24 / Project Page | Framework |
