
🐼 Panda-70M

This is the official GitHub repository of Panda-70M.

Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, Sergey Tulyakov
Computer Vision and Pattern Recognition (CVPR) 2024

arXiv Project Page YouTube

Introduction

Panda-70M is a large-scale dataset with 70M high-quality video-caption pairs. This repository has three sections:

  • Dataset Dataloading includes the CSV files listing the data of Panda-70M and the code to download the dataset.
  • Splitting includes the code to split a long video into multiple semantically consistent short clips.
  • Captioning includes the proposed video captioning model trained on Panda-70M.

🔥 Updates (Oct 2024)

To enhance the training of video generation models, which benefit most from single-shot videos with meaningful motion and aesthetically pleasing scenes, we introduce two additional annotations:

  • Desirability Filtering: This annotation assesses whether a video is a suitable training sample. We categorize videos into six groups based on their characteristics: desirable, 0_low_desirable_score, 1_still_foreground_image, 2_tiny_camera_movement, 3_screen_in_screen, 4_computer_screen_recording. The table below lists each category together with its share of the dataset.
  • Shot Boundary Detection: This annotation provides a list of intervals representing continuous shots within a video (predicted by TransNetV2). If the list has length one, the video consists of a single continuous shot without any shot boundaries. A sketch of how these two annotations can be used to filter the dataset follows the table below.
| Category                    | Share of dataset |
| --------------------------- | ---------------- |
| desirable                   | 80.5%            |
| 0_low_desirable_score       | 5.28%            |
| 1_still_foreground_image    | 6.82%            |
| 2_tiny_camera_movement      | 1.20%            |
| 3_screen_in_screen          | 5.03%            |
| 4_computer_screen_recording | 1.13%            |
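As a rough illustration of how the two new annotations might be combined, here is a minimal filtering sketch in Python. The CSV file name and the column names "desirability" and "shot_boundaries" are assumptions made for illustration only; consult the Dataset Dataloading section for the actual schema.

```python
# A minimal filtering sketch (assumed file and column names; verify against
# the actual CSV schema documented in Dataset Dataloading).
import ast

import pandas as pd

# Load one of the metadata CSVs (hypothetical file name).
df = pd.read_csv("panda70m_training_2m.csv")

# Keep only clips labeled "desirable" by the desirability filter.
df = df[df["desirability"] == "desirable"]

# Keep only single-shot clips: the shot-boundary annotation is a list of
# [start, end] intervals, so a single continuous shot has exactly one entry.
def is_single_shot(raw):
    intervals = ast.literal_eval(raw) if isinstance(raw, str) else raw
    return len(intervals) == 1

df = df[df["shot_boundaries"].apply(is_single_shot)]
print(f"{len(df)} clips kept for video-generation training")
```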
**We will remove video samples from our dataset / GitHub / project webpage / technical presentation upon request. Please contact tsaishienchen at gmail dot com to make a request.

Dataset

Collection Pipeline

Download

| Split           | Download       | # Source Videos | # Samples  | Video Duration | Storage Space |
| --------------- | -------------- | --------------- | ---------- | -------------- | ------------- |
| Training (full) | link (2.73 GB) | 3,779,763       | 70,723,513 | 167 khrs       | ~36 TB        |
| Training (10M)  | link (504 MB)  | 3,755,240       | 10,473,922 | 37.0 khrs      | ~8.0 TB       |
| Training (2M)   | link (118 MB)  | 800,000         | 2,400,000  | 7.56 khrs      | ~1.6 TB       |
| Validation      | link (1.2 MB)  | 2,000           | 6,000      | 18.5 hrs       | ~4.0 GB       |
| Testing         | link (1.2 MB)  | 2,000           | 6,000      | 18.5 hrs       | ~4.0 GB       |

More details can be found in the Dataset Dataloading section.
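To get a feel for the metadata before committing to a multi-terabyte download, a rough inspection sketch along the following lines may help. The file name and the column names "url", "timestamp", and "caption" are assumptions; the Dataset Dataloading section documents the actual CSV schema and the download script.

```python
# A minimal inspection sketch, assuming the validation CSV has per-clip rows
# with "url", "timestamp", and "caption" columns (verify against the actual
# schema in Dataset Dataloading).
import pandas as pd

df = pd.read_csv("panda70m_validation.csv")  # hypothetical file name

# Each row describes one clip: the source-video URL, the clip's [start, end]
# timestamp within that video, and its caption.
print(df[["url", "timestamp", "caption"]].head())

# Multiple clips are cut from the same source video, which is why "# Samples"
# exceeds "# Source Videos" in the table above.
print("average clips per source video:", df.groupby("url").size().mean())
```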

Demonstration

Video-Caption Pairs in Panda-70M

Example captions from the dataset:

  • A rhino and a lion are fighting in the dirt.
  • A person is holding a long haired dachshund in their arms.
  • A rocket launches into space on the launch pad.
  • A person is kneading dough and putting jam on it.
  • A little boy is playing with a basketball in the city.
  • A 3d rendering of a zoo with animals and a train.
  • A person in blue gloves is connecting an electrical supply to an injector.
  • There is a beach with waves and rocks in the foreground, and a city skyline in the background.
  • It is a rally car driving on a dirt road in the countryside, with people watching from the side of the road.

**We will remove video samples from our dataset / GitHub / project webpage / technical presentation upon request. Please contact tsaishienchen at gmail dot com to make a request.

Please check here for more samples.

Long Video Splitting and Captioning

long_video_demo_1.mp4
long_video_demo_2.mp4

License of Panda-70M

See license. The video samples are collected from a publicly available dataset. Users must comply with the associated license when using these video samples.

Citation

If you find this project useful for your research, please cite our paper. 😊

@inproceedings{chen2024panda70m,
  title     = {Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers},
  author    = {Chen, Tsai-Shien and Siarohin, Aliaksandr and Menapace, Willi and Deyneka, Ekaterina and Chao, Hsiang-wei and Jeon, Byung Eun and Fang, Yuwei and Lee, Hsin-Ying and Ren, Jian and Yang, Ming-Hsuan and Tulyakov, Sergey},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2024}
}

Contact Information

Tsai-Shien Chen: tsaishienchen [at] gmail [dot] com
