To evaluate the aesthetic quality of videos, we use the scoring model from CLIP+MLP Aesthetic Score Predictor. This model is trained on 176K SAC (Simulacra Aesthetic Captions) pairs, 15K LAION-Logos (Logos) pairs, and 250K AVA (The Aesthetic Visual Analysis) image-text pairs.
The aesthetic score is between 1 and 10, where 5.5 can be considered as the threshold for fair aesthetics, and 6.5 for high aesthetics. Good text-to-image models can achieve a score of 7.0 or higher.
For videos, we extract the first, middle, and last frames for evaluation. The script also supports images as input. The throughput of our code is ~1K videos/s on a single H800 GPU. It also supports running on multiple GPUs for further acceleration.
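For reference, the predictor is a small MLP head on top of frozen CLIP ViT-L/14 image features. Below is a minimal sketch of scoring a single frame with the checkpoint downloaded in the step that follows; the head layout and the L2 normalization of the CLIP feature follow the upstream improved-aesthetic-predictor repository, so verify them against that repo if the checkpoint does not load.
# Minimal sketch: score one frame with CLIP ViT-L/14 features + the aesthetic MLP head.
# The head layout below mirrors the upstream improved-aesthetic-predictor repo (an assumption);
# check that repo if the state dict keys do not match.
import clip
import torch
import torch.nn as nn
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-L/14", device=device)

class AestheticHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(768, 1024), nn.Dropout(0.2),
            nn.Linear(1024, 128), nn.Dropout(0.2),
            nn.Linear(128, 64), nn.Dropout(0.1),
            nn.Linear(64, 16),
            nn.Linear(16, 1),
        )

    def forward(self, x):
        return self.layers(x)

head = AestheticHead().to(device)
head.load_state_dict(torch.load("pretrained_models/aesthetic.pth", map_location=device))
head.eval()

@torch.no_grad()
def aesthetic_score(image_path: str) -> float:
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    feat = clip_model.encode_image(image).float()
    feat = feat / feat.norm(dim=-1, keepdim=True)  # L2-normalize, as in the upstream repo
    return head(feat).item()                       # roughly 1 (poor) to 10 (excellent)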
First, install the required packages and download the scoring model to ./pretrained_models/aesthetic.pth.
# pip install
pip install git+https://github.com/openai/CLIP.git
pip install decord
# get pretrained model
wget https://github.com/christophschuhmann/improved-aesthetic-predictor/raw/main/sac+logos+ava1-l14-linearMSE.pth -O pretrained_models/aesthetic.pth
Then, run the following command. Make sure the meta file has the column path (path to the sample).
torchrun --nproc_per_node 8 -m tools.scoring.aesthetic.inference /path/to/meta.csv --bs 1024 --num_workers 16
This will generate multiple part files, one per process (GPU). To merge them, run:
python -m tools.datasets.datautil /path/to/meta_aes_part*.csv --output /path/to/meta_aes.csv
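If you just want to see what the merge step does, it is a plain concatenation of the per-process part files. A minimal pandas sketch (with illustrative paths) is:
# Sketch of the merge step: concatenate the per-process part files into one CSV.
# The glob pattern and output path are illustrative; adjust them to your setup.
import glob
import pandas as pd

parts = sorted(glob.glob("/path/to/meta_aes_part*.csv"))
merged = pd.concat([pd.read_csv(p) for p in parts], ignore_index=True)
merged.to_csv("/path/to/meta_aes.csv", index=False)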
Optical flow scores are used to assess the motion of a video. Higher optical flow scores indicate larger movement. We use the UniMatch model for this task.
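As a rough mental model, a dense flow field can be reduced to a single motion score by averaging the per-pixel flow magnitude. The sketch below illustrates this reduction; the exact reduction used by the inference script may differ.
# Sketch: turn a dense optical-flow field (e.g. predicted by UniMatch) into one motion score
# by averaging the per-pixel displacement magnitude. This is an illustrative reduction only.
import torch

def flow_score(flow: torch.Tensor) -> float:
    """flow: [2, H, W] tensor of (dx, dy) displacements between two frames."""
    magnitude = torch.sqrt(flow[0] ** 2 + flow[1] ** 2)  # per-pixel displacement length
    return magnitude.mean().item()                        # larger value -> larger motion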
First, download the pretrained model to ./pretrained_models/unimatch/.
wget https://s3.eu-central-1.amazonaws.com/avg-projects/unimatch/pretrained/gmflow-scale2-regrefine6-mixdata-train320x576-4e7b215d.pth -P ./pretrained_models/unimatch/
Then, run the following command. Make sure the meta file has the column path (path to the sample).
torchrun --standalone --nproc_per_node 8 tools/scoring/optical_flow/inference.py /path/to/meta.csv
This should output /path/to/meta_flow.csv with the column flow.
Some videos contain dense text, such as news broadcasts and advertisements, which is undesirable for training. We apply Optical Character Recognition (OCR) to detect text and drop samples with dense text. Here, we use the DBNet++ model implemented by MMOCR.
First, install MMOCR. For reference, we use the following package versions:
torch==2.0.1
mmcv==2.0.1
mmdet==3.1.0
mmocr==1.0.1
Then, run the following command. Make sure the meta file has the column path (path to the sample).
torchrun --standalone --nproc_per_node 8 tools/scoring/ocr/inference.py /path/to/meta.csv
This should output /path/to/meta_ocr.csv with the column ocr, indicating the number of detected text regions with confidence > 0.3.
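For intuition, counting regions above that confidence with MMOCR's high-level inferencer looks roughly like the sketch below. The model alias "DBNetpp" and the result keys are assumptions based on MMOCR 1.0's documented interface; check the MMOCR model zoo and your installed version for the exact names.
# Sketch of the OCR-based filter: run DBNet++ via MMOCR's inferencer and count detected
# text regions whose confidence exceeds 0.3. Model alias and result keys are assumptions.
from mmocr.apis import MMOCRInferencer

detector = MMOCRInferencer(det="DBNetpp")  # text detection only, no recognition

def count_text_regions(image_path: str, conf_thr: float = 0.3) -> int:
    pred = detector(image_path, return_vis=False)["predictions"][0]
    return sum(score > conf_thr for score in pred["det_scores"])

# Samples with a large count (dense text, e.g. news broadcasts) can then be dropped.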
Matching scores are calculated to evaluate the alignment between an image/video and its caption. Here, we use the CLIP model, which is trained on image-text pairs. We simply use the cosine similarity as the matching score. For videos, we extract the middle frame and compare it with the caption.
First, install OpenAI CLIP.
pip install git+https://github.com/openai/CLIP.git
Then, run the following command. Make sure the meta file has the columns path (path to the sample) and text (caption of the sample).
torchrun --standalone --nproc_per_node 8 tools/scoring/matching/inference.py /path/to/meta.csv
This should output /path/to/meta_match.csv with the column match. Higher matching scores indicate better image-text/video-text alignment.
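For reference, the sketch below computes such a matching score for one frame-caption pair with OpenAI CLIP. The ViT-L/14 variant is an assumption; the inference script may use a different CLIP backbone.
# Sketch of the matching score: cosine similarity between the CLIP embedding of the
# (middle) frame and the CLIP embedding of its caption.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

@torch.no_grad()
def matching_score(frame_path: str, caption: str) -> float:
    image = preprocess(Image.open(frame_path)).unsqueeze(0).to(device)
    text = clip.tokenize([caption], truncate=True).to(device)
    img_feat = model.encode_image(image).float()
    txt_feat = model.encode_text(text).float()
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat * txt_feat).sum().item()  # cosine similarity in [-1, 1]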
Once the scores are obtained, filtering samples by these scores is straightforward. Here is an example that removes samples with an aesthetic score below 5.0.
python -m tools.datasets.datautil /path/to/meta.csv --aesmin 5.0
This should output /path/to/meta_aesmin5.0.csv, keeping only samples with aes >= 5.0.
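For more flexible filtering (e.g., combining several score columns), the merged meta file can also be filtered directly with pandas. The thresholds and output path below are illustrative only.
# Sketch of custom filtering with pandas, combining several of the score columns above.
import pandas as pd

meta = pd.read_csv("/path/to/meta.csv")
kept = meta[(meta["aes"] >= 5.0) & (meta["flow"] >= 0.5) & (meta["ocr"] <= 5)]
kept.to_csv("/path/to/meta_filtered.csv", index=False)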