Sound effects are the unsung heroes of cinema and gaming, enhancing realism, impact, and emotional depth for an immersive audiovisual experience. FoleyCrafter is a video-to-audio generation framework that produces realistic sound effects which are semantically relevant to, and temporally synchronized with, the input video.
Your star is our fuel! We're revving up the engines with it!
FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds
Yiming Zhang, Yicheng Gu, Yanhong Zeng†, Zhening Xing, Yuancheng Wang, Zhizheng Wu, Kai Chen†
(†Corresponding Author)
- A more powerful one 😝.
- Release training code.
- 2024/07/01: Release the model and code of FoleyCrafter.
Use the following command to install dependencies:
# install conda environment
conda env create -f requirements/environment.yaml
conda activate foleycrafter
# install GIT LFS for checkpoints download
conda install git-lfs
git lfs install
The checkpoints will be downloaded automatically when you run inference.py. You can also download them manually with the following commands:
git clone https://huggingface.co/auffusion/auffusion-full-no-adapter checkpoints/auffusion
git clone https://huggingface.co/ymzhang319/FoleyCrafter checkpoints/
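If git-lfs is unavailable, the same repositories can be fetched with the huggingface_hub CLI instead; this is an equivalent sketch, assuming the huggingface_hub package is installed (the target directories mirror the layout below):
# alternative download via huggingface_hub
pip install -U huggingface_hub
huggingface-cli download auffusion/auffusion-full-no-adapter --local-dir checkpoints/auffusion
huggingface-cli download ymzhang319/FoleyCrafter --local-dir checkpoints/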
Place the checkpoints as follows:
└── checkpoints
    ├── semantic
    │   └── semantic_adapter.bin
    ├── vocoder
    │   ├── vocoder.pt
    │   └── config.json
    ├── temporal_adapter.ckpt
    └── timestamp_detector.pth.tar
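A quick sanity check that every expected file landed in place (a minimal sketch; the file names are taken from the layout above):
# verify the checkpoint layout shown above
for f in checkpoints/semantic/semantic_adapter.bin \
         checkpoints/vocoder/vocoder.pt \
         checkpoints/vocoder/config.json \
         checkpoints/temporal_adapter.ckpt \
         checkpoints/timestamp_detector.pth.tar; do
    [ -f "$f" ] && echo "OK      $f" || echo "MISSING $f"
done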
You can launch the Gradio interface for FoleyCrafter by running the following command:
python app.py --share
You can also run inference from the command line:
python inference.py --save_dir=output/sora/
Results:
| Input Video | Generated Audio |
| --- | --- |
| 0.mp4 | 0.mp4 |
| 1.mp4 | 1.mp4 |
| 2.mp4 | 2.mp4 |
| 3.mp4 | 3.mp4 |
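To generate audio for your own clips, point --input at a folder of videos (the flag is listed in the options below); the paths here are illustrative:
# run on a custom folder of videos (paths are examples)
python inference.py \
    --input=input/my_videos \
    --save_dir=output/my_videos/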
- Temporal Alignment with Visual Cues
python inference.py \
--temporal_align \
--input=input/avsync \
--save_dir=output/avsync/
Results:
| Ground Truth | Generated Audio |
| --- | --- |
| 0.mp4 | 0.mp4 |
| 1.mp4 | 1.mp4 |
| 2.mp4 | 2.mp4 |
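The strength of the temporal adapter can be tuned with --temporal_scale (listed in the options below); a minimal sketch sweeping a few illustrative values:
# sweep the temporal adapter strength (values are illustrative, not tuned defaults)
for scale in 0.5 1.0 1.5; do
    python inference.py \
        --temporal_align \
        --temporal_scale=${scale} \
        --input=input/avsync \
        --save_dir=output/avsync_scale_${scale}/
done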
- Using Prompt
# case1
python inference.py \
--input=input/PromptControl/case1/ \
--seed=10201304011203481429 \
--save_dir=output/PromptControl/case1/
python inference.py \
--input=input/PromptControl/case1/ \
--seed=10201304011203481429 \
--prompt='noisy, people talking' \
--save_dir=output/PromptControl/case1_prompt/
# case2
python inference.py \
--input=input/PromptControl/case2/ \
--seed=10021049243103289113 \
--save_dir=output/PromptControl/case2/
python inference.py \
--input=input/PromptControl/case2/ \
--seed=10021049243103289113 \
--prompt='seagulls' \
--save_dir=output/PromptControl/case2_prompt/
Results:
| Generated Audio | Generated Audio |
| --- | --- |
| Without Prompt | Prompt: noisy, people talking |
| 0.mp4 | 0.mp4 |
| Without Prompt | Prompt: seagulls |
| 0.mp4 | 0.mp4 |
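How strongly the visual content drives the generated audio can be adjusted with --semantic_scale (listed in the options below); a sketch reusing case1 with an illustrative scale value and output path:
# increase the influence of visual semantics (scale value and output path are illustrative)
python inference.py \
    --input=input/PromptControl/case1/ \
    --prompt='noisy, people talking' \
    --semantic_scale=1.25 \
    --save_dir=output/PromptControl/case1_semantic/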
- Using Negative Prompt
# case 3
python inference.py \
--input=input/PromptControl/case3/ \
--seed=10041042941301238011 \
--save_dir=output/PromptControl/case3/
python inference.py \
--input=input/PromptControl/case3/ \
--seed=10041042941301238011 \
--nprompt='river flows' \
--save_dir=output/PromptControl/case3_nprompt/
# case4
python inference.py \
--input=input/PromptControl/case4/ \
--seed=10014024412012338096 \
--save_dir=output/PromptControl/case4/
python inference.py \
--input=input/PromptControl/case4/ \
--seed=10014024412012338096 \
--nprompt='noisy, wind noise' \
--save_dir=output/PromptControl/case4_nprompt/
Results:
| Generated Audio | Generated Audio |
| --- | --- |
| Without Prompt | Negative Prompt: river flows |
| 0.mp4 | 0.mp4 |
| Without Prompt | Negative Prompt: noisy, wind noise |
| 0.mp4 | 0.mp4 |
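Positive and negative prompts can also be combined in a single run; a sketch reusing case4, with illustrative prompts and output path:
# combine a prompt and a negative prompt (prompts and output path are illustrative)
python inference.py \
    --input=input/PromptControl/case4/ \
    --prompt='seagulls' \
    --nprompt='noisy, wind noise' \
    --save_dir=output/PromptControl/case4_combined/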
All configuration options of inference.py:
options:
-h, --help show this help message and exit
--prompt PROMPT prompt for audio generation
--nprompt NPROMPT negative prompt for audio generation
--seed SEED random seed
--temporal_align TEMPORAL_ALIGN
use temporal adapter or not
--temporal_scale TEMPORAL_SCALE
temporal align scale
--semantic_scale SEMANTIC_SCALE
visual content scale
--input INPUT input video folder path
--ckpt CKPT checkpoints folder path
--save_dir SAVE_DIR generation result save path
--pretrain PRETRAIN generator checkpoint path
--device DEVICE
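For reference, a single invocation touching most of the options above (the seed, scale values, prompts, and paths are illustrative, not recommended defaults; --device is assumed to accept a torch device string such as cuda):
# example combining most flags (values are illustrative)
python inference.py \
    --input=input/avsync \
    --ckpt=checkpoints/ \
    --save_dir=output/full_example/ \
    --prompt='footsteps on gravel' \
    --nprompt='noisy, wind noise' \
    --seed=42 \
    --temporal_align \
    --temporal_scale=1.0 \
    --semantic_scale=1.0 \
    --device=cuda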
If you find FoleyCrafter useful for your research, please cite:
@misc{zhang2024foleycrafter,
      title={FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds},
      author={Yiming Zhang and Yicheng Gu and Yanhong Zeng and Zhening Xing and Yuancheng Wang and Zhizheng Wu and Kai Chen},
      year={2024},
      eprint={2407.01494},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
Yiming Zhang: [email protected]
Yicheng Gu: [email protected]
Yanhong Zeng: [email protected]
Please check the LICENSE file for details on the FoleyCrafter code. If you are using it for commercial purposes, please also check the license of Auffusion.
The code is built upon Auffusion, CondFoleyGen, and SpecVQGAN.
We also recommend Amphion 💝, a toolkit for Audio, Music, and Speech Generation.