FoleyCrafter

Sound effects are the unsung heroes of cinema and gaming, enhancing realism, impact, and emotional depth for an immersive audiovisual experience. FoleyCrafter is a video-to-audio generation framework that produces realistic sound effects that are semantically relevant to, and synchronized with, the input video.

Your star is our fuel! We're revving up the engines with it!

FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds

Yiming Zhang, Yicheng Gu, Yanhong Zeng†, Zhening Xing, Yuancheng Wang, Zhizheng Wu, Kai Chen†

(†Corresponding Author)

What's New

A more powerful one 😝 .
Release training code.
2024/07/01 Release the model and code of FoleyCrafter.

Setup

Prepare Environment

Use the following commands to install the dependencies:

# install conda environment
conda env create -f requirements/environment.yaml
conda activate foleycrafter

# install Git LFS to download the checkpoints
conda install git-lfs
git lfs install

Download Checkpoints

Running inference.py downloads the checkpoints automatically.

You can also download them manually with the following commands.

Download the text-to-audio base model. We use Auffusion.

git clone https://huggingface.co/auffusion/auffusion-full-no-adapter checkpoints/auffusion

Download FoleyCrafter

git clone https://huggingface.co/ymzhang319/FoleyCrafter checkpoints/

Arrange the checkpoints as follows:

└── checkpoints
    ├── semantic
    │   └── semantic_adapter.bin
    ├── vocoder
    │   ├── vocoder.pt
    │   └── config.json
    ├── temporal_adapter.ckpt
    └── timestamp_detector.pth.tar
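
If you download the checkpoints manually, it can help to confirm the layout before running inference. The following is a minimal sketch, not part of the FoleyCrafter code base: the helper name `missing_checkpoints` and the constant `EXPECTED_FILES` are hypothetical, and the file list is simply copied from the tree above.

```python
from pathlib import Path

# Expected files relative to the checkpoints folder, copied from the tree above.
EXPECTED_FILES = [
    "semantic/semantic_adapter.bin",
    "vocoder/vocoder.pt",
    "vocoder/config.json",
    "temporal_adapter.ckpt",
    "timestamp_detector.pth.tar",
]


def missing_checkpoints(ckpt_dir):
    """Return the expected checkpoint files that are absent under ckpt_dir."""
    root = Path(ckpt_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints("checkpoints")
    if missing:
        print("Missing checkpoint files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected checkpoint files are present.")
```

The check only looks at file names. It does not validate file contents or sizes, so a partially downloaded Git LFS pointer file would still pass.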

Gradio demo

You can launch the Gradio interface for FoleyCrafter by running the following command:

python app.py --share

Inference

Video To Audio Generation

python inference.py --save_dir=output/sora/

Results:

Input Video	Generated Audio
0.mp4	0.mp4
1.mp4	1.mp4
2.mp4	2.mp4
3.mp4	3.mp4

Temporal Alignment with Visual Cues

python inference.py \
--temporal_align \
--input=input/avsync \
--save_dir=output/avsync/

Results:

Ground Truth	Generated Audio
0.mp4	0.mp4
1.mp4	1.mp4
2.mp4	2.mp4

Text-based Video to Audio Generation

Using Prompt

# case1
python inference.py \
--input=input/PromptControl/case1/ \
--seed=10201304011203481429 \
--save_dir=output/PromptControl/case1/

python inference.py \
--input=input/PromptControl/case1/ \
--seed=10201304011203481429 \
--prompt='noisy, people talking' \
--save_dir=output/PromptControl/case1_prompt/

# case2
python inference.py \
--input=input/PromptControl/case2/ \
--seed=10021049243103289113 \
--save_dir=output/PromptControl/case2/

python inference.py \
--input=input/PromptControl/case2/ \
--seed=10021049243103289113 \
--prompt='seagulls' \
--save_dir=output/PromptControl/case2_prompt/

Results:

Generated Audio	Generated Audio
Without Prompt	Prompt: noisy, people talking
0.mp4	0.mp4
Without Prompt	Prompt: seagulls
0.mp4	0.mp4

Using Negative Prompt

# case3
python inference.py \
--input=input/PromptControl/case3/ \
--seed=10041042941301238011 \
--save_dir=output/PromptControl/case3/

python inference.py \
--input=input/PromptControl/case3/ \
--seed=10041042941301238011 \
--nprompt='river flows' \
--save_dir=output/PromptControl/case3_nprompt/

# case4
python inference.py \
--input=input/PromptControl/case4/ \
--seed=10014024412012338096 \
--save_dir=output/PromptControl/case4/

python inference.py \
--input=input/PromptControl/case4/ \
--seed=10014024412012338096 \
--nprompt='noisy, wind noise' \
--save_dir=output/PromptControl/case4_nprompt/

Results:

Generated Audio	Generated Audio
Without Prompt	Negative Prompt: river flows
0.mp4	0.mp4
Without Prompt	Negative Prompt: noisy, wind noise
0.mp4	0.mp4
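
The prompt and negative-prompt comparisons above all follow one pattern: run the same input with the same seed twice, once without text conditioning and once with it, so that the text is the only difference. The sketch below builds such command pairs in Python. It is an illustration only: the function names `build_command` and `paired_commands` are hypothetical, the `_prompt` / `_nprompt` output suffixes mirror the examples above, and only flags that appear in this README are used.

```python
import shlex
import subprocess


def build_command(input_dir, save_dir, seed, flag=None, text=None):
    """Build an inference.py command from flags documented in this README."""
    cmd = ["python", "inference.py", f"--input={input_dir}", f"--seed={seed}"]
    if flag is not None:
        if flag not in ("prompt", "nprompt"):
            raise ValueError("flag must be 'prompt' or 'nprompt'")
        cmd.append(f"--{flag}={text}")
    cmd.append(f"--save_dir={save_dir}")
    return cmd


def paired_commands(input_dir, save_dir, seed, flag, text):
    """Return (baseline, conditioned) commands sharing the input and the seed."""
    baseline = build_command(input_dir, save_dir, seed)
    # Write the conditioned result next to the baseline, e.g. case3_nprompt/.
    conditioned_dir = save_dir.rstrip("/") + "_" + flag + "/"
    conditioned = build_command(input_dir, conditioned_dir, seed, flag, text)
    return baseline, conditioned


def run_pair(pair, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in pair:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

For example, `run_pair(paired_commands("input/PromptControl/case3/", "output/PromptControl/case3/", 10041042941301238011, "nprompt", "river flows"))` prints the two case3 commands without running them. Passing the arguments as a list avoids shell quoting problems with prompts that contain spaces or commas.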

Commandline Usage Parameters

options:
  -h, --help            show this help message and exit
  --prompt PROMPT       prompt for audio generation
  --nprompt NPROMPT     negative prompt for audio generation
  --seed SEED           random seed
  --temporal_align TEMPORAL_ALIGN
                        use temporal adapter or not
  --temporal_scale TEMPORAL_SCALE
                        temporal align scale
  --semantic_scale SEMANTIC_SCALE
                        visual content scale
  --input INPUT         input video folder path
  --ckpt CKPT           checkpoints folder path
  --save_dir SAVE_DIR   generation result save path
  --pretrain PRETRAIN   generator checkpoint path
  --device DEVICE
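
To see how `--semantic_scale` and `--temporal_scale` affect the result, one option is to generate a small grid of runs, each writing to its own output folder. The sketch below only builds the command lists. The function name `sweep_commands`, the output folder naming, and any scale values you pass in are hypothetical, since this README does not state defaults or valid ranges. `--temporal_align` is passed bare, as in the temporal-alignment example above, although the help text lists a value for it, so check `python inference.py --help` on your checkout.

```python
from itertools import product


def sweep_commands(input_dir, save_root, semantic_scales, temporal_scales):
    """Build one inference.py command per (semantic, temporal) scale pair."""
    commands = []
    for sem, tmp in product(semantic_scales, temporal_scales):
        # One output folder per combination so results are not overwritten.
        save_dir = f"{save_root.rstrip('/')}/sem{sem}_tmp{tmp}/"
        commands.append([
            "python", "inference.py",
            "--temporal_align",
            f"--input={input_dir}",
            f"--semantic_scale={sem}",
            f"--temporal_scale={tmp}",
            f"--save_dir={save_dir}",
        ])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`. Adding a fixed `--seed` to every command would keep the scales as the only varying factor.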

BibTeX

@misc{zhang2024pia,
  title={FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds},
  author={Yiming Zhang and Yicheng Gu and Yanhong Zeng and Zhening Xing and Yuancheng Wang and Zhizheng Wu and Kai Chen},
  year={2024},
  eprint={2407.01494},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

Contact Us

Yiming Zhang: [email protected]

Yicheng Gu: [email protected]

Yanhong Zeng: [email protected]

LICENSE

Please see the Apache-2.0 license for details.

Acknowledgements

The code is built upon Auffusion, CondFoleyGen and SpecVQGAN.

We also recommend Amphion, a toolkit for audio, music, and speech generation 💝.
