Add InfiNet module for DiffusionOverDiffusion training to allow for extremely (minutes!) long video creation #27
Conversation
This is great @kabachuha! Thanks for this PR, and sure, we can get in touch.
@kabachuha thanks for your contribution!
@sergiobr hi, we have some sort of a text2video team on the Deforum Discord server, join it :) https://discord.gg/deforum
@ExponentialML training works, btw
Great! Let me know if you need any assistance getting things up to speed with the new repository changes.
Yeah, I'd really appreciate help carrying it over, since you know the mainline changes much better.
By all means. Just let me know when it's ready to merge. If you don't want to resolve the conflicts yourself, I'm more than willing to do it 👍.
as MP4 often fails for such short videos
bump bump
So, I'm going to write an automatic DoD captioner using OpenAI's API (or another LLM provider, maybe a local oobabooga). How it will work:
It eliminates the difficulty of forming the mid-level captions
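The idea of having an LLM fill in the mid-level captions could be sketched roughly like this. This is a hypothetical sketch, not code from the PR: the function names (`expand_captions`, `build_caption_tree`) and the pluggable `llm` callable are my own invention, and a real version would call an actual LLM API instead of the placeholder.

```python
from typing import Callable, List


def expand_captions(caption: str, n_children: int,
                    llm: Callable[[str], List[str]]) -> List[str]:
    """Ask an LLM to split one coarse scene caption into n_children
    finer, consecutive sub-captions (one per child clip)."""
    prompt = (f"Split this scene description into {n_children} "
              f"consecutive shorter scene descriptions: {caption}")
    return llm(prompt)[:n_children]


def build_caption_tree(root_caption: str, depth: int, branching: int,
                       llm: Callable[[str], List[str]]) -> list:
    """Recursively expand a top-level video caption into a tree of
    mid-level captions, one tree level per DiffusionOverDiffusion depth."""
    if depth == 0:
        return [root_caption]
    children = expand_captions(root_caption, branching, llm)
    return [root_caption,
            [build_caption_tree(c, depth - 1, branching, llm)
             for c in children]]


# Usage with a stand-in "LLM" (a real one would return finer captions):
fake_llm = lambda prompt: ["first half of the scene", "second half of the scene"]
tree = build_caption_tree("a full episode synopsis", depth=1, branching=2,
                          llm=fake_llm)
```

The leaves of the tree would then serve as the local prompts for the deepest DoD level, removing the need to hand-write captions for every intermediate clip.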
sooo any updates on this?
bump |
Hi, Exponential-ML!
As you probably know, a bit more than a week ago Microsoft published a paper describing the novel DiffusionOverDiffusion technique (https://arxiv.org/abs/2303.12346), which works by first outlining the coarse keyframes, then picking pairs of them as starting points and filling in the in-betweens (with different, more local prompts!).
Using it, they were able to fine-tune on and create whole 11-minute-long Flintstones episodes: https://www.reddit.com/r/StableDiffusion/comments/11zwaxx/microsofts_nuwaxl_creates_an_11_minute/
Seeing their impressive results, I couldn't restrain myself from trying to replicate them.
Having read the article, I noticed that the model structure is extremely similar to the ModelScope one; the only difference is the 'video conditioning' layer (in green), whose information is transferred into the preexisting U-Net3D by a set of Conv-down cells.
Because they use so-called zero-convolutions, I was able to implement that layer as a ControlNet-like network (https://github.com/kabachuha/InfiNet), which makes it possible to introduce the new layers without altering the behavior of the existing model. (See `DoDBlock` in the code.)
I have already tested inference with `diffusion_depth=0` and `diffusion_depth=1` (any `diffusion_depth>0` turns on the DoD blocks), so the model definitely works at inference time.
I'll start training experiments as soon as I figure out the dataset and the system requirements for it.
P.S. @ExponentialML, contact me on Discord. I'd really appreciate closer communication.