Hey, thanks for open-sourcing this code!
I had a quick question about the `finetune_unet` function in `train.py`: why are there two forward passes and loss computations through the unet?
Is it implementing some sort of self-conditioning, which I've read about in some text-to-image diffusion papers (or could you point me to the part of the paper this corresponds to)?
Thanks!
Hi, and thanks! Just to clarify, I'm not the original author of the code.
The two forward passes are for the text-to-image training part, if the user chooses to enable it. In my tests, training the text encoder alongside video data (temporal information) does not work well, whereas sampling a single frame works much, much better.
An alternative is to concatenate the frames along the batch dimension before feeding them to the text encoder, but that would carry a significant memory footprint.
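For intuition, here's a minimal sketch of the two-pass idea. All names here are hypothetical (it assumes a diffusers-style 3D UNet whose forward returns an object with a `.sample` field, and a scheduler with an `add_noise` method); it is not the repo's exact code:

```python
import torch
import torch.nn.functional as F

def two_pass_loss(unet, noise_scheduler, latents, encoder_hidden_states):
    """Hypothetical sketch: one pass over the full clip, one over a single frame.

    `latents` is assumed to be shaped (B, C, T, H, W), where T is the
    number of frames.
    """
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0,
        noise_scheduler.config.num_train_timesteps,
        (latents.shape[0],),
        device=latents.device,
    ).long()
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    # Pass 1: the full clip, so the temporal layers see all frames.
    video_pred = unet(
        noisy_latents, timesteps, encoder_hidden_states=encoder_hidden_states
    ).sample
    video_loss = F.mse_loss(video_pred, noise)

    # Pass 2: one randomly sampled frame, treated as an image batch (T == 1),
    # which acts as a plain text-to-image objective.
    idx = torch.randint(0, latents.shape[2], (1,)).item()
    image_pred = unet(
        noisy_latents[:, :, idx : idx + 1],
        timesteps,
        encoder_hidden_states=encoder_hidden_states,
    ).sample
    image_loss = F.mse_loss(image_pred, noise[:, :, idx : idx + 1])

    return video_loss + image_loss
```

The actual `finetune_unet` in `train.py` may structure this differently; the point is just that the second pass drops the temporal dimension, so the model also gets a single-frame (text-to-image) training signal.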
Hope that clears it up!