Memory requirements for full fine-tune of SD 3.5 Large? #1118
You probably didn't enable gradient checkpointing, but it's hard to know where to begin without a config file.
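For anyone hitting the same wall: a quick way to confirm the flag actually made it into the config the trainer reads is a simple grep. The key name and file path here are assumptions and may differ across SimpleTuner versions.

```bash
# Look for the checkpointing flag in the attached config (key name assumed)
grep -i "gradient_checkpointing" config.json
```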
My bad. Gradient checkpointing is set to true. Attached: config.json, multidatabackend.json
Chances are you'll need to follow the DeepSpeed guide to enable full-rank training on the 8B model. I think otherwise it wants about 110-130GB of memory for everything (weights, optimizer states, gradients).
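To see where that 110-130GB figure comes from, here is an assumption-laden back-of-envelope budget for full-rank training of an 8B-parameter model with AdamW under mixed precision; activations and framework overhead come on top.

```bash
# Assumed bytes per parameter for a bf16 + fp32-master AdamW setup:
#   bf16 weights          2
#   bf16 gradients        2
#   fp32 master weights   4
#   Adam first moment     4
#   Adam second moment    4
#                      ----
#                        16 bytes/param
echo "$(( 8 * 16 )) GB"   # 8B params * 16 bytes ~= 128 GB, inside the 110-130GB range
```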
On it. Giving it a go now.
Still struggling... If you can, help explain like I'm a noob, because I am.
It seems like no matter how much VRAM I throw at it, it wants mooooore. Any ideas on how to push past that? EDIT: Tried DeepSpeed level 3 on 80GB and still got this:
[out-of-memory error screenshot]
It sounds like DeepSpeed may not be properly enabled then, because it definitely works on a single 80G card with level 2 for Flux's 12B params; that uses just 73G VRAM.
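One way to rule out a stale or ignored config is to force DeepSpeed on from the command line. This is a sketch, not SimpleTuner's documented invocation: the `train.py` entry point is assumed, while `--use_deepspeed` and `--zero_stage` are standard `accelerate launch` options. If DeepSpeed really is active, its ZeRO initialization messages should appear early in the logs.

```bash
# Force ZeRO stage 2 explicitly instead of relying on a saved accelerate config
accelerate launch --use_deepspeed --zero_stage 2 --mixed_precision bf16 train.py
```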
I found the problem. I set a HF_HOME directory to my network volume so I wouldn't have to keep downloading models every time I booted the server. It's saving here:
But loading from here:
Setting ACCELERATE_CONFIG_PATH solved it, and training is rolling. Thanks! Coming in at just under 60GB...
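For anyone replicating this fix, a minimal sketch: keep the HF cache and the Accelerate config on the same volume, and point the trainer at both. The mount point below is an assumption; `ACCELERATE_CONFIG_PATH` is the variable that worked in this thread, and `accelerate env` prints which config file Accelerate actually resolved.

```bash
# Assumed network-volume mount point
export HF_HOME=/workspace/huggingface
export ACCELERATE_CONFIG_PATH="$HF_HOME/accelerate/default_config.yaml"

# Sanity check: shows the resolved default config and its contents
accelerate env
```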
Trying to figure out memory requirements for fine-tuning SD 3.5 Large. I spun up an L40 instance (48GB), but the script tried to allocate 96GB.
I'm new to the fine-tuning world, so I'm not sure where to look next. H100s on Runpod max out at 94GB. Is fine-tuning out of reach right now? Or would training on multiple GPUs divide up the resources? Are there any optimization strategies that I'm missing?
A point in the right direction would be greatly appreciated 🙏
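On the multi-GPU question: yes, DeepSpeed ZeRO does divide the resources. Stage 2 shards gradients and optimizer states across ranks, and stage 3 shards the parameters as well, so per-GPU memory drops as you add cards. A hedged sketch with an assumed `train.py` entry point:

```bash
# Two GPUs, ZeRO stage 3: params, gradients, and optimizer state are partitioned
accelerate launch --num_processes 2 --use_deepspeed --zero_stage 3 train.py
```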