Memory requirements for full fine tune of SD 3.5 Large? #1118

Open
roblaughter opened this issue Nov 3, 2024 · 7 comments

@roblaughter

Trying to figure out the memory requirements for fine-tuning SD 3.5 Large. I spun up an L40 instance (48GB), but the script tried to allocate 96GB.

I'm new to the fine-tuning world, so I'm not sure where to look next. H100s on Runpod max out at 94GB. Is fine-tuning out of reach right now? Or would training on multiple GPUs divide up the resources? Are there any optimization strategies that I'm missing?

A point in the right direction would be greatly appreciated 🙏

@bghira
Owner

bghira commented Nov 3, 2024

you probably didn't enable gradient checkpointing, but it's hard to know where to begin w/o a config file
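
(For context, gradient checkpointing drops intermediate activations during the forward pass and recomputes them during backward, trading compute for memory. A minimal PyTorch sketch of the idea, using a toy module rather than the trainer's actual model code:)

import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    """Toy feed-forward block, only here to illustrate checkpointing."""
    def __init__(self, dim=2048):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim),
            torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim),
        )

    def forward(self, x, use_checkpoint=True):
        if use_checkpoint and self.training:
            # Activations inside self.ff are not kept; they are recomputed
            # in the backward pass, which is where the memory saving comes from.
            return checkpoint(self.ff, x, use_reentrant=False)
        return self.ff(x)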

@roblaughter
Author

it's hard to know where to begin w/o a config file

My bad. Gradient checkpointing is set to true.

config.json

{
    "--resume_from_checkpoint": "latest",
    "--data_backend_config": "config/multidatabackend.json",
    "--aspect_bucket_rounding": 2,
    "--seed": 42,
    "--minimum_image_size": 0,
    "--disable_benchmark": false,
    "--output_dir": "output/models",
    "--max_train_steps": 25000,
    "--num_train_epochs": 0,
    "--checkpointing_steps": 1000,
    "--checkpoints_total_limit": 5,
    "--hub_model_id": "vintage-film",
    "--push_to_hub": "true",
    "--push_checkpoints_to_hub": "true",
    "--model_card_safe_for_work": "true",
    "--tracker_project_name": "film-fine-tune",
    "--tracker_run_name": "test-1",
    "--report_to": "wandb",
    "--model_type": "full",
    "--pretrained_model_name_or_path": "stabilityai/stable-diffusion-3.5-large",
    "--model_family": "sd3",
    "--train_batch_size": 2,
    "--gradient_checkpointing": "true",
    "--caption_dropout_probability": 0.2,
    "--resolution_type": "pixel_area",
    "--resolution": 1024,
    "--validation_seed": 42,
    "--validation_steps": "500",
    "--validation_resolution": "1024x1024",
    "--validation_guidance": 5.0,
    "--validation_guidance_rescale": "0.0",
    "--validation_num_inference_steps": "40",
    "--validation_prompt": "a 35 year old British food critic exploring a narrow winding street in London",
    "--mixed_precision": "bf16",
    "--optimizer": "adamw_bf16",
    "--learning_rate": "5e-5",
    "--lr_scheduler": "polynomial",
    "--lr_warmup_steps": 100,
    "--base_model_precision": "no_change",
    "--validation_torch_compile": "false"
}

multidatabackend.json

[
    {
        "id": "film_photos-512",
        "type": "local",
        "instance_data_dir": "/workspace/film_photos",
        "crop": false,
        "crop_style": "random",
        "minimum_image_size": 128,
        "resolution": 512,
        "resolution_type": "pixel_area",
        "repeats": 5,
        "metadata_backend": "discovery",
        "caption_strategy": "textfile",
        "cache_dir_vae": "cache//vae-512"
    },
    {
        "id": "film_photos-1024",
        "type": "local",
        "instance_data_dir": "/workspace/film_photos",
        "crop": false,
        "crop_style": "random",
        "minimum_image_size": 128,
        "resolution": 1024,
        "resolution_type": "pixel_area",
        "repeats": 5,
        "metadata_backend": "discovery",
        "caption_strategy": "textfile",
        "cache_dir_vae": "cache//vae-1024"
    },
    {
        "id": "film_photos-512-crop",
        "type": "local",
        "instance_data_dir": "/workspace/film_photos",
        "crop": true,
        "crop_style": "random",
        "minimum_image_size": 128,
        "resolution": 512,
        "resolution_type": "pixel_area",
        "repeats": 5,
        "metadata_backend": "discovery",
        "caption_strategy": "textfile",
        "cache_dir_vae": "cache//vae-512-crop"
    },
    {
        "id": "film_photos-1024-crop",
        "type": "local",
        "instance_data_dir": "/workspace/film_photos",
        "crop": true,
        "crop_style": "random",
        "minimum_image_size": 128,
        "resolution": 1024,
        "resolution_type": "pixel_area",
        "repeats": 5,
        "metadata_backend": "discovery",
        "caption_strategy": "textfile",
        "cache_dir_vae": "cache//vae-1024-crop"
    },
    {
        "id": "text-embed-cache",
        "dataset_type": "text_embeds",
        "default": true,
        "type": "local",
        "cache_dir": "cache//text"
    }
]

@bghira
Owner

bghira commented Nov 3, 2024

Chances are you'll need to follow the DeepSpeed guide to enable full-rank training on the 8B model. I think otherwise it wants about 110-130GB of memory for everything (weights, optimizer states, gradients).
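
(A rough back-of-envelope for where a number in that range comes from, using assumed byte counts rather than measured figures:)

# Rough memory estimate for full-rank training of an ~8B-parameter model.
# Illustrative only: the real total also depends on optimizer state precision,
# activations, any resident text encoders/VAE, and allocator overhead.
params = 8e9
gib = 1024 ** 3

weights_bf16 = params * 2        # bf16 weights, ~15 GiB
grads_bf16   = params * 2        # bf16 gradients, ~15 GiB
adamw_bf16   = params * 2 * 2    # two AdamW moments in bf16, ~30 GiB
adamw_fp32   = params * 4 * 2    # the same moments in fp32, ~60 GiB

low = (weights_bf16 + grads_bf16 + adamw_bf16) / gib
high = (weights_bf16 + grads_bf16 + adamw_fp32) / gib
print(f"weights + gradients + optimizer: roughly {low:.0f}-{high:.0f} GiB")
# Activations (even with gradient checkpointing) and CUDA/framework overhead
# come on top, which is how the single-GPU total can climb past 100 GB.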

@roblaughter
Author

Chances are you'll need to follow the DeepSpeed guide to enable full-rank training on the 8B model

On it. Giving it a go now.

@roblaughter
Author

roblaughter commented Nov 3, 2024

Still struggling... If you can, help explain like I'm a noob. Because I am.

  1. I enabled DeepSpeed level 1 on the L40 48GB. It OOMed. It said it was trying to allocate something like 15 GB while 40-ish were already in use.
  2. Upped it to level 2. Still OOMed.
  3. Switched to an A100 80GB. Enabled DeepSpeed level 1. Still OOMed.
  4. Upped it to level 2. Still OOMed.

Tried to allocate 15.01 GiB. GPU 0 has a total capacity of 79.26 GiB of which 2.91 GiB is free. Process 2719587 has 76.34 GiB memory in use. Of the allocated memory 60.25 GiB is allocated by PyTorch, and 15.01 GiB is reserved by PyTorch but unallocated.

It seems like no matter how much VRAM I throw at it, it wants mooooore.

Any ideas on how to push past that?

EDIT: Tried DeepSpeed level 3 on 80GB and still got this:

Tried to allocate 15.01 GiB. GPU 0 has a total capacity of 79.26 GiB of which 2.91 GiB is free. Process 2734336 has 76.34 GiB memory in use. Of the allocated memory 60.25 GiB is allocated by PyTorch, and 15.01 GiB is reserved by PyTorch but unallocated.

@bghira
Owner

bghira commented Nov 3, 2024

It sounds like DeepSpeed may not be properly enabled then, because it definitely works on a single 80G card with level 2 for Flux's 12B params; that uses just 73G of VRAM.
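
(For reference, a minimal sketch of what the DeepSpeed setup amounts to, written against Hugging Face Accelerate's Python API purely as an illustration; the trainer itself is driven by the accelerate config YAML mentioned later in this thread, and this snippet assumes the deepspeed package is installed:)

# ZeRO stage 2 shards optimizer states and gradients across ranks; on a
# single GPU it is mainly the CPU offload of optimizer states that frees VRAM.
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

ds_plugin = DeepSpeedPlugin(
    zero_stage=2,                    # partition optimizer states + gradients
    offload_optimizer_device="cpu",  # keep optimizer states in system RAM
    gradient_accumulation_steps=1,
)
accelerator = Accelerator(mixed_precision="bf16", deepspeed_plugin=ds_plugin)
# The model, optimizer, and dataloaders would then go through
# accelerator.prepare(...) before the training loop.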

@roblaughter
Author

roblaughter commented Nov 3, 2024

I found the problem.

I had set HF_HOME to a directory on my network volume so I wouldn't have to keep re-downloading models every time I booted the server.

The accelerate config was being saved here:

accelerate configuration saved at /workspace/cache/accelerate/default_config.yaml

But it was being loaded from here:

Using Accelerate config file: /root/.cache/huggingface/accelerate/default_config.yaml

Setting ACCELERATE_CONFIG_PATH solved it, and training is rolling. Thanks!

Coming in at just under 60GB...
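
(For anyone hitting the same mismatch, a quick sanity check along these lines can confirm which config file actually exists where the launcher will look; the two paths below are the ones from this thread, used purely as examples:)

import os

# Where the config was written vs. where the launcher reported loading it from.
written = "/workspace/cache/accelerate/default_config.yaml"
loaded = os.path.expanduser("~/.cache/huggingface/accelerate/default_config.yaml")

for label, path in (("written", written), ("loaded", loaded)):
    print(f"{label}: {path} exists={os.path.exists(path)}")

# Environment variables involved in the fix described above.
print("HF_HOME =", os.environ.get("HF_HOME"))
print("ACCELERATE_CONFIG_PATH =", os.environ.get("ACCELERATE_CONFIG_PATH"))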
