Memory requirements for full fine tune of SD 3.5 Large? #1118

Open
roblaughter opened this issue Nov 3, 2024 · 7 comments

@roblaughter

Trying to figure out the memory requirements for fine-tuning SD 3.5 Large. I spun up an L40 instance (48GB), but the script tried to allocate 96GB.

I'm new to the fine-tuning world, so I'm not sure where to look next. H100s on Runpod max out at 94GB. Is fine-tuning out of reach right now? Or would training on multiple GPUs divide up the resources? Are there any optimization strategies that I'm missing?

A point in the right direction would be greatly appreciated 🙏

@bghira
Owner

bghira commented Nov 3, 2024

you probably didn't enable gradient checkpointing, but it's hard to know where to begin w/o a config file
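
(For context, gradient checkpointing drops intermediate activations during the forward pass and recomputes them during backward, trading compute for memory. A minimal PyTorch sketch of the idea, using a toy module rather than the trainer's actual model code:)

import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    """Toy feed-forward block, only here to illustrate checkpointing."""
    def __init__(self, dim=2048):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim),
            torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim),
        )

    def forward(self, x, use_checkpoint=True):
        if use_checkpoint and self.training:
            # Activations inside self.ff are not kept; they are recomputed
            # in the backward pass, which is where the memory saving comes from.
            return checkpoint(self.ff, x, use_reentrant=False)
        return self.ff(x)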

@roblaughter
Author

it's hard to know where to begin w/o a config file

My bad. Gradient checkpointing is set to true.

config.json

{
    "--resume_from_checkpoint": "latest",
    "--data_backend_config": "config/multidatabackend.json",
    "--aspect_bucket_rounding": 2,
    "--seed": 42,
    "--minimum_image_size": 0,
    "--disable_benchmark": false,
    "--output_dir": "output/models",
    "--max_train_steps": 25000,
    "--num_train_epochs": 0,
    "--checkpointing_steps": 1000,
    "--checkpoints_total_limit": 5,
    "--hub_model_id": "vintage-film",
    "--push_to_hub": "true",
    "--push_checkpoints_to_hub": "true",
    "--model_card_safe_for_work": "true",
    "--tracker_project_name": "film-fine-tune",
    "--tracker_run_name": "test-1",
    "--report_to": "wandb",
    "--model_type": "full",
    "--pretrained_model_name_or_path": "stabilityai/stable-diffusion-3.5-large",
    "--model_family": "sd3",
    "--train_batch_size": 2,
    "--gradient_checkpointing": "true",
    "--caption_dropout_probability": 0.2,
    "--resolution_type": "pixel_area",
    "--resolution": 1024,
    "--validation_seed": 42,
    "--validation_steps": "500",
    "--validation_resolution": "1024x1024",
    "--validation_guidance": 5.0,
    "--validation_guidance_rescale": "0.0",
    "--validation_num_inference_steps": "40",
    "--validation_prompt": "a 35 year old British food critic exploring a narrow winding street in London",
    "--mixed_precision": "bf16",
    "--optimizer": "adamw_bf16",
    "--learning_rate": "5e-5",
    "--lr_scheduler": "polynomial",
    "--lr_warmup_steps": 100,
    "--base_model_precision": "no_change",
    "--validation_torch_compile": "false"
}

multidatabackend.json

[
    {
        "id": "film_photos-512",
        "type": "local",
        "instance_data_dir": "/workspace/film_photos",
        "crop": false,
        "crop_style": "random",
        "minimum_image_size": 128,
        "resolution": 512,
        "resolution_type": "pixel_area",
        "repeats": 5,
        "metadata_backend": "discovery",
        "caption_strategy": "textfile",
        "cache_dir_vae": "cache//vae-512"
    },
    {
        "id": "film_photos-1024",
        "type": "local",
        "instance_data_dir": "/workspace/film_photos",
        "crop": false,
        "crop_style": "random",
        "minimum_image_size": 128,
        "resolution": 1024,
        "resolution_type": "pixel_area",
        "repeats": 5,
        "metadata_backend": "discovery",
        "caption_strategy": "textfile",
        "cache_dir_vae": "cache//vae-1024"
    },
    {
        "id": "film_photos-512-crop",
        "type": "local",
        "instance_data_dir": "/workspace/film_photos",
        "crop": true,
        "crop_style": "random",
        "minimum_image_size": 128,
        "resolution": 512,
        "resolution_type": "pixel_area",
        "repeats": 5,
        "metadata_backend": "discovery",
        "caption_strategy": "textfile",
        "cache_dir_vae": "cache//vae-512-crop"
    },
    {
        "id": "film_photos-1024-crop",
        "type": "local",
        "instance_data_dir": "/workspace/film_photos",
        "crop": true,
        "crop_style": "random",
        "minimum_image_size": 128,
        "resolution": 1024,
        "resolution_type": "pixel_area",
        "repeats": 5,
        "metadata_backend": "discovery",
        "caption_strategy": "textfile",
        "cache_dir_vae": "cache//vae-1024-crop"
    },
    {
        "id": "text-embed-cache",
        "dataset_type": "text_embeds",
        "default": true,
        "type": "local",
        "cache_dir": "cache//text"
    }
]

@bghira
Owner

bghira commented Nov 3, 2024

Chances are you'll need to follow the DeepSpeed guide to enable full-rank training on the 8B model. I think otherwise it wants about 110-130GB of memory for everything (weights, optimizer states, gradients).
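
(A rough back-of-envelope for where a number in that range comes from, using assumed byte counts rather than measured figures:)

# Rough memory estimate for full-rank training of an ~8B-parameter model.
# Illustrative only: the real total also depends on optimizer state precision,
# activations, any resident text encoders/VAE, and allocator overhead.
params = 8e9
gib = 1024 ** 3

weights_bf16 = params * 2        # bf16 weights, ~15 GiB
grads_bf16   = params * 2        # bf16 gradients, ~15 GiB
adamw_bf16   = params * 2 * 2    # two AdamW moments in bf16, ~30 GiB
adamw_fp32   = params * 4 * 2    # the same moments in fp32, ~60 GiB

low = (weights_bf16 + grads_bf16 + adamw_bf16) / gib
high = (weights_bf16 + grads_bf16 + adamw_fp32) / gib
print(f"weights + gradients + optimizer: roughly {low:.0f}-{high:.0f} GiB")
# Activations (even with gradient checkpointing) and CUDA/framework overhead
# come on top, which is how the single-GPU total can climb past 100 GB.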

@roblaughter
Author

Chances are you'll need to follow the DeepSpeed guide to enable full-rank training on the 8B model

On it. Giving it a go now.

@roblaughter
Author

roblaughter commented Nov 3, 2024

Still struggling... If you can, help explain like I'm a noob. Because I am.

  1. I enabled DeepSpeed level 1 on the L40 48GB. It OOMed. It said it was trying to allocate something like 15 GB while 40-ish were already in use.
  2. Upped it to level 2. Still OOMed.
  3. Switched to an A100 80GB. Enabled DeepSpeed level 1. Still OOMed.
  4. Upped it to level 2. Still OOMed.

Tried to allocate 15.01 GiB. GPU 0 has a total capacity of 79.26 GiB of which 2.91 GiB is free. Process 2719587 has 76.34 GiB memory in use. Of the allocated memory 60.25 GiB is allocated by PyTorch, and 15.01 GiB is reserved by PyTorch but unallocated.

It seems like no matter how much VRAM I throw at it, it wants mooooore.

Any ideas on how to push past that?

EDIT: Tried DeepSpeed level 3 on 80GB and still got this:

Tried to allocate 15.01 GiB. GPU 0 has a total capacity of 79.26 GiB of which 2.91 GiB is free. Process 2734336 has 76.34 GiB memory in use. Of the allocated memory 60.25 GiB is allocated by PyTorch, and 15.01 GiB is reserved by PyTorch but unallocated.

@bghira
Owner

bghira commented Nov 3, 2024

It sounds like DeepSpeed may not be properly enabled then, because it definitely works on a single 80G card with level 2 for Flux's 12B params; that uses just 73G of VRAM.
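
(For reference, a minimal sketch of what the DeepSpeed setup amounts to, written against Hugging Face Accelerate's Python API purely as an illustration; the trainer itself is driven by the accelerate config YAML mentioned later in this thread, and this snippet assumes the deepspeed package is installed:)

# ZeRO stage 2 shards optimizer states and gradients across ranks; on a
# single GPU it is mainly the CPU offload of optimizer states that frees VRAM.
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

ds_plugin = DeepSpeedPlugin(
    zero_stage=2,                    # partition optimizer states + gradients
    offload_optimizer_device="cpu",  # keep optimizer states in system RAM
    gradient_accumulation_steps=1,
)
accelerator = Accelerator(mixed_precision="bf16", deepspeed_plugin=ds_plugin)
# The model, optimizer, and dataloaders would then go through
# accelerator.prepare(...) before the training loop.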

@roblaughter
Author

roblaughter commented Nov 3, 2024

I found the problem.

I had set HF_HOME to a directory on my network volume so I wouldn't have to keep re-downloading models every time I booted the server.

The accelerate config was being saved here:

accelerate configuration saved at /workspace/cache/accelerate/default_config.yaml

But it was being loaded from here:

Using Accelerate config file: /root/.cache/huggingface/accelerate/default_config.yaml

Setting ACCELERATE_CONFIG_PATH solved it, and training is rolling. Thanks!

Coming in at just under 60GB...
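
(For anyone hitting the same mismatch, a quick sanity check along these lines can confirm which config file actually exists where the launcher will look; the two paths below are the ones from this thread, used purely as examples:)

import os

# Where the config was written vs. where the launcher reported loading it from.
written = "/workspace/cache/accelerate/default_config.yaml"
loaded = os.path.expanduser("~/.cache/huggingface/accelerate/default_config.yaml")

for label, path in (("written", written), ("loaded", loaded)):
    print(f"{label}: {path} exists={os.path.exists(path)}")

# Environment variables involved in the fix described above.
print("HF_HOME =", os.environ.get("HF_HOME"))
print("ACCELERATE_CONFIG_PATH =", os.environ.get("ACCELERATE_CONFIG_PATH"))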
