kohya_ss training problem - driving insane #1702

Blaisegf · 2023-11-23T14:14:17Z

Blaisegf
Nov 23, 2023

I'm trying to train a Lora using kohya_ss. I'm using a computer with Windows 11, equipped with a 4090 GPU, an i9-13900KS processor (3.2 GHz), and a whopping 128GB of RAM (yes, I know, that number is absurd). My model is realistic, and I have a lot of photos available, taken from different angles and under different lighting conditions. The photos are in 4k, and I've resized them.

I've tried captions like BLIP, captions like Danbooru, I've tried resizing to 1024, and also resizing to 512. I've tried with many images and few repeats, I've tried the opposite, I've tried with 1 epoch as well as 6 epochs and 15 epochs. I've tried a lot of different configurations, and I've come to the same point: absolute nothingness, even after 10 different trainings with different configurations each time. My Loras are producing nothing. If I input a prompt with my trigger word and my Lora, there's almost no difference compared to the same prompt without the Lora. It's as if it's training on emptiness. I've obviously tried varying the learning rates, varying the network alpha, varying the optimizer, and changing the model (usually I try to train on "analogmadness").

Someone even sent me their own folders and captions with a more "anime" style for me to try on my end, and the result was similar: pure emptiness. Except for reinstalling kohya_ss, I've literally tried everything, at least everything that came to mind.

I did notice something, though: when I use the utility/verify Lora function of kohya_ss, something surprising happens... but I have no idea if it's relevant. My "number of Lora modules" is always 528, regardless of the Loras I put in that I've trained. However, when I test those I download from Civitai, they all vary in the number of modules. I don't know if that plays a role, but I have no other leads.

Also I always use Dadaptadam because the 8bit ones doesn't work on my computer for some reason.

Please, I beg that someone here has the miraculous solution because honestly, I really want to invest seriously in this, and it's starting to feel hopeless.

CCpt5 · 2023-11-25T21:05:42Z

CCpt5
Nov 25, 2023

I don't necessarily have the answer for you, but since no one else has replied: Have you tried pulling a previous commit to see if an older version works? I don't do a lot of LoRA stuff aside from extracting LoRAs from model checkpoints I've trained (I find that works best). The tool I use to do the extraction is kohya's script in this repo. For some reason it stopped working at some point and the extractions did what you say where they just don't trigger anything. So around the time that happened I rolled back to commit 3b83a1c (Oct 1st) and the extraction script worked again....

So that could be related to your issue. If you haven't and are familiar w/ checking out an older commit I'd recommend doing that. Curiously the actual LoRA extraction script I ended up keeping in that folder is from 8/22/23 - right before SDXL was merged into the main branch (I used the dev branch prior, so that's prob why). I may have actually rolled that folder forward also (See I have 2 or 3 git clones of kohya). It should work if you pull a commit around when SDXL training was merged into the main branch on 8/31/23 (633bb8d - commit: 633bb8d)

ChatGPT can guide you on how to checkout a particular commit (all listed in date order here: https://github.com/bmaltais/kohya_ss/commits/master). It'd probably be best to create a totally new git clone in a new folder just so if it doesn't work you can just delete the attempt.

Basically:

Open a terminal in the root folder
activate the env ( .\venv\scripts\activate)
git checkout 633bb8d
Test it

Hope maybe that helps - should at least help narrow down what the problem is.

BTW have the same resourcess :p - although dumb mobo (asus proarts 690) won't post w/ all 4 RAM sticks in - pretty ridiculous.....64gb works for me though....

0 replies

Vendaciousness · 2024-04-05T21:20:07Z

Vendaciousness
Apr 5, 2024

Also I always use Dadaptadam because the 8bit ones doesn't work on my computer for some reason.

Please, I beg that someone here has the miraculous solution because honestly, I really want to invest seriously in this, and it's starting to feel hopeless.

It's true that text captioning is more important than I once thought, but for personal use, I used to always use just a single trigger word that's unique (not linked to any vector info in the model)? You know, like 'MyAnim3L0R4'? You can try that in place of a regular descriptive text, then reference the trigger word and see if that works any different. It always worked for me. You may also have one of the settings that re-calibrates a model's strength, so using 1.0 won't blow out the image every time. This may be enabled. I know the main one was at the bottom of the first wall of settings in the GUI.

If your concern is good captions, but you hate writing and editing each one (like me), check out TagUI. Not the one that comes up in EVERY Google search, but rather this one: https://github.com/jhc13/taggui

It is so much better than all the other captioning tools and models, like BLIP2 can't compare to the newer ones they have, it's crazy everyone isn't using it. The descriptions are always way better than anything I'd write by hand, with no extra flowery text like a ChatGPT answer and no missing details.

Also. you mention 'always' using the same setting because it was all that runs without error, even though you're getting bad results, which may be due to how you set up your install with accelerate. This happened to me, as well and what fixed it was switching to BF16 (I think it's like Nvidia's proprietary version of fp16), so maybe try that first before trying the stuff below. It may fix it.

The bucket settings are important and training on too high of a resolution can break things. I don't use images with a side wideer/longer than 1280 pixels.

I don't use the GUI, since settings are all over the UI and every time I went through issues and someone suggested a setting change, I could never find it.

But here is my sd-scripts CLI command (what the GUI passes to the actual training script) which I run from CMD, if you want to try it. It should work perfectly for your 40-series GPU.

To use it,

Navigate to your sd-scripts directory (usually in your kohya_ss folder) and activate your VENV with this:

 .\venv\scripts\activate

Copy the commands below and paste the text into a notepad or another text editor.

python.exe sdxl_train_network.py --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048 --pretrained_model_name_or_path="[path\to\basemodel.safetensors]" --train_data_dir=[path\to\training\images] --caption_extension="txt" --resolution=1280,1280 --output_dir=[path for saving model] --"logging_dir=[path for training log]" --network_alpha=1 --save_model_as=safetensors --network_alpha="1" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=0.0004  --unet_lr=0.0004 --network_dim=8  --output_name="[YourLoRA Name_XL (not filename)]" --lr_scheduler_num_cycles="1"  --no_half_vae --full_bf16 --learning_rate="0.0004"  --lr_scheduler="constant" --train_batch_size="1"  --max_train_steps="5100" --save_every_n_epochs="1"  --mixed_precision="bf16" --save_precision="bf16"  --cache_latents --cache_latents_to_disk  --optimizer_type="Adafactor" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --max_data_loader_n_workers="0" --bucket_reso_steps=64 --mem_eff_attn  --gradient_checkpointing  --xformers --bucket_no_upscale --vae="[C:\path\to\SDXL\VAE.safetensors]"

If you are adding to an existing LoRA you already created, add this one, too:

--network_weight="[path\to\existing\LoRA file\you\made.safetensors]"

Replace everything in brackets with your local name/paths (lose the text/brackets but keep the quotes, especially if you have spaces in any folder names in any paths, or it will crash out).
Save the edited text command in a txt file for the next time you need to train. I do this each time, so I know which one I did last and which settings I used.
Paste it into the console and hit enter.

This setup is for SDXL LoRAs and I use it with my EVGA RTX 3080 12GB FTW3. This post could have saved me 4-8 hours and I hope it does that for someone. Good luck!
-V

0 replies

5KilosOfCheese · 2024-04-05T23:05:31Z

5KilosOfCheese
Apr 5, 2024

Can you post the settings you are trying to train on.

I have had this same issue when using LyCoris and such. I train something but... nothing seems to be going INTO the model.

Post the training settings and lets see if we can debug this out.

However I'd recommend trying just a plain AdamW and regular LoRA, with no captions, just on plain SDXL base model, all default settings. Something simple, clear and obvious as a subject. You have beefy enough hardware that you can probably iterate things quickly.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kohya_ss training problem - driving insane #1702

{{title}}

Replies: 3 comments

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

kohya_ss training problem - driving insane #1702

Blaisegf Nov 23, 2023

Replies: 3 comments

CCpt5 Nov 25, 2023

Vendaciousness Apr 5, 2024

5KilosOfCheese Apr 5, 2024

Blaisegf
Nov 23, 2023

CCpt5
Nov 25, 2023

Vendaciousness
Apr 5, 2024

5KilosOfCheese
Apr 5, 2024