
support SD3 #1374 (Draft)

kohya-ss wants to merge 231 commits into dev
Conversation

@kohya-ss (Owner) commented Jun 15, 2024

  • Replace SD3Tokenizer with the original CLIP-L/G/T5 tokenizers.
  • Extend the max token length to 256 for T5XXL.
  • Refactor caching for latents.
  • Refactor caching for Text Encoder outputs
  • Extract architecture-dependent parts from datasets.
  • Refactor SD/SDXL training scripts.
  • Cache attention masks, etc.
  • Enable training for CLIP-L/G for SD3.
  • Add an option to use T5XXL from transformers (for fp8 quantized ver.)
  • Add attention mask for T5XXL embeds (?). https://www.reddit.com/r/StableDiffusion/comments/1e6k59c/solution_discovered_partially_implemented_for_sd3/ (see the tokenization sketch after this list)
  • Sample images during training.
  • Cache Text Encoder outputs for sampling.
  • Update SD/SDXL sampling to use refactored Text Encoding etc.
  • Update gen_img.py to use refactored Text Encoding etc.
  • SD3 LoRA support.
  • SD3.5 support.
  • FLUX.1 fine tuning.
  • FLUX.1 LoRA support for FLUX.
  • FLUX.1 LoRA support for CLIP-L.
  • FLUX.1 masking for attention
  • FLUX.1 Sample image generation during training.
  • Update cache_latents.py and cache_text_encoder_outputs.py to support FLUX.1
  • Support .json metadata for FLUX.1 and SD3.
  • Add the captioning script with Florence-2 and/or JoyCaption.
  • Support prior preservation loss.
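
For reference, a rough sketch of what the two T5XXL items above (the 256-token length and an attention mask on the embeds) could look like with the transformers tokenizer; the checkpoint name and the masking step are illustrative assumptions, not code from this PR:

    # Hypothetical sketch: 256-token T5XXL tokenization plus masking of padded positions.
    import torch
    from transformers import T5TokenizerFast, T5EncoderModel

    tokenizer = T5TokenizerFast.from_pretrained("google/t5-v1_1-xxl")  # assumed checkpoint
    encoder = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl", torch_dtype=torch.float16)

    tokens = tokenizer(
        "a prompt", max_length=256, padding="max_length", truncation=True, return_tensors="pt"
    )
    with torch.no_grad():
        t5_out = encoder(tokens.input_ids, attention_mask=tokens.attention_mask).last_hidden_state

    # Zero out embeddings at padded positions (the idea discussed in the linked Reddit post).
    t5_out = t5_out * tokens.attention_mask.unsqueeze(-1).to(t5_out.dtype)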

@bghira commented Jun 16, 2024

this is a chance to just use Diffusers modules instead of doing everything from scratch. why not take it?

@kohya-ss (Owner, Author)

There are several reasons for this, but the biggest is that the Diffusers modules are difficult to extend, for example for LoRA, custom ControlNet, Deep Shrink, etc.

Also, considering the various processes in the training scripts, such as conditional loss, SNR, masked loss, etc., the training scripts need to be written from scratch anyway.

@bghira commented Jun 16, 2024

all of that is done via peft other than deepshrink but you can make a pipeline callback for that.

@bghira commented Jun 16, 2024

i mean to use the sd3 transformer module from the diffusers project.

it is frustrating to see bespoke versions of things with unreadable comments always in this repository. can you at least leave better comments?

@kohya-ss (Owner, Author)

I think the transformer module should be extendable in the future. In addition, the SD3 transformer is based on sd3-ref (Stability AI's official repo) and was modified by KBlueLeaf to support xformers etc., so it predates the Diffusers implementation and is not written from scratch. I appreciate your understanding.

I will add better comments in future code, including this PR.

@araleza commented Jul 10, 2024

Hello, I have been trying out SD3 training. It seems to be working pretty well. 😊

One thing I noticed is that generation of sample images while training is not yet implemented. This made it hard for me to see how my SD3 training was going, and make adjustments.

Implementing full support for all the sample images was difficult, but I found a cheap way to get most features working, and now I have sample images working again. This code is not properly integrated with the usual sample image generation code, but if people want to use it while they wait for a real well-integrated implementation, it does the basics of what's needed.

Just go into your sd3_train.py file, and find this commented-out section:

                # sdxl_train_util.sample_images(
                #     accelerator,
                #     args,
                #     None,
                #     global_step,
                #     accelerator.device,
                #     vae,
                #     [tokenizer1, tokenizer2],
                #     [text_encoder1, text_encoder2],
                #     mmdit,
                # )

and replace that with this:

                # Generate sample images
                if args.sample_every_n_steps is not None and global_step % args.sample_every_n_steps == 0:
                    from sd3_minimal_inference import do_sample
                    from PIL import Image
                    import datetime
                    import numpy as np
                    import shlex
                    import random

                    assert args.save_t5xxl, "When generating sample images in SD3, --save_t5xxl parameter must be set"

                    with open(args.sample_prompts, 'r') as file:
                        lines = [line.strip() for line in file if line.strip()]

                    vae.to("cuda")
                    for line in lines:
                        logger.info(f"Generating image: {line}")

                        if line.find('--') != -1:
                            prompt = line[:line.find('--') - 1].strip()
                            line = line[line.find('--'):]
                        else:
                            prompt = line
                            line = ''

                        parser_s = argparse.ArgumentParser()
                        parser_s.add_argument("--w", type=int, action="store", default=1024, help="image width")
                        parser_s.add_argument("--h", type=int, action="store", default=1024, help="image height")
                        parser_s.add_argument("--s", type=int, action="store", default=30,   help="sample steps")
                        parser_s.add_argument("--l", type=int, action="store", default=4,    help="CFG")
                        parser_s.add_argument("--d", type=int, action="store", default=random.randint(0, 2**32 - 1), help="seed")
                        prompt_args = shlex.split(line)
                        args_s = parser_s.parse_args(prompt_args)

                        # prepare embeddings
                        lg_out, t5_out, pooled = sd3_utils.get_cond(prompt, sd3_tokenizer, clip_l, clip_g, t5xxl) # +'ve prompt
                        cond = torch.cat([lg_out, t5_out], dim=-2), pooled

                        lg_out, t5_out, pooled = sd3_utils.get_cond("", sd3_tokenizer, clip_l, clip_g, t5xxl) # No -'ve prompt
                        neg_cond = torch.cat([lg_out, t5_out], dim=-2), pooled

                        latent_sampled = do_sample(
                            args_s.h, args_s.w, None, args_s.d, cond, neg_cond, mmdit, args_s.s, args_s.l, weight_dtype, accelerator.device
                        )

                        # latent to image
                        with torch.no_grad():
                            image = vae.decode(latent_sampled)
                        image = image.float()
                        image = torch.clamp((image + 1.0) / 2.0, min=0.0, max=1.0)[0]
                        decoded_np = 255.0 * np.moveaxis(image.cpu().numpy(), 0, 2)
                        decoded_np = decoded_np.astype(np.uint8)
                        out_image = Image.fromarray(decoded_np)

                        # save image
                        output_dir = os.path.join(args.output_dir, "sample")
                        os.makedirs(output_dir, exist_ok=True)
                        output_path = os.path.join(output_dir, f"{datetime.datetime.now().strftime('%Y%m%d_%H%M%S')}.png")
                        out_image.save(output_path)

                    vae.to("cpu")

It supports a caption followed by the usual optional --w, --h, --s, --l, --d (for width, height, steps, cfg, and seed). It doesn't support negative captions, and it won't work right with captions longer than 75 tokens.
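
For example, a line in the --sample_prompts file could look like this (an illustrative prompt, not one taken from the PR):

    a portrait photo of a woman in a red dress, bokeh background --w 1024 --h 1024 --s 30 --l 4 --d 42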

I'm finding sample image generation to be helpful. For example, I notice that most of my sample output images start off by looking brighter than expected (with white or bright backgrounds). Edit: Might have been my CFG of 7.5; SD3 seems to want lower CFG values. I had to push the sample step count up as the CFG was lowered. Image quality still seems poor though, compared to what some people are getting out of SD3.

@araleza commented Jul 10, 2024

Think I've found an issue that's causing the poor quality SD3 samples. The do_sample() function is not filling in the shift parameter that's required by SD3, and it's defaulting to 1.0 instead of the recommended 3.0:

class ModelSamplingDiscreteFlow:
    """Helper for sampler scheduling (ie timestep/sigma calculations) for Discrete Flow models"""

    def __init__(self, shift=1.0):
        self.shift = shift
        timesteps = 1000
        self.sigmas = self.sigma(torch.arange(1, timesteps + 1, 1))

From sd-scripts' sd3_minimal_inference.py, in do_sample():

    model_sampling = sd3_utils.ModelSamplingDiscreteFlow()

From the SD3 paper:
[image]

The paper also seems to say that these shifts to the sigmas should be present during training. Are these maybe missing too, @kohya-ss? (Edit: No, a shift value of 3.0 is already set up correctly during training)
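
For anyone patching this locally before a fix lands, the change amounts to passing the recommended value at the call site shown above (a sketch of the idea, not the exact committed diff):

    # Use the SD3-recommended shift of 3.0 instead of the default of 1.0.
    model_sampling = sd3_utils.ModelSamplingDiscreteFlow(shift=3.0)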

@kohya-ss (Owner, Author)

Think I've found an issue that's causing the poor quality SD3 samples. The do_sample() function is not filling in the shift parameter that's required by SD3, and it's defaulting to 1.0 instead of the recommended 3.0:

Thank you! I fixed it. The generated images seem better now.

@kohya-ss (Owner, Author)

I agree that the sample image generation is really useful. In my understanding, T5XXL runs on the CPU, so I wonder whether get_cond may take a long time. How long does it take?

I think it might be necessary to get the Text Encoder outputs for the sampling prompts in advance, at the same time as the TE caching. However, if T5XXL works on the CPU in an acceptable time, the implementation of sample generation will be much easier (like your implementation :).
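
A rough sketch of that idea, reusing sd3_utils.get_cond from the snippet above (the helper and cache names are hypothetical, and the surrounding training script is assumed to provide the tokenizer and text encoders): compute the Text Encoder outputs for each sample prompt once at caching time, then look them up at sampling time.

    import torch

    # Hypothetical: pre-compute TE outputs for the sample prompts during the caching phase.
    sample_prompt_cache = {}

    def cache_sample_prompt_conds(prompts, sd3_tokenizer, clip_l, clip_g, t5xxl):
        for prompt in prompts:
            lg_out, t5_out, pooled = sd3_utils.get_cond(prompt, sd3_tokenizer, clip_l, clip_g, t5xxl)
            sample_prompt_cache[prompt] = (torch.cat([lg_out, t5_out], dim=-2).cpu(), pooled.cpu())

    # Later, at sampling time, look the prompt up instead of re-running the (slow) encoders:
    # cond = tuple(t.to(accelerator.device) for t in sample_prompt_cache[prompt])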

@bghira commented Jul 11, 2024

it takes about 30-50 seconds to run T5-XL on the CPU; I think XXL has even worse latency for each embed.

@araleza commented Jul 11, 2024

I agree that the sample image generation is really useful. In my understanding, T5XXL runs on the CPU, so I wonder whether get_cond may take a long time. How long does it take?

@kohya-ss, the calls to get_cond() only take around 2 seconds each on my machine. The whole sample image generation takes just 16 seconds per image for me, and I am still doing 80 sample steps for the images. :D

My PC is an ordinary (but good) home machine with a 13th-gen Intel i7, and I've got 64 GB of CPU RAM. Perhaps the people finding T5-XL to be very slow are running out of CPU memory and swapping it out to disk without realizing? @bghira

@bghira commented Nov 6, 2024

you are correct, and in fact poorly "training" the text encoder without contrastive loss also causes bleeding.

@dsienra commented Nov 6, 2024

... not training the TE is not an option because it is going to bleed and will not learn the concepts correctly.

Correct me if I'm wrong, but I don't believe that's true. All base models are trained with vanilla text encoders and perfectly capture every concept. While training the CLIP text encoder can accelerate learning, it doesn't prevent concept bleed or improve concept understanding. New concepts can be learned without modifying the text encoder.

On SD1.5, SDXL, etc., the CLIP encoder is embedded in the checkpoint and normally you train it for best quality. For training multiple concepts of the same class, training the text encoder is mandatory to prevent concept bleeding. If you are going to train just one person, for example, training the TE is not so important, but if you train 5 men, for example, they will bleed into each other and you will lose resemblance. I trained 4 people on SD3.5, and kohya saves 3 files: one for each CLIP and one for the unet. If you load the vanilla text encoders instead of the trained ones at inference, the results are night and day, from perfect resemblance to none; the result is a mix of all the trained persons. CLIP-G is the one that makes the biggest difference if you replace it, just test it. This problem was not evident on older models because the CLIP was embedded in the checkpoints and not loaded separately.

@NonaBuhtig commented Nov 6, 2024

On SD1.5, SDXL, etc., the CLIP encoder is embedded in the checkpoint and normally you train it for best quality.

Training CLIP was a trick that emerged in the early days of finetuning SD 1.4. While it can accelerate the process, it doesn't necessarily lead to a better-quality model. As bghira has noted, it can even be worse in certain conditions.
I've made a few SD 1.5 finetunes without training CLIP and had great results. They just took a little longer to make.
The fact that the CLIP encoder is embedded in the checkpoint does not make it mandatory to train it.

For training multiple concepts of the same class, training the text encoder is mandatory to prevent concept bleeding. If you are going to train just one person, for example, training the TE is not so important, but if you train 5 men, for example, they will bleed into each other and you will lose resemblance. I trained 4 people on SD3.5, and kohya saves 3 files: one for each CLIP and one for the unet. If you load the vanilla text encoders instead of the trained ones at inference, the results are night and day,

I'm not sure I understand. SD 3.5 relies on CLIP-G, CLIP-L, and T5. If you train one or more of these text encoders but don't use them during inference, you'll likely get unexpected results.

from perfect resemblance to none; the result is a mix of all the trained persons. CLIP-G is the one that makes the biggest difference if you replace it, just test it.

If I follow your reasoning, you seem to suggest that base models like SD, XL, and FLUX represent all men or women as a single average of the dataset because they were trained with vanilla text encoders. And this isn't the case, even in models with limited diversity.

This problem was not evident on older models because the CLIP was embedded in the checkpoints and not loaded separately.

I still don't understand why embedding the text encoder in the checkpoint affects whether or not it needs to be trained.
When you share your finetunes, you can share them with either the trained or the vanilla text encoders.

Lastly, while I have personally never experimented with multi-concept fine-tuning, I'm pretty sure the challenges primarily stem from model architecture and/or training methodology, not only from text encoder training, even if I agree that it can help in some cases.

@dsienra commented Nov 6, 2024

On SD1.5, SDXL, etc., the CLIP encoder is embedded in the checkpoint and normally you train it for best quality.

Training CLIP was a trick that emerged in the early days of finetuning SD 1.5. While it can accelerate the process, it doesn't necessarily lead to a better-quality model. As bghira has noted, it can even be worse in certain conditions. I've made a few SD 1.5 finetunes without training CLIP and had great results. They just took a little longer to make. The fact that the CLIP encoder is embedded in the checkpoint does not make it mandatory to train it.

For training multiple concepts of the same class, training the text encoder is mandatory to prevent concept bleeding. If you are going to train just one person, for example, training the TE is not so important, but if you train 5 men, for example, they will bleed into each other and you will lose resemblance. I trained 4 people on SD3.5, and kohya saves 3 files: one for each CLIP and one for the unet. If you load the vanilla text encoders instead of the trained ones at inference, the results are night and day,

I'm not sure I understand. SD 3.5 relies on CLIP-G, CLIP-L, and T5. If you train one or more of these text encoders but don't use them during inference, you'll likely get unexpected results.

from perfect resemblance to none; the result is a mix of all the trained persons. CLIP-G is the one that makes the biggest difference if you replace it, just test it.

If I follow your reasoning, you seem to suggest that base models like SD, XL, and FLUX represent all men or women as a single average of the dataset because they were trained with vanilla text encoders. And this isn't the case, even in models with limited diversity.

This problem was not evident on older models because the CLIP was embedded in the checkpoints and not loaded separately.

I still don't understand why embedding the text encoder in the checkpoint affects whether or not it needs to be trained. When you share your finetunes, you can share them with either the trained or the vanilla text encoders.

Lastly, while I have personally never experimented with multi-concept fine-tuning, I'm pretty sure the challenges primarily stem from model architecture and/or training methodology, not only from text encoder training, even if I agree that it can help in some cases.

You are right: base models like FLUX and SD3.5 come with the vanilla text encoders, and it's true they learned the concepts perfectly with text encoder training disabled. When talking about SD1.5 and SDXL, some custom finetuned models have trained the TE and others have not, but because it is embedded you don't really know whether it was trained or not; it just works. What you said is completely reasonable: training the text encoder is not mandatory, but maybe it is needed to get results without bleeding in a reasonable time. To mimic how the base models were trained, maybe I would need to train with a very low learning rate for a huge number of epochs to add 5 new people to the model without them bleeding into each other. FLUX is harder to train because it is distilled, but any non-distilled base model can be trained without training the TE. In my case, I need to train it if I want good results in a reasonable time.
My training tests on SD3.5 without training the text encoder were a failure; it bleeds a lot. Maybe it's the learning rate, maybe it needs more epochs, I really don't know, but with the TE it worked. Still, you're right that it is not mandatory. Thanks for the good information.

@bghira commented Nov 6, 2024

you can easily check the contents of the text encoder to see if it's trained or not, but surprisingly if you don't train the text encoder then people who use the model can keep the base frozen one loaded and save time and disk space.

multi-concept training doesn't change the fact that CLIP is still CLIP and needs contrastive loss with image pairs shown during training time. it uses a hinge loss function, which kohya does not implement.
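
For readers unfamiliar with the term, here is a minimal sketch of an image-text contrastive objective in the spirit of CLIP pretraining, using the symmetric cross-entropy formulation from the CLIP paper (shown only to illustrate what "contrastive loss with image pairs" means; sd-scripts does not implement this, and the exact loss variant is a side point):

    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(image_embeds, text_embeds, logit_scale):
        # Normalize, then score every image against every text in the batch.
        image_embeds = F.normalize(image_embeds, dim=-1)
        text_embeds = F.normalize(text_embeds, dim=-1)
        logits = logit_scale * image_embeds @ text_embeds.t()  # (batch, batch)
        targets = torch.arange(logits.shape[0], device=logits.device)
        # Matched image/text pairs sit on the diagonal; all other entries are negatives.
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2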

@dsienra commented Nov 6, 2024

you can easily check the contents of the text encoder to see if it's trained or not, but surprisingly if you don't train the text encoder then people who use the model can keep the base frozen one loaded and save time and disk space.

multi-concept training doesn't change the fact that CLIP is still CLIP and needs contrastive loss with image pairs shown during training time. it uses a hinge loss function, which kohya does not implement.

Thanks for your response.

In the case of SD 1.5 and SDXL, the CLIP model is small and always embedded in the checkpoint, so you can see its content. But what I meant is that the user really doesn't care, because it is embedded and the correct TE is always loaded.

In newer models, if you load the CLIP from a different file, it is very relevant to load the correct trained TE models. The model I'm training is for personal use. I trained on SDXL a model with 20 people just for fun—family and friends. On SDXL, the bleeding is minimal or nonexistent. I achieved great results training the TE, much better than without. Now I want to do the same with newer models. Flux.dev is distilled; it bleeds, and using the same dataset is a disaster. Now with SD3.5, I tried without training the TE, and I had the same problem: bleeding between the trained subjects. Training the CLIPs, the results were much better, so in my case, it helps.

I agree it is better to have the text encoders frozen so I can have just the vanilla ones to save disk space, but how do I fix the bleeding? I know that you understand what you are talking about. I'm just a user, and what I know I learned from discussions like this. I just want to train multiple people with no bleeding; I don't care if the training is with frozen TEs or not. I don't know what a hinge loss function is or how it works; I just want my training to work without bleeding. If you have any suggestions for my case, they will be very welcome.

Thanks, I appreciate your response and the valuable information.

@bghira commented Nov 6, 2024

It's not distillation that makes Flux bleed; every model bleeds, whether CLIP is involved or not. Flux is particularly prone to bleeding because it is 12 billion parameters, and a LoRA does not capture enough information to precisely train anything. LyCORIS LoKr works monumentally better at approximating the results of a full-rank training session.

you are comparing 900M / 2.6B parameter DDPM u-net with 12B parameter rectified flow-matching diffusion transformer.

even beyond parameter scales there is the channel count in the VAE that makes learning harder - higher channel count = harder objective.

edit: also, Flux's transformer model transforms the text and image inputs through each layer, being multi-modal. training these layers is effectively training a "text encoder".

@dsienra commented Nov 6, 2024

It's not distillation that makes Flux bleed; every model bleeds, whether CLIP is involved or not. Flux is particularly prone to bleeding because it is 12 billion parameters, and a LoRA does not capture enough information to precisely train anything. LyCORIS LoKr works monumentally better at approximating the results of a full-rank training session.

you are comparing 900M / 2.6B parameter DDPM u-net with 12B parameter rectified flow-matching diffusion transformer.

even beyond parameter scales there is the channel count in the VAE that makes learning harder - higher channel count = harder objective.

edit: also, Flux's transformer model transforms the text and image inputs through each layer, being multi-modal. training these layers is effectively training a "text encoder".

I was talking about a full finetune, not a LoRA, but it will be nice to try LyCORIS LoKr and see how it works. As far as I know, being a distilled model is a problem for training because the model tends to collapse. I tried with the de-distilled version and it worked better for LoRA, but on a full finetune I had bleeding too; in all my tests it bleeds even on a full finetune if you train multiple subjects. My best result with FLUX for multiple subjects was with the de-distilled model, training a LoRA with TE training enabled. kohya does not support TE training for full finetuning of FLUX at the moment. Maybe it's the hyperparameters; I'm using Adafactor with some settings whose purpose I don't understand: "scale_parameter=False relative_step=False warmup_init=False weight_decay=0.01". It could also be the learning rate, I tried many. What I know is that FLUX is perfect for training one person at a time, it achieves awesome results, but for multiple people it simply doesn't work with my config, it bleeds a lot. So now I'm trying with SD3.5 Large.
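
(For reference, those flags map onto the constructor of the transformers Adafactor implementation roughly as follows; the model and learning rate below are placeholders, and this is only a sketch of what passing them as optimizer arguments is assumed to do.)

    import torch
    from transformers.optimization import Adafactor

    model = torch.nn.Linear(4, 4)  # stand-in for the trainable parameters
    optimizer = Adafactor(
        model.parameters(),
        lr=1e-5,                 # placeholder; only used because relative_step=False
        scale_parameter=False,   # do not scale the learning rate by each parameter's RMS
        relative_step=False,     # use the explicit lr above instead of a built-in time-dependent schedule
        warmup_init=False,       # warmup_init only has an effect when relative_step=True
        weight_decay=0.01,       # decoupled weight decay
    )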

@dsienra commented Nov 6, 2024

There must be a way to avoid bleeding and train multiple people, but I don't know how, and at the moment nobody has been able to help me with this. Any suggestions will be appreciated.

@bghira commented Nov 6, 2024

bleeding for full-rank training is almost always a dataset issue. as mentioned earlier, the model learnt millions of concepts without bleeding during pretraining due to the width of the dataset and the variety of content contained within.

people seem to think they should be able to finetune fully with just 300 images or less but the answer has always been longer training runs with more well-labeled data and carefully selected hyperparameters.

you can cook and bleed a model with any training method.

@NonaBuhtig commented Nov 6, 2024

people seem to think they should be able to finetune fully with just 300 images or less but the answer has always been longer training runs with more well-labeled data and carefully selected hyperparameters.

Exactly what I think. A carefully balanced and labeled dataset is key.👌
And also patience 🤞

@araleza commented Nov 6, 2024

There must be a way to avoid bleeding and train multiple people, but I don't know how, and at the moment nobody has been able to help me with this. Any suggestions will be appreciated.

I don't know what the memory constraints are like for full fine-tuning, but I notice you're using Adafactor. In my experience, it's a particularly weak optimizer, designed to reduce memory usage at the expense of learning quality. You could give --optimizer_type adamwschedulefree a try?
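
(For context, the schedule-free AdamW that option refers to comes from the standalone schedulefree package; below is a minimal sketch with placeholder model and learning rate. The important quirk is the train()/eval() switching around optimization and evaluation.)

    import torch
    import schedulefree

    model = torch.nn.Linear(4, 4)  # stand-in for the trainable parameters
    optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=1e-5, weight_decay=0.01)

    optimizer.train()   # must be called before training steps
    # ... forward pass, loss.backward(), optimizer.step(), optimizer.zero_grad() ...
    optimizer.eval()    # must be called before evaluation or before saving a checkpoint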

@bghira commented Nov 6, 2024

well orthogonality (cardinality in the dataset, the 'balance') is not as simple as "ok we're training 4 concepts with 1k images so we need 250 images from each" because some of those concepts learn more easily, and others harder.

I wish the sigmoid distance could be measured for a concept in a meaningful way, but this doesn't always translate to learnability either. For example, training a psychedelic style for Flux was pretty far from the base outputs, but that thing fried in 300-400 steps. It needed a much lower learning rate.

When combining male subjects and female subjects in the training data, the males' likeness is always more difficult to learn. There is a lot more female pretraining data on the web.

similarly for typography dataset, some fonts are more prevalent in pretraining and will be easily biased during finetuning.

a good classic example is a single subject. use their real name for an inference test. check that "Jane Smith" is:

  • human
  • same gender
  • within the age bracket of the training subject
  • within the same general skin tone / ethnic background

the further the base model's understanding of these from your training subject, the harder their likeness is to train into it. not impossible but when doing multi-subject all of these factors combine to create a truly remarkable obstacle to overcome.

@dsienra commented Nov 6, 2024

RTX 3090 Ti, 24 GB VRAM. I'm doing just a DreamBooth, not a full finetuning. I want to add, for example, 10 people to the model, nothing more: 25 images per subject, captions like "NameSurname man", "NameSurname woman", etc. It worked on SDXL with both LoRA and finetuning, TE enabled in both cases. Now with FLUX it doesn't work, it bleeds a lot; with SD3.5 it works quite well in my first tests with TE enabled, but with TE disabled it bleeds a lot. That's my situation. I want to replicate what worked on SDXL but with the newer models. I use Adafactor because my VRAM limit is 24 GB, and I'm training locally. Is it impossible, or is there a way to train this in a reasonable time? Maybe it would work with TE disabled and a very low LR like 1e-7, but that would take months to finish. I really don't know.

@bghira commented Nov 6, 2024

10 people wow, is that all? two is hard enough eh. I simply don't believe you that a simple standard LoRA with CLIP training makes 10 subjects work without degrading the rest of the model. sorry.

@dsienra commented Nov 6, 2024

10 people wow, is that all? two is hard enough eh. I simply don't believe you that a simple standard LoRA with CLIP training makes 10 subjects work without degrading the rest of the model. sorry.

Yes, maybe the model was degraded, but it was usable. You can always disable the LoRA if you need to, then enable it again to inpaint the face with ADetailer, etc. I forgot to mention that I used regularization images with class captions ("people") for the SDXL trainings.

@dsienra commented Nov 6, 2024

10 people wow, is that all? two is hard enough eh. I simply don't believe you that a simple standard LoRA with CLIP training makes 10 subjects work without degrading the rest of the model. sorry.

I'm going to explain more about how the SDXL model ended up working. If I use a long prompt like "closeup portrait shot of a cyberpunk NameSurname woman in a scenic dystopian environment, intricate, elegant, highly detailed, centered, digital painting, concept art, smooth, sharp focus, illustration", I get quite a good image, but the face resemblance is diminished. Then in the ADetailer prompt I put just "NameSurname woman" and the face is inpainted with great resemblance, and I get the final image. If it's not perfect, I can inpaint again until I get the result I want.

@dsienra commented Nov 6, 2024

OK, I'm going to try again without TE training on SD3.5, with a low learning rate of 5e-7, for 500 epochs, 4 people of the same class, 25 images each, without regularization images, and see if it learns something. Maybe it will be a total failure, but who knows.

Am I going insane? Is this a terrible idea? I want your opinion, guys.

@sdbds (Contributor) commented Nov 6, 2024

bleeding for full-rank training is almost always a dataset issue. as mentioned earlier, the model learnt millions of concepts without bleeding during pretraining due to the width of the dataset and the variety of content contained within.

people seem to think they should be able to finetune fully with just 300 images or less but the answer has always been longer training runs with more well-labeled data and carefully selected hyperparameters.

you can cook and bleed a model with any training method.

Flux is completely incapable of fine-tuning multiple concepts because of distillation, and I thought most people knew that.

@bghira commented Nov 6, 2024

no, this is absolutely untrue. distilled models can be fine-tuned just like any other model, either with de-distillation as an objective or with distillation-preserved training.

https://wandb.ai/bghira/preserved-reports/reports/Bghira-s-Search-for-Reliable-Multi-Subject-Training--Vmlldzo5MTY5OTk1

[image]

@recris commented Nov 6, 2024

I wonder how effective it would be to use a combined embedding + DiT training approach (aka pivotal tuning) to solve this kind of thing. I've seen some successful examples on less capable models (with CLIP), and I am curious about its effectiveness when paired with the more powerful T5 encoder.

But we'd have to get a textual inversion trainer script working first.
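
(For context, the "embedding" half of pivotal tuning is textual inversion: register a new placeholder token and optimize only its embedding row while everything else stays frozen. Below is a minimal sketch against a CLIP text encoder from transformers; all names are illustrative, and as noted above sd-scripts has no such trainer yet.)

    import torch
    from transformers import CLIPTextModel, CLIPTokenizer

    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

    # Register a new placeholder token and give it a trainable embedding row.
    tokenizer.add_tokens(["<my-subject>"])
    text_encoder.resize_token_embeddings(len(tokenizer))
    new_id = tokenizer.convert_tokens_to_ids("<my-subject>")

    embeddings = text_encoder.get_input_embeddings()
    embeddings.weight.requires_grad_(True)

    # Optimize only the new row; a gradient mask keeps every other token frozen.
    optimizer = torch.optim.AdamW([embeddings.weight], lr=5e-4)
    grad_mask = torch.zeros_like(embeddings.weight)
    grad_mask[new_id] = 1.0

    # Inside the training loop (diffusion loss computation omitted):
    #   loss.backward()
    #   embeddings.weight.grad *= grad_mask   # zero gradients for all original tokens
    #   optimizer.step(); optimizer.zero_grad()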

@sdbds (Contributor) commented Nov 6, 2024

no, this is absolutely untrue. distilled models can be fine-tuned just like any other model, either with de-distillation as an objective or with distillation-preserved training.

https://wandb.ai/bghira/preserved-reports/reports/Bghira-s-Search-for-Reliable-Multi-Subject-Training--Vmlldzo5MTY5OTk1

[image]

What I have to specifically point out is that by DreamBooth I meant fine-tuning the original model itself, not building an external PEFT model like a LoRA or LoKr. You're using LoKr to build an external model that doesn't bleed and has good multi-concept performance, as some people found out when FLUX first came out.

Here's my test from 3 months ago.
https://x.com/bdsqlsz/status/1825072306726785364

@bghira commented Nov 6, 2024

ok, well, i'm sorry your test did not work 3 months ago. here is a full multi-GPU finetune of flux that did not use any PEFT methods. but it's just wrong to tell people full training just doesn't work. too many people have done it, proving this statement wrong.

@sdbds (Contributor) commented Nov 6, 2024

ok, well, i'm sorry your test did not work 3 months ago. here is a full multi-GPU finetune of flux that did not use any PEFT methods. but it's just wrong to tell people full training just doesn't work. too many people have done it, proving this statement wrong.

Obviously I was referring to the DreamBooth method in this repository, and that was in response to the person above. Not everyone will modify the code to train with the original CFG, and most people don't even use an H100, only consumer graphics cards.
The question of success has shifted from technology to cost.
While my claims may be somewhat arbitrary, I hope people will move to non-distilled models like SD3.5 or Aura.

@sdbds (Contributor) commented Nov 6, 2024

ok, well, i'm sorry your test did not work 3 months ago. here is a full multi-GPU finetune of flux that did not use any PEFT methods. but it's just wrong to tell people full training just doesn't work. too many people have done it, proving this statement wrong.

I'm glad it worked out for you, but I rarely read Reddit; I mostly read Twitter and GitHub, and there's not much there about the model.

@bghira commented Nov 6, 2024

i don't think there is a question of technology or cost. a 4090 works just as well as H100 for fine-tuning flux. you can use SD 3.5 or Aura all you like. just don't tell people misinformation about Flux?

also, no code changes are required to use 'the original CFG to train'. it is just parameters like setting the flux guidance value to 1 or 3.5. it is really nothing special and we've been doing it since early August now.

@sdbds (Contributor) commented Nov 6, 2024

i don't think there is a question of technology or cost. a 4090 works just as well as H100 for fine-tuning flux. you can use SD 3.5 or Aura all you like. just don't tell people misinformation about Flux?

also, no code changes are required to use 'the original CFG to train'. it is just parameters like setting the flux guidance value to 1 or 3.5. it is really nothing special and we've been doing it since early August now.

If it weren't a cost issue or a technical issue, it's hard to imagine why there aren't more than 10 non-merged Flux checkpoint models in the community right now.

I don't doubt the wisdom of the community, so I attribute it to the cost implications of the model itself.

@sdbds (Contributor) commented Nov 6, 2024

i don't think there is a question of technology or cost. a 4090 works just as well as H100 for fine-tuning flux. you can use SD 3.5 or Aura all you like. just don't tell people misinformation about Flux?

also, no code changes are required to use 'the original CFG to train'. it is just parameters like setting the flux guidance value to 1 or 3.5. it is really nothing special and we've been doing it since early August now.

[image]

I checked the posts on Reddit, and it looks like I'm not alone in my view of the fine-tuning terminology.
Of course I understand that LoKr approximates full-rank fine-tuning, but that is still external-model PEFT tuning.

@bghira commented Nov 7, 2024

hey, i noticed you're bringing up screenshots and shifting the focus away from the main issue. let's keep the conversation on topic to respect everyone's time, including the project owner and other contributors. the question's been answered, so let's wrap this up. thanks for understanding.
