Replies: 3 comments 1 reply
-
The text component in these models doesn't understand grammar, and this is one of the big problems you face in both generation and training. It helps to think of it as a separate language from English that just happens to use English words.

The best way to caption things is to treat everything as simple statements of fact, in the manner of: what does this picture contain? Answer: "three marbles", "blue marble", "green marble", "red marble". Which gives you the caption: "three marbles, blue marble, green marble, red marble." The model has most definitely seen "red marble" as a separate thing, so it can understand that for sure. Throw grammar out of the window when writing captions. Finding the best and most effective way of captioning (which changes with every model variant and version) is just trial and error, because it is pointless trying to describe something to the AI in terms the AI has no context for.

Along with this, it is good to avoid terms that are "polluted" by SEO/clickbait nonsense, for example "diaper", "gag", "sleep", "lie", "lay", "underwear", "bedding", etc. If you struggle to train something, figure out another term for what you want to train. To check whether a term is polluted, throw it into Google and see how many totally irrelevant Amazon/eBay/Wish/Alibaba listings it pulls up. For example, I struggled to train a specific scar pattern while keeping the material style, and only when I trained it as "wearing (scar)", as if it were a shirt, did I get it to work exactly how I wanted.

These days I generally avoid using captions in training, and I don't know how booru models work to begin with, so take this with a truckload of salt. But before I train anything, with or without captions, I test the model I want to train on extensively, in a specific manner, to figure out the "language" it prefers.
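For anyone unsure how such fact-style captions are usually attached to a dataset: many training scripts (kohya's sd-scripts among them) read a plain-text caption file with the same basename as each image. Below is a minimal sketch of writing captions in that style; the folder name and image file names are placeholders, not anything from this thread.

```python
from pathlib import Path

# Hypothetical layout: each image gets a .txt file with the same basename.
# Captions are plain "statements of fact", comma-separated, no grammar.
captions = {
    "marbles_001.png": "three marbles, blue marble, green marble, red marble",
    "marbles_002.png": "two marbles, red marble, green marble, wooden table",
}

dataset_dir = Path("dataset")  # placeholder path
dataset_dir.mkdir(exist_ok=True)

for image_name, caption in captions.items():
    caption_file = dataset_dir / Path(image_name).with_suffix(".txt").name
    caption_file.write_text(caption + "\n", encoding="utf-8")
    print(f"{caption_file}: {caption}")
```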
-
Also worth adding that so many models are the result of merging fine-tuned iterations of other merges, each of which may have used a totally different approach to captioning along the way, which creates a totally inconsistent mess. If you are working with community content, there is no way around trial and error.
-
This is true, and it is why I avoid merged models, especially for training. The inconsistency and mess is clearest in models which mix realism and non-realism. Generally all SDXL models - at least those not overfit to the degree that you can't prompt a circle without irrelevant stuff getting added to it - can produce the baseline "realism", while the style is generally hidden behind layers of convolution. Unless you specifically know how to call that into your training, you will fail at achieving exactly the results you want. SDXL models are overall cleaner and better performing than, say, 1.x or 2.x, but they have a lot more complexity in them. If I struggle to train something, I fall back to the baseline SDXL model and see whether I can do it there. If I can't do it there, I probably need to change my approach.

I eventually achieved my long-term, absolutely nonsensical goal of "angry (politician) as a big diaper-wearing toddler throwing a tantrum", the classic caricature trope. But it took me a long time to figure out the exact problem I was having, which basically comes down to the dataset the text and unet layers were trained on. Terms like "toddler", "diaper" and "tantrum" have A LOT of stock-photo baggage and SEO/clickbait nonsense tied to them, and they force in lots of irrelevant things which the AI can't figure out. From this I realised that I needed to train the overall scene; the "clothing" (the diaper couldn't be trained as a diaper, so it had to be trained as "underpants" or such); and then the tantrum as a third element, to prevent the AI from scaling the subject down into an actual toddler (which produced many grotesque horrors). After I managed to get the scene to work, I then had to turn it into a classic 1900s newspaper illustration, which honestly was just a lot of figuring out how to prompt (which I'm bad at). That little project spanned three versions (I started in 1.4) before it finally came to desirable results, but the lessons are still written down in my notebook. The most important of them is that even though the models use English words, they do not speak or understand English. Once you realise this, everything becomes so much easier.

Another thing you can use is some of the more common major languages, which often exist alongside English in the training data. I thought I could be clever and get around polluted terms by using Finnish words, but I then realised that many Finnish words resemble Hindi/Indian terms, which polluted things again. HOWEVER, this made me realise that I can leverage concepts which are in the model but might not be in English.

So going back to the OP's question: figure out the model's preferred language. And here we speak of language as a broader concept, to the degree that we intersect with cultural studies. If you think of the model's text space as a culture, it becomes way easier to understand how to navigate it... And sometimes building an Excel spreadsheet where you string prompts together and brute force things helps to decipher this, as in the sketch below.
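That "spreadsheet of brute-forced prompts" can also be generated with a few lines of scripting. This is only a sketch, not part of any particular tool: it enumerates combinations of candidate wordings so you can run each one through your usual generation pipeline and note which phrasing the model actually responds to. The term lists and the `prompt_grid.csv` file name are made-up examples.

```python
import csv
import itertools

# Candidate wordings to probe; swap in the synonyms you are actually testing.
subjects = ["toddler", "small child", "baby"]
garments = ["diaper", "underpants", "training pants"]
actions = ["throwing a tantrum", "crying on the floor", "stomping angrily"]

rows = []
for subject, garment, action in itertools.product(subjects, garments, actions):
    prompt = f"{subject} wearing {garment}, {action}, 1900s newspaper illustration"
    rows.append({"subject": subject, "garment": garment, "action": action, "prompt": prompt})

# Write a grid you can paste into a spreadsheet and annotate with results.
with open("prompt_grid.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["subject", "garment", "action", "prompt"])
    writer.writeheader()
    writer.writerows(rows)

print(f"wrote {len(rows)} prompt variants to prompt_grid.csv")
```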
-
Hello. When training a LoRA or Dreambooth using captions, every time I use a comma in the caption, the text before the comma is listed as a separate entry under "ss_tag_frequency". So when listing things that belong to the same group - for example, if there are three marbles in the image and I want to list their colors as "three marbles colored red, blue and green" - the comma after "red" causes "three marbles colored red" and "blue and green" to be listed separately under "ss_tag_frequency". Since "blue and green" is meaningless without the beginning of the sentence, is it better practice to caption it without the comma, as "three marbles colored red blue and green", and only use a comma at the end of the sentence, before starting to describe another aspect of the image?
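As far as I can tell, the behaviour being described comes from the training script splitting each caption on commas and counting the fragments as tags. The sketch below is an assumption about that counting logic to illustrate the effect, not the actual sd-scripts code.

```python
from collections import Counter

def tag_frequency(captions):
    """Count comma-separated fragments the way a tag-frequency summary would."""
    counter = Counter()
    for caption in captions:
        for tag in caption.split(","):
            tag = tag.strip()
            if tag:
                counter[tag] += 1
    return counter

# With a comma, the sentence is broken into two unrelated "tags":
print(tag_frequency(["three marbles colored red, blue and green"]))
# Counter({'three marbles colored red': 1, 'blue and green': 1})

# Without the comma, the whole statement stays together as one tag:
print(tag_frequency(["three marbles colored red blue and green"]))
# Counter({'three marbles colored red blue and green': 1})
```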