Replies: 3 comments 1 reply
-
The text component in these models doesn't understand grammar, and this is one of the big problems you face in both generation and training. It helps to think of it as a separate language from English that just happens to use English words.

The best way to caption things is to treat everything as simple statements of fact, in the manner of: what does this picture contain? Answer: "three marbles", "blue marble", "green marble", "red marble". Which gives you the caption: "three marbles, blue marble, green marble, red marble." The model has most definitely seen "red marble" as a separate thing, so it can understand that for sure. Throw grammar out of the window when writing captions. Finding the best and most effective way of captioning (which changes with every model variant and version) is just trial and error, because it is pointless trying to describe something to the AI in terms the AI has no context for.

Along with this, it is good to avoid terms that are "polluted" by SEO/clickbait nonsense, for example "diaper", "gag", "sleep", "lie", "lay", "underwear", "bedding", etc. If you struggle to train something, figure out another term for what you want to train. To check whether a term is polluted, throw it into Google and see how many totally irrelevant Amazon/eBay/Wish/Alibaba listings it pulls up. For example, I struggled to train a specific scar pattern while keeping the material style, and only when I trained it as "wearing (scar)", as if it were a shirt, did I get it to work exactly how I wanted.

These days I generally avoid using captions in training, and I don't know how booru models work to begin with, so take this with a truckload of salt. But before I train anything, with or without captions, I test the model I want to train on extensively, in a specific manner, to figure out the "language" it prefers.
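For anyone unsure how such fact-style captions are usually attached to a dataset: many training scripts (kohya's sd-scripts among them) read a plain-text caption file with the same basename as each image. Below is a minimal sketch of writing captions in that style; the folder name and image file names are placeholders, not anything from this thread.

```python
from pathlib import Path

# Hypothetical layout: each image gets a .txt file with the same basename.
# Captions are plain "statements of fact", comma-separated, no grammar.
captions = {
    "marbles_001.png": "three marbles, blue marble, green marble, red marble",
    "marbles_002.png": "two marbles, red marble, green marble, wooden table",
}

dataset_dir = Path("dataset")  # placeholder path
dataset_dir.mkdir(exist_ok=True)

for image_name, caption in captions.items():
    caption_file = dataset_dir / Path(image_name).with_suffix(".txt").name
    caption_file.write_text(caption + "\n", encoding="utf-8")
    print(f"{caption_file}: {caption}")
```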
-
Also worth adding that so many models are the result of merging fine-tuned iterations of other merges, each of which may have used a totally different approach to captioning along the way, which creates a totally inconsistent mess. If you are working with community content, there is no way around trial and error.
-
This is true, and it is why I avoid merged models, especially for training. The inconsistency and mess is clearest in models which mix realism and non-realism. Generally all SDXL models - at least those not overfit to the degree that you can't prompt a circle without irrelevant stuff getting added to it - can produce the baseline "realism", while the style is generally hidden behind layers of convolution. Unless you specifically know how to call that into your training, you will fail at achieving exactly the results you want. SDXL models are overall cleaner and better performing than, say, 1.x or 2.x, but they have a lot more complexity in them. If I struggle to train something, I fall back to the baseline SDXL model and see whether I can do it there. If I can't do it there, I probably need to change my approach.

I eventually achieved my long-term, absolutely nonsensical goal of "angry (politician) as a big diaper-wearing toddler throwing a tantrum", the classic caricature trope. But it took me a long time to figure out the exact problem I was having, which basically comes down to the dataset the text and unet layers were trained on. Terms like "toddler", "diaper" and "tantrum" have A LOT of stock-photo baggage and SEO/clickbait nonsense tied to them, and they force in lots of irrelevant things which the AI can't figure out. From this I realised that I needed to train the overall scene; the "clothing" (the diaper couldn't be trained as a diaper, so it had to be trained as "underpants" or such); and then the tantrum as a third element, to prevent the AI from scaling the subject down into an actual toddler (which produced many grotesque horrors). After I managed to get the scene to work, I then had to turn it into a classic 1900s newspaper illustration, which honestly was just a lot of figuring out how to prompt (which I'm bad at). That little project spanned three versions (I started in 1.4) before it finally came to desirable results, but the lessons are still written down in my notebook. The most important of them is that even though the models use English words, they do not speak or understand English. Once you realise this, everything becomes so much easier.

Another thing you can use is some of the more common major languages, which often exist alongside English in the training data. I thought I could be clever and get around polluted terms by using Finnish words, but I then realised that many Finnish words resemble Hindi/Indian terms, which polluted things again. HOWEVER, this made me realise that I can leverage concepts which are in the model but might not be in English.

So going back to the OP's question: figure out the model's preferred language. And here we speak of language as a broader concept, to the degree that we intersect with cultural studies. If you think of the model's text space as a culture, it becomes way easier to understand how to navigate it... And sometimes building an Excel spreadsheet where you string prompts together and brute force things helps to decipher this, as in the sketch below.
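That "spreadsheet of brute-forced prompts" can also be generated with a few lines of scripting. This is only a sketch, not part of any particular tool: it enumerates combinations of candidate wordings so you can run each one through your usual generation pipeline and note which phrasing the model actually responds to. The term lists and the `prompt_grid.csv` file name are made-up examples.

```python
import csv
import itertools

# Candidate wordings to probe; swap in the synonyms you are actually testing.
subjects = ["toddler", "small child", "baby"]
garments = ["diaper", "underpants", "training pants"]
actions = ["throwing a tantrum", "crying on the floor", "stomping angrily"]

rows = []
for subject, garment, action in itertools.product(subjects, garments, actions):
    prompt = f"{subject} wearing {garment}, {action}, 1900s newspaper illustration"
    rows.append({"subject": subject, "garment": garment, "action": action, "prompt": prompt})

# Write a grid you can paste into a spreadsheet and annotate with results.
with open("prompt_grid.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["subject", "garment", "action", "prompt"])
    writer.writeheader()
    writer.writerows(rows)

print(f"wrote {len(rows)} prompt variants to prompt_grid.csv")
```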
-
Hello. When training a LoRA or Dreambooth using captions, every time I use a comma in the caption, the text before the comma is listed as a separate entry under "ss_tag_frequency". So when listing things that belong to the same group - for example, if there are three marbles in the image and I want to list their colors as "three marbles colored red, blue and green" - the comma after "red" causes "three marbles colored red" and "blue and green" to be listed separately under "ss_tag_frequency". Since "blue and green" is meaningless without the beginning of the sentence, is it better practice to caption it without the comma, as "three marbles colored red blue and green", and only use a comma at the end of the sentence, before starting to describe another aspect of the image?
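As far as I can tell, the behaviour being described comes from the training script splitting each caption on commas and counting the fragments as tags. The sketch below is an assumption about that counting logic to illustrate the effect, not the actual sd-scripts code.

```python
from collections import Counter

def tag_frequency(captions):
    """Count comma-separated fragments the way a tag-frequency summary would."""
    counter = Counter()
    for caption in captions:
        for tag in caption.split(","):
            tag = tag.strip()
            if tag:
                counter[tag] += 1
    return counter

# With a comma, the sentence is broken into two unrelated "tags":
print(tag_frequency(["three marbles colored red, blue and green"]))
# Counter({'three marbles colored red': 1, 'blue and green': 1})

# Without the comma, the whole statement stays together as one tag:
print(tag_frequency(["three marbles colored red blue and green"]))
# Counter({'three marbles colored red blue and green': 1})
```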