What are tokens in AI image generation? (BetterWaifu & Stable Diffusion)
By gerogero
Updated: December 29, 2024
The Power of Words
The words we use in prompting are called tokens. Each token has its own power, which depends on its frequency of occurrence in the dataset used for training the AI. We’ll get into the technical details in a bit, but first we’ll focus on how to think about tokens practically.
Let’s take a look at an example: I prompt “1girl wearing white oversized coat with >< and outstretched arms”. If each word were one token, we would have 10 tokens; in practice the tokenizer splits “1girl” into two tokens (“1” and “girl”), which brings us to 11.
(When typing this prompt into BetterWaifu, you’ll notice the token counter displays 13. This is because <start> and <end> tokens are added by the AI under the hood. More on this in the next section.)
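If you want to check token counts yourself, here is a quick sketch using the CLIP tokenizer from Hugging Face’s transformers library, the tokenizer Stable Diffusion uses. BetterWaifu’s counter may be built on a different tokenizer, so the exact count it reports can differ:

```python
# Count the tokens in a prompt, including the <start>/<end> tokens that the
# tokenizer adds automatically. Word splits may surprise you: subword (BPE)
# tokenization can break a single word into several tokens.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

prompt = "1girl wearing white oversized coat with >< and outstretched arms"
ids = tokenizer(prompt).input_ids
print(len(ids))                              # total count with <start>/<end>
print(tokenizer.convert_ids_to_tokens(ids))  # see exactly how words split
```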
Prepositions like “with”, “for”, “at”, and “in”, and particles like “and”, “a”, and “to” count as tokens. Thus, they also have power and influence the image.
Let’s try the same prompt without any particles or prepositions: “1girl, white oversized coat, ><, outstretched arms”. Commas count as tokens too, so we are still at 11 tokens, or 13 with the start and end tokens. The result is quite different:
I like this result better because it looks like more attention was put on the coat. The comma token demarcates different concepts.
Now, while using comma-separated tags is my recommended approach to prompting on BetterWaifu, that doesn’t mean removing all prepositions and particles always makes the result better. Sometimes, it’s important to use prepositions to indicate relative positions.
Here’s a simple example of when a preposition makes all the difference:
In addition to a token’s inherent strength, its position in the prompt matters: tokens at the beginning carry more weight than tokens at the end. It’s important to understand this, as a weak token at the end of the prompt may have no impact on the image, while a strong token at the beginning can completely determine the outcome.
To control the strength of a token, you can use the construction (token:1.0), where the number represents the strength of the token: 0 means no influence, and 1 is the normal weight. I usually don’t go past 1.5. Experimenting with different strength values can help you fine-tune how much influence each token has in your prompts.
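For the curious, here is roughly how this (token:weight) syntax is commonly implemented in Stable Diffusion front ends. This is a simplified sketch for illustration, not BetterWaifu’s actual code: the prompt is parsed into weighted pieces, and each piece’s token embeddings are later multiplied by its weight.

```python
import re

# Hypothetical, simplified parser for "(text:weight)" emphasis syntax.
# Real implementations also handle nesting and bare parentheses.
WEIGHT_PATTERN = re.compile(r"\(([^:()]+):([\d.]+)\)")

def parse_weights(prompt: str) -> list[tuple[str, float]]:
    """Split a prompt into (text, weight) pieces; unweighted text gets 1.0."""
    pieces = []
    last = 0
    for match in WEIGHT_PATTERN.finditer(prompt):
        if match.start() > last:
            pieces.append((prompt[last:match.start()], 1.0))
        pieces.append((match.group(1), float(match.group(2))))
        last = match.end()
    if last < len(prompt):
        pieces.append((prompt[last:], 1.0))
    return pieces

print(parse_weights("1girl, (white oversized coat:1.3), outstretched arms"))
# [('1girl, ', 1.0), ('white oversized coat', 1.3), (', outstretched arms', 1.0)]
```

In front ends that use this scheme, the text encoder’s embedding for each token in a weighted piece is scaled by that weight, which is what makes the concept pull harder (or softer) on the image.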
Technical Explanation
Tokenizing is a standard way to handle text in AI systems: it converts text into numbers that a neural network can process.
Stable Diffusion tokenizes a text prompt into a sequence of tokens. For example, it splits the text prompt a cute and adorable bunny into the tokens a, cute, and, adorable, and bunny. Then Stable Diffusion adds <start> and <end> tokens at the beginning and the end of the tokens.
The resulting token sequence for the above example would be <start>, a, cute, and, adorable, bunny, and <end> (7 tokens).
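You can reproduce this with the CLIP tokenizer used by Stable Diffusion v1.x models, via Hugging Face’s transformers library (a quick sketch):

```python
# Tokenize the example prompt the way Stable Diffusion's CLIP text encoder
# sees it. The tokenizer adds the <start>/<end> special tokens itself.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

ids = tokenizer("a cute and adorable bunny").input_ids
print(tokenizer.convert_ids_to_tokens(ids))
# Expected: ['<|startoftext|>', 'a</w>', 'cute</w>', 'and</w>',
#            'adorable</w>', 'bunny</w>', '<|endoftext|>'] -> 7 tokens
```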
For easier computation, Stable Diffusion pads or truncates every prompt’s token sequence to a fixed length of 77. If the input prompt has fewer than 77 tokens, <end> tokens are added to the end of the sequence under the hood until it reaches 77 tokens.
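The padding is something the tokenizer can do for you as well. A short sketch, assuming the same CLIP tokenizer as above, whose pad token is the <end> token:

```python
# Pad the same prompt out to the fixed length of 77. CLIP's pad token is
# <|endoftext|>, so padding appends <end> ids, as described above.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

padded = tokenizer(
    "a cute and adorable bunny",
    padding="max_length",
    max_length=77,
    truncation=True,  # prompts longer than 77 tokens get cut off here
).input_ids
print(len(padded))  # 77
```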
The length of 77 was set to balance performance and computational efficiency. Different software handles prompts longer than 77 tokens differently:
- The first 77 tokens are retained and the rest are cut out.
- The entire prompt is broken into chunks of 75 tokens, start and end tokens are added to each chunk, and the chunks are processed in order. This is the method BetterWaifu uses; a sketch of it follows.
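Here is a minimal sketch of that chunking approach, assuming a CLIP tokenizer. It illustrates the idea; BetterWaifu’s actual implementation may differ in its details:

```python
# Break a long prompt into chunks of at most 75 token ids, wrap each chunk
# in <start>/<end> ids, and pad every chunk to exactly 77. Each 77-id chunk
# can then be run through the text encoder separately.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
START_ID = tokenizer.bos_token_id
END_ID = tokenizer.eos_token_id

def chunk_prompt(prompt: str, chunk_size: int = 75) -> list[list[int]]:
    """Split a prompt into 77-id chunks: <start> + up to 75 ids + <end> padding."""
    ids = tokenizer(prompt, add_special_tokens=False).input_ids
    chunks = []
    for i in range(0, len(ids), chunk_size):
        chunk = ids[i : i + chunk_size]
        # pad with <end> ids so every chunk is exactly 77 long
        chunks.append([START_ID] + chunk + [END_ID] * (chunk_size + 1 - len(chunk)))
    return chunks

long_prompt = ", ".join(["masterpiece", "best quality"] * 40)  # well over 75 tokens
print([len(c) for c in chunk_prompt(long_prompt)])  # every chunk has length 77
```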