What are tokens in AI image generation? (Stable Diffusion)
By gerogero
Updated: December 29, 2024
To understand tokens, we have to understand a bit about how Stable Diffusion works.
Stable Diffusion first converts the text prompt into a text representation: numerical values that summarize the prompt. The text representation is used to generate an image representation, which is then upscaled into a high-resolution image.
These steps are summarized as follows:
- Text Representation Generation: Stable Diffusion converts a text prompt into a text vector representation (Tokenization and Text encoding)
- Image Representation Refining: Starting with random noise, Stable Diffusion refines the image representation little by little, with the guidance of the text representation. Stable Diffusion repeats this refining over multiple timesteps (for example, 50 in Diffusion Explainer).
- Image Upscaling: Stable Diffusion upscales the image representation into a high-resolution image.
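For readers who want to see these three steps end to end, here is a minimal sketch using the Hugging Face diffusers library. The library, the GPU, and the checkpoint name are assumptions; any Stable Diffusion v1.x checkpoint should behave the same way.

```python
# Minimal end-to-end sketch of the pipeline described above, using the
# Hugging Face diffusers library (assumed installed; GPU recommended).
import torch
from diffusers import StableDiffusionPipeline

# The checkpoint name is an assumption; any Stable Diffusion v1.x model works.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a cute and adorable bunny",
    num_inference_steps=50,  # number of refinement timesteps
    guidance_scale=7.5,      # prompt adherence, discussed under Noise Prediction
).images[0]
image.save("bunny.png")
```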
Tokenizing
Tokenizing is a common way to handle text data: it converts the text into numbers that neural networks can process.
Stable Diffusion tokenizes a text prompt into a sequence of tokens. For example, it splits the text prompt a cute and adorable bunny into the tokens a, cute, and, adorable, and bunny. To mark the beginning and end of the prompt, Stable Diffusion also adds <start> and <end> tokens.
The resulting token sequence for the above example would be <start>, a, cute, and, adorable, bunny, and <end> (7 tokens).
For easier computation, Stable Diffusion pads or truncates the token sequence of every prompt to a fixed length of 77. If the prompt has fewer than 77 tokens, <end> tokens are appended under the hood until the sequence reaches 77 tokens.
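As a concrete illustration, the snippet below runs the example prompt through the CLIP tokenizer used by Stable Diffusion v1.x, loaded here via Hugging Face transformers. The checkpoint name is an assumption, and the exact token splits and IDs depend on that checkpoint's vocabulary.

```python
from transformers import CLIPTokenizer

# The CLIP tokenizer used by Stable Diffusion v1.x (checkpoint name is an assumption).
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

print(tokenizer.tokenize("a cute and adorable bunny"))
# e.g. ['a</w>', 'cute</w>', 'and</w>', 'adorable</w>', 'bunny</w>']

# Pad/truncate to the fixed length of 77: <start> + prompt tokens + <end> padding.
encoded = tokenizer(
    "a cute and adorable bunny",
    padding="max_length",
    max_length=tokenizer.model_max_length,  # 77
    truncation=True,
)
print(len(encoded.input_ids))                           # 77
print(tokenizer.bos_token_id, tokenizer.eos_token_id)   # ids of <start> and <end>
```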
The length of 77 was chosen to balance performance and computational efficiency. Different software behaves differently when a prompt uses more than 77 tokens:
- The first 77 tokens are retained and the rest are cut out.
- The entire prompt is broken into chunks of 75 tokens, start and end tokens are added to each chunk, and each chunk is processed in order (sketched below). This is the method BetterWaifu uses.
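The chunking idea can be sketched as follows. This is a hypothetical illustration of the general approach, not BetterWaifu's actual implementation; `start_id` and `end_id` stand for the IDs of the <start> and <end> tokens.

```python
def chunk_token_ids(prompt_ids, start_id, end_id, chunk_size=75):
    """Split prompt token ids into chunks of 75 and wrap each chunk with
    <start>/<end> so every chunk is exactly 77 ids long."""
    chunks = []
    for i in range(0, len(prompt_ids), chunk_size):
        chunk = prompt_ids[i:i + chunk_size]
        # The final chunk is padded with <end> ids up to the full length of 77.
        chunks.append([start_id] + chunk + [end_id] * (chunk_size + 1 - len(chunk)))
    return chunks  # each chunk is then encoded separately, in order
```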
Read on to find out more about the image generation process.
Text encoding
Stable Diffusion converts the token sequence into a text representation. To use the text representation for guiding image generation, Stable Diffusion ensures that it contains information about the image described by the prompt. This is done using a special neural network called CLIP.
CLIP, which consists of an image encoder and a text encoder, is trained to encode an image and its text description into vectors that are similar to each other. Therefore, the text representation for a prompt computed by CLIP’s text encoder is likely to contain information about the images described in the prompt.
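Concretely, the text encoder maps the 77 token IDs to 77 vectors. Below is a minimal sketch with Hugging Face transformers, assuming the same CLIP checkpoint as above; the 768-dimensional vector size is specific to that encoder.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

ids = tokenizer(
    "a cute and adorable bunny",
    padding="max_length", max_length=77, truncation=True, return_tensors="pt",
).input_ids

with torch.no_grad():
    text_representation = text_encoder(ids).last_hidden_state

print(text_representation.shape)  # torch.Size([1, 77, 768]): one vector per token
```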
Noise Prediction
At each timestep, a neural network called UNet predicts noise in the image representation of the current timestep. UNet takes three inputs:
- Image representation of the current timestep
- Text representation of the prompt to guide what noise should be removed from the current image representation to generate an image adhering to the text prompt
- Timestep to indicate the amount of noise remaining in the current image representation
In other words, UNet predicts the prompt-conditioned noise in the current image representation, guided by the text prompt’s representation and the timestep.
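A single noise-prediction call looks roughly like this with diffusers' UNet. The checkpoint name is an assumption, `text_representation` is the encoder output from the previous snippet, and the 4×64×64 latent shape corresponds to a 512×512 image in Stable Diffusion v1.x.

```python
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet"
)

latents = torch.randn(1, 4, 64, 64)  # image representation at the current timestep
timestep = torch.tensor(981)         # indicates how much noise remains

with torch.no_grad():
    noise_pred = unet(
        latents,                                    # input 1: current image representation
        timestep,                                   # input 3: current timestep
        encoder_hidden_states=text_representation,  # input 2: text representation of the prompt
    ).sample                                        # predicted noise, same shape as latents
```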
However, even though we condition the noise prediction with the text prompt, the generated image representation usually does not adhere strongly enough to the text prompt. To improve the adherence, Stable Diffusion measures the impact of the prompt by additionally predicting generic noise conditioned on an empty prompt ("") and subtracting it from the prompt-conditioned noise:
impact of prompt = prompt-conditioned noise – generic noise
In other words, the generic noise contributes to better image quality, while the impact of the prompt contributes to the adherence to the prompt. The final noise is a weighted sum of them controlled by a value called guidance scale:
final noise = generic noise + guidance scale × impact of prompt
A guidance scale of 0 means no adherence to the text prompt, while a guidance scale of 1 means using the original prompt-conditioned noise. Larger guidance scales result in stronger adherence to the text prompt, while values that are too large can lower the image quality. You can change the guidance scale value in Diffusion Explainer and see how it changes the generated images.
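In code, this classifier-free guidance step is just the two formulas above. The sketch reuses `unet`, `latents`, and `timestep` from the previous snippet; `empty_representation` is an assumed variable standing for the text encoder's output for the empty prompt "", computed exactly like `text_representation`.

```python
guidance_scale = 7.5  # 0 = ignore the prompt, 1 = raw prompt-conditioned noise

with torch.no_grad():
    prompt_noise  = unet(latents, timestep, encoder_hidden_states=text_representation).sample
    generic_noise = unet(latents, timestep, encoder_hidden_states=empty_representation).sample

impact_of_prompt = prompt_noise - generic_noise
final_noise = generic_noise + guidance_scale * impact_of_prompt
```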
Noise Removal
Stable Diffusion then decides how much of the predicted noise to actually remove from the image, as determined by an algorithm called the scheduler. Removing small amounts of noise at a time helps refine the image gradually and produce sharper images.
The scheduler makes this decision by accounting for the total number of timesteps, scaling down the predicted noise accordingly. The downscaled noise is then subtracted from the image representation of the current timestep to obtain the refined representation, which becomes the image representation of the next timestep:
image representation of timestep t+1 = image representation of timestep t – downscaled noise
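With diffusers, this downscale-and-subtract step is handled by the scheduler's step() method. A sketch, assuming the `final_noise` and `latents` from the snippets above; the DDIM scheduler class and checkpoint name are assumptions, and other schedulers expose the same interface.

```python
from diffusers import DDIMScheduler

scheduler = DDIMScheduler.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="scheduler"
)
scheduler.set_timesteps(50)  # total number of refinement timesteps

t = scheduler.timesteps[0]   # current timestep (timesteps run from noisy to clean)
# Subtract the appropriately scaled predicted noise to get the next image representation.
latents = scheduler.step(final_noise, t, latents).prev_sample
```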