Introduction

Stable Diffusion is a model that generates images from text descriptions (known as prompts). In this guide I will show you:

  • How to generate an image using the Pipeline API.
  • What the Pipeline API parameters mean and how to adjust them to make better images.
  • How to get great images by tuning the text prompts.
  • How to add a final touch to bring your image to the next level.

Hopefully by the end of this tutorial, you will be able to generate high quality images by yourself!

Setup

To begin our journey, we first need to install a few packages.

  • diffusers The main package we use for generating images.
  • transformers The package that encodes text into embeddings.
  • A few other supporting packages that the above two depend on.
!pip install -Uqq diffusers transformers ftfy scipy accelerate gradio xformers triton==2.0.0.dev20221120
import pathlib
import huggingface_hub

# Log in to the Hugging Face Hub if no token has been saved yet.
if not pathlib.Path('/root/.huggingface/token').exists():
  huggingface_hub.notebook_login()

Utility Functions

import PIL
import math

def image_grid(imgs, rows=None, cols=None) -> PIL.Image.Image:
  """Paste a list of equally sized PIL images into a single grid image."""
  n_images = len(imgs)
  # If neither dimension is given, aim for a roughly square grid.
  if not rows and not cols:
    cols = math.ceil(math.sqrt(n_images))
  # Derive the missing dimension from the one that was given.
  if not rows:
    rows = math.ceil(n_images / cols)
  if not cols:
    cols = math.ceil(n_images / rows)

  w, h = imgs[0].size
  grid = PIL.Image.new('RGB', size=(cols*w, rows*h))

  # Paste images left to right, top to bottom.
  for i, img in enumerate(imgs):
      grid.paste(img, box=(i%cols*w, i//cols*h))
  return grid
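
As a quick sanity check, here is a small usage sketch with solid-color placeholder images (the swatches variable is made up for illustration):

# Tile four solid-color placeholder images into a 2x2 grid.
swatches = [PIL.Image.new('RGB', (64, 64), color=c)
            for c in ('red', 'green', 'blue', 'yellow')]
display(image_grid(swatches, rows=2, cols=2))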

The Text to Image Pipeline

First let's try to generate some images using the diffusers pipeline. A pipeline is an encapsulation class introduced by Hugging Face. It usually bundles multiple models and the connections between them to provide an easy-to-use API that produces output from input.

Think of the StableDiffusionPipeline as a black box like this:

#@markdown Pipeline

import graphviz

dot = graphviz.Digraph('pipeline', comment='Pipeline')
dot.node('text', 'Text')
dot.node('pipeline', 'Pipeline', shape='box')
dot.node('image', 'Image')
dot.edge('pipeline', 'image', constraint='false')
dot.edge('text', 'pipeline', constraint='false')
display(dot)
(Diagram: Text → Pipeline → Image)

Initializing the Pipeline

There are two important parameters to initialize a Pipeline: a model_id and a revision.

  • The model_id can be either a Hugging Face model id or a path in your file system. To find ids to use, go to the Hugging Face Hub and look for the model id at the top of the model page. Later, when we train our own models, we will pass in the path to our trained model.
  • The revision can be set to "fp16" to save GPU memory by downloading 16-bit weights.
import torch
from diffusers import StableDiffusionPipeline

model_id = "stabilityai/stable-diffusion-2-base" #@param ["stabilityai/stable-diffusion-2-base", "stabilityai/stable-diffusion-2", "CompVis/stable-diffusion-v1-4", "runwayml/stable-diffusion-v1-5"] {type:"string"}
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
pipe = StableDiffusionPipeline.from_pretrained(model_id, revision="fp16", torch_dtype=torch.float16).to(device)
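
As mentioned above, a pipeline encapsulates several models. A quick way to see what is inside is the pipeline's components property:

# Print the sub-models and helpers bundled inside the pipeline,
# such as the VAE, text encoder, tokenizer, UNet and scheduler.
for name, component in pipe.components.items():
  print(name, type(component).__name__)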

Generating Images with the Text to Image Pipeline

To generate an image from text, we just call the pipe() method. There are a few arguments to the method:

  • prompt A text describing what you want the image to show.
  • negative_prompt A text describing features that you don't want the image to have.
  • generator A random number generator. By default this is None, so the output is random even for the same prompt. Passing a generator with a fixed seed makes experiments repeatable (same input → same output).
  • width and height The dimensions of the output image.
  • guidance_scale A scale determining how strictly the generated image should follow your prompt. In practice I find tuning it not very useful; just use 7.5 and you will be fine.
  • num_inference_steps How many steps to run the diffusion algorithm (explained later). More steps generally mean better image quality, but also more time to generate an image.
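
We will wrap this call in a small helper below, but as a minimal sketch (with a made-up prompt), a direct call looks like this:

# Minimal direct call; the fixed seed makes the result reproducible.
g = torch.Generator(device="cuda").manual_seed(0)
out = pipe("a watercolor painting of a lighthouse at sunset",
           negative_prompt="blurry, low quality",
           width=512, height=512,
           num_inference_steps=30,
           guidance_scale=7.5,
           generator=g)
display(out.images[0])
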
def text2image(pipe,
               prompt: str,
               seed: int,
               return_grid=False,
               grid_size=None,
               **kwargs):
  # Seed the generator on the pipeline's device so results are reproducible.
  generator = torch.Generator(device=pipe.device).manual_seed(seed)

  # Run inference in mixed precision to save memory and time.
  with torch.autocast(pipe.device.type):
    images = pipe(prompt,
                  generator=generator,
                  **kwargs).images

  if len(images) == 1:
    return images[0]
  elif return_grid:
    # grid_size is an optional (rows, cols) tuple for the grid layout.
    rows, cols = grid_size if grid_size else (None, None)
    return image_grid(images, rows=rows, cols=cols)
  else:
    return images

Baseline

First let's try using a prompt with all default parameters:

prompt = "a photo of a woman wearing a red dress"
negative_prompt = ""
num_images_per_prompt = 4
seed = 42
width = height = 512
num_inference_steps = 30
guidance_scale = 7.5


image = text2image(
    pipe,
    prompt=prompt,
    seed=seed,
    return_grid=True,
    width=width,
    height=height,
    num_inference_steps=num_inference_steps,
    guidance_scale=guidance_scale,
    negative_prompt=negative_prompt,
    num_images_per_prompt=num_images_per_prompt)

display(image)
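
From here, a simple experiment is to keep every parameter fixed and vary only the seed; each seed gives a different composition of the same prompt. A sketch:

# Hypothetical experiment: same prompt and settings, different seeds.
for s in (0, 1, 2):
  display(text2image(pipe,
                     prompt=prompt,
                     seed=s,
                     width=width,
                     height=height,
                     num_inference_steps=num_inference_steps,
                     guidance_scale=guidance_scale,
                     negative_prompt=negative_prompt))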