Stable Diffusion from Beginner to Master (1): Pipelines and Prompts
Introduction
Stable Diffusion is a model that generates images from text (a.k.a. prompts). In this guide I will show you:
- How to generate an image using the Pipeline API.
- What the parameters of the Pipeline API mean and how to make better images by adjusting them.
- How to get great images by tuning the text prompts.
- How to add a final touch that brings your image to the next level.
Hopefully by the end of this tutorial, you will be able to generate high quality images by yourself!
Setup
To begin our journey, we first need to install a few packages:
- diffusers: the main package we use for generating images.
- transformers: the package that encodes text into embeddings.
- A few other supporting packages that work with the two above.
!pip install -Uqq diffusers transformers ftfy scipy accelerate gradio xformers triton==2.0.0.dev20221120
import pathlib
import huggingface_hub
if not pathlib.Path('/root/.huggingface/token').exists():
    huggingface_hub.notebook_login()
import PIL
import math
def image_grid(imgs, rows=None, cols=None) -> PIL.Image.Image:
    """Paste a list of equally sized PIL images into a single grid image."""
    n_images = len(imgs)
    # If neither dimension is given, aim for a roughly square grid.
    if not rows and not cols:
        cols = math.ceil(math.sqrt(n_images))
    if not rows:
        rows = math.ceil(n_images / cols)
    if not cols:
        cols = math.ceil(n_images / rows)
    w, h = imgs[0].size
    grid = PIL.Image.new('RGB', size=(cols * w, rows * h))
    for i, img in enumerate(imgs):
        grid.paste(img, box=(i % cols * w, i // cols * h))
    return grid
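As a quick, hypothetical sanity check (the placeholder images below are not part of the original setup), we can tile a few solid-color squares to verify the grid layout:
demo_imgs = [PIL.Image.new('RGB', (64, 64), color=c)
             for c in ['red', 'green', 'blue', 'yellow']]
# Expect a 128x128 image with four colored quadrants arranged in a 2x2 grid.
display(image_grid(demo_imgs))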
The Text to Image Pipeline
First let's try to generate some images using the diffusers pipeline.
A pipeline is an encapsulation class introduced by Hugging Face. It usually wraps multiple models and the connections between them to provide an easy-to-use API that turns input into output.
Think of the StableDiffusionPipeline as a black box like this:
#@markdown Pipeline
import graphviz
dot = graphviz.Digraph('pipeline', comment='Pipeline')
dot.node('text', 'Text')
dot.node('pipeline', 'Pipeline', shape='box')
dot.node('image', 'Image')
dot.edge('pipeline', 'image', constraint='false')
dot.edge('text', 'pipeline', constraint='false')
display(dot)
Initializing the Pipeline
There are two important parameters when initializing a Pipeline: a model_id and a revision.
- The model_id can be either a huggingface model id or a path in your file system. To find which ids to use, go to huggingface and look for the model id at the top of the model page. Later, when we train our own models, we will pass in the path to our trained model.
- The revision can be set to fp16 to save GPU memory by using 16-bit numbers.
import torch
from diffusers import StableDiffusionPipeline
model_id = "stabilityai/stable-diffusion-2-base" #@param ["stabilityai/stable-diffusion-2-base", "stabilityai/stable-diffusion-2", "CompVis/stable-diffusion-v1-4", "runwayml/stable-diffusion-v1-5"] {type:"string"}
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
pipe = StableDiffusionPipeline.from_pretrained(model_id, revision="fp16", torch_dtype=torch.float16).to(device)
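Optionally, diffusers pipelines expose a couple of memory savers. This is a minimal sketch, assuming a recent diffusers release and a working xformers install (which we installed above):
pipe.enable_attention_slicing()  # compute attention in slices to lower peak VRAM usage
try:
    # Uses the xformers package installed earlier; skip gracefully if unavailable.
    pipe.enable_xformers_memory_efficient_attention()
except Exception as err:
    print(f'xformers attention not enabled: {err}')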
Generating images using the text-to-image pipeline
To generate an image from text, we just call the pipe() method. It takes a few arguments:
- prompt: a text describing what you want in the image.
- negative_prompt: a text describing features you do not want the image to have.
- generator: a random number generator. By default the generator is None and the output is random, even for the same prompt. Seeding the generator makes experiments repeatable (same input -> same output).
- width and height: the dimensions of the output image.
- guidance_scale: a scale determining how closely the image matches your prompt. In practice I find it not very useful to tune; just use 7.5 and you will be fine.
- num_inference_steps: how many steps to run the diffusion algorithm (explained later). More steps give better image quality, but also take more time to generate an image.
The text2image helper below wraps these arguments around a single pipe() call:
def text2image(pipe,
               prompt: str,
               seed: int,
               return_grid=False,
               **kwargs):
    # Fix the seed so the same inputs always produce the same images.
    generator = torch.Generator(device=device).manual_seed(seed)
    with torch.autocast("cuda"):
        images = pipe(prompt,
                      generator=generator,
                      **kwargs).images
    if len(images) == 1:
        return images[0]
    elif return_grid:
        return image_grid(images)
    else:
        return images
prompt = "a photo of a woman wearing a red dress"
negative_prompt = ""
num_images_per_prompt = 4
seed = 42
width = height = 512
num_inference_steps = 30
guidance_scale = 7.5
image = text2image(
pipe,
prompt=prompt,
seed=seed,
return_grid=True,
width=width,
height=height,
num_inference_steps=num_inference_steps,
guidance_scale=guidance_scale,
negative_prompt=negative_prompt,
num_images_per_prompt=num_images_per_prompt)
display(image)
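To get a feel for guidance_scale, here is a small hypothetical sweep that reuses the text2image helper with everything else held fixed; only the scale varies:
# Same prompt and seed, varying only guidance_scale: lower values drift from the
# prompt more freely, higher values follow it more literally.
for gs in [3.0, 7.5, 15.0]:
    img = text2image(pipe,
                     prompt=prompt,
                     seed=seed,
                     width=width,
                     height=height,
                     num_inference_steps=num_inference_steps,
                     guidance_scale=gs)
    print(f'guidance_scale = {gs}')
    display(img)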