05b – THE BREAKTHROUGHS
DDPM and Stable Diffusion – What Are They, Really?
DDPM – The Moment It All Clicked (2020)
The idea of reversing diffusion had existed since 2015, but the results were blurry and unconvincing. Then in 2020, Jonathan Ho, Ajay Jain, and Pieter Abbeel at UC Berkeley published a paper called "Denoising Diffusion Probabilistic Models" – DDPM for short – and everything changed.
Their insight was beautifully simple: don't try to predict the clean image directly. Instead, just predict the noise.
Here's what that means. Take a photo. Add a known amount of Gaussian noise to it. Now show the noisy result to a neural network and ask: "What noise was added?" If the network can answer that correctly, you can subtract the predicted noise and recover a slightly cleaner image. Repeat this 1,000 times, starting from pure static, and a brand-new image appears from nothing.
The training objective turned out to be absurdly simple – just a mean squared error:
Loss = ‖ ε − ε_θ(x_t, t) ‖²
In plain English: measure how far off the network's noise prediction (ε_θ) is from the actual noise (ε) that was added. That's it. This one equation trained a model that produced images rivaling GANs for the first time – and it was far more stable to train.
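To make this concrete, here is a minimal training-step sketch in PyTorch. It is an illustration, not the paper's code: `model` is a hypothetical stand-in for the U-Net, taking a noisy image and a timestep and returning a noise prediction of the same shape. It uses DDPM's closed-form shortcut for jumping straight to any noise level in one step.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)             # linear noise schedule from the paper
alphas_bar = torch.cumprod(1.0 - betas, dim=0)    # cumulative signal-retention factors

def training_loss(model, x0):
    """One DDPM training step on a batch of clean images x0."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                 # a random timestep per image
    eps = torch.randn_like(x0)                    # the noise we add (and must recover)
    ab = alphas_bar[t].view(b, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps  # noisy image, in one closed-form step
    eps_pred = model(x_t, t)                      # "what noise was added?"
    return F.mse_loss(eps_pred, eps)              # the one-line objective above
```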
The catch? DDPM was slow. Generating one image required running the neural network 1,000 times in sequence, each time denoising a little more. A single image could take minutes on a powerful GPU.
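The slowness is visible in the sampling loop itself: every one of the 1,000 steps needs a fresh forward pass, and each step depends on the previous one, so nothing can run in parallel. A sketch, reusing the schedule variables from the training snippet above (it takes the per-step variance equal to the schedule value, one of the options the paper allows):

```python
@torch.no_grad()
def ddpm_sample(model, shape):
    """Generate images with 1,000 sequential network calls."""
    x = torch.randn(shape)                               # start from pure static
    for t in reversed(range(T)):
        eps_pred = model(x, torch.full((shape[0],), t))  # predict the added noise
        alpha, ab = 1.0 - betas[t], alphas_bar[t]
        # Remove this step's share of the predicted noise
        x = (x - betas[t] / (1 - ab).sqrt() * eps_pred) / alpha.sqrt()
        if t > 0:                                        # re-inject a little noise, except at the end
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x
```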
Stable Diffusion – Making It Fast and Free (2022)
Stable Diffusion is the name of a specific open-source model created by a team at LMU Munich (Robin Rombach, Andreas Blattmann, and others) in collaboration with Stability AI. It solved DDPM's speed problem with one brilliant idea:
Don't diffuse in pixel space. Diffuse in a compressed space – the machine's unconscious.
A 512×512 image has 786,432 pixel values. Running diffusion on all of them is expensive. So Stable Diffusion first compresses the image into a tiny latent representation – just 64×64×4 = 16,384 numbers – using a pre-trained autoencoder. Then it runs the diffusion process (forward and reverse) entirely in this compressed space. Finally, it decodes the result back into a full image.
[Diagram: 512×512 image → encode → 64×64 latent (48× smaller) → diffuse → 64×64 denoised latent → decode → 512×512 image]
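In code, generation becomes the same reverse loop as DDPM, just run on a 64×64×4 tensor and steered by a text embedding. A sketch reusing the schedule variables from the DDPM snippet; `unet` and `vae_decode` are hypothetical stand-ins for the pre-trained denoiser and the autoencoder's decoder:

```python
@torch.no_grad()
def latent_sample(unet, vae_decode, text_emb):
    """Stable Diffusion-style loop: diffuse 16,384 numbers instead of 786,432."""
    z = torch.randn(1, 4, 64, 64)                  # noise in latent space, not pixel space
    for t in reversed(range(T)):
        eps = unet(z, t, text_emb)                 # hypothetical text-conditioned U-Net
        alpha, ab = 1.0 - betas[t], alphas_bar[t]
        z = (z - betas[t] / (1 - ab).sqrt() * eps) / alpha.sqrt()
        if t > 0:
            z = z + betas[t].sqrt() * torch.randn_like(z)
    return vae_decode(z)                           # tiny latent back to a full 512×512 image
```

(In practice the released model also swaps in a faster sampler such as DDIM, so this loop runs for around 50 steps rather than 1,000.)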
This made generation ~50 times faster. But there is something poetic about this latent space: it is a vast mathematical ocean where every concept humanity has ever photographed – every face, landscape, animal, texture – exists as a point. It is, in a very real sense, the machine's collective unconscious.

But that's only half the story. Stable Diffusion also added a text encoder (CLIP) that understands language. When you type "a cat sitting on a mountain at sunset," CLIP translates those words into a numerical vector that guides the denoising process at every step – like dropping a pebble into the machine's unconscious ocean, creating ripples that steer the noise toward an image matching your words. The prompt is the trigger. The latent space is the unconscious. The denoised image is the dream.
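That "numerical vector" is easy to inspect. A short sketch using the Hugging Face transformers library and the CLIP text encoder that Stable Diffusion v1 ships with (the checkpoint id is an assumption based on the public release):

```python
from transformers import CLIPTokenizer, CLIPTextModel

# CLIP text encoder used by Stable Diffusion v1 (checkpoint id may vary)
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer(
    "a cat sitting on a mountain at sunset",
    padding="max_length", max_length=77, return_tensors="pt",
)
text_emb = encoder(tokens.input_ids).last_hidden_state
print(text_emb.shape)  # torch.Size([1, 77, 768]): the ripples that steer every denoising step
```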
The result: type a sentence, wait a few seconds, get a photorealistic image. And because Stability AI released the model weights openly, anyone in the world could use it, modify it, and build on it – triggering the explosion of AI-generated art you see today.
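Because the weights are public, the whole pipeline fits in a few lines with the diffusers library. A usage sketch (the exact hub id of the v1.5 checkpoint has moved over time, so treat it as an assumption):

```python
import torch
from diffusers import StableDiffusionPipeline

# Download the openly released weights and generate an image from a sentence.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed checkpoint id; mirrors exist
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

image = pipe("a cat sitting on a mountain at sunset").images[0]
image.save("cat_on_mountain.png")
```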
The Family Tree
How they all fit together:
2015
Diffusion Models (Sohl-Dickstein et al.) – first proof that reversing diffusion works, but blurry results
2020
DDPM (Ho et al.) – "just predict the noise" – first sharp, high-quality images
2021
DDIM, Guided Diffusion – faster sampling, text/class guidance
2022
Stable Diffusion (Rombach et al.) – latent space + text encoder + open-source – the revolution
2023–
DALL·E 3, Midjourney v5, SDXL, Sora – same core idea, scaled up to video and beyond