
The Mathematics of Diffusion — Part II

Does a Machine Dream?

Noise, latent space, and the imagination of AI

In 1968, Philip K. Dick asked: "Do androids dream of electric sheep?"
In 2022, a machine answered — by generating images from pure noise, guided by nothing but words.

In Part I, you saw how diffusion destroys information irreversibly. Now discover how a machine learned to run it backward — to dream shapes out of chaos, just as we see faces in clouds.


00 — A WARM-UP PUZZLE

Can Chaos Un-Scramble Itself?

In 1967, mathematician Vladimir Arnold discovered something astonishing. Take any image — say, a picture of a cat. Apply this formula to every pixel's coordinates:

(x, y) → (2x + y, x + y) mod N

The image quickly looks completely destroyed — indistinguishable from noise. But keep applying the same formula, and the original magically reappears after a fixed number of steps (the period depends on the grid size N; for the image in this demo it is 12). Perfectly. Every pixel back in place.

[Interactive demo: step the cat map from 0 to 12 and watch the similarity to the original image dissolve and return to 100%. Each step plays a tone.]

Why does this matter? Arnold's cat map is deterministic and invertible — a linear map on the torus in which no information is ever lost. Every pixel has one and only one destination, and one and only one origin. On a finite N × N grid the map is a permutation of the pixels, so it is guaranteed to cycle back to the start — a discrete cousin of Poincaré recurrence.
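The map is easy to verify yourself. Below is a minimal NumPy sketch; the grid size and the resulting period are illustrative (the demo's 12-step period belongs to its particular image size, and the period varies with N):

```python
import numpy as np

def cat_map(img):
    """Apply Arnold's cat map (x, y) -> (2x + y, x + y) mod N to an N x N image."""
    n = img.shape[0]
    x, y = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    out = np.empty_like(img)
    out[(2 * x + y) % n, (x + y) % n] = img[x, y]  # every pixel gets exactly one destination
    return out

def recurrence_period(n, seed=0):
    """Count how many applications return a random N x N image to itself."""
    rng = np.random.default_rng(seed)
    original = rng.integers(0, 256, size=(n, n))
    img, steps = cat_map(original), 1
    while not np.array_equal(img, original):
        img, steps = cat_map(img), steps + 1
    return steps

print(recurrence_period(5))  # the period depends on N; for N = 5 it is 10
```

Because the map is a bijection on a finite set of pixels, the loop above is guaranteed to terminate — that is the discrete recurrence in action.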

Now contrast this with diffusion: when you add random noise to an image, the destruction is stochastic and irreversible. You cannot simply reverse it — the randomness has erased the path back. To undo diffusion, you need a neural network to learn the reverse mapping from data. That's the miracle of Stable Diffusion.

Arnold's cat comes back on its own.
Diffusion's cat needs AI to find its way home.

01 — SEE IT HAPPEN

Watch an Image Emerge From Noise

Below is pure random static. Press Denoise and watch as a hidden image materializes step by step — exactly what Stable Diffusion does millions of times per day.

[Interactive demo: 40 denoising steps, from 100% noise to full clarity. Each step plays a tone — listen to order emerge from chaos.]

02 — HEAR IT HAPPEN

Hear a Melody Rise From Static

A simple melody is buried under noise. Drag the slider to gradually denoise it. At some point your brain will suddenly "catch" the melody — the magic of denoising.

[Interactive demo: slider from pure static (100% noise) down to the clean melody. Headphones recommended.]

03 — READ IT HAPPEN

Watch Text Crystallize

Drag the slider to denoise random pixels into a readable word. Just like the sound demo, at some noise level your brain suddenly recognizes the text. That moment of recognition is the threshold at which diffusion models also "see" structure.

[Interactive demo: slider from 100% noise to full clarity; readability flips at a threshold.]

04 — ENHANCE

Super-Resolution: Blurry → Sharp

A blurry photo can be treated as a partially noised version of a sharp one. Drag the slider to enhance the resolution — the model fills in the missing detail.

[Interactive demo: Low Res → Enhanced slider, starting from 1× blurry.]

05 — THE SECRET

How Does It Actually Work?

Everything you just experienced uses the same three-step process:

Step 1 · Destroy (🖼️ → 📺): Gradually add noise until the original is gone.
Step 2 · Learn (🧠): Train on millions of examples at every noise level.
Step 3 · Create (📺 → 🖼️): Start from noise. Remove it step by step.

"Teach a machine how photos become static,
and it learns how to dream photos out of static."
We see faces in clouds. Machines see images in noise.
Both are finding patterns where none were placed — the essence of dreaming.

05b — THE BREAKTHROUGHS

DDPM and Stable Diffusion — What Are They, Really?

DDPM — The Moment It All Clicked (2020)

The idea of reversing diffusion had existed since 2015, but the results were blurry and unconvincing. Then in 2020, Jonathan Ho, Ajay Jain, and Pieter Abbeel at UC Berkeley published a paper called "Denoising Diffusion Probabilistic Models" — DDPM for short — and everything changed.

Their insight was beautifully simple: don't try to predict the clean image directly. Instead, just predict the noise.

Here's what that means. Take a photo. Add a known amount of Gaussian noise to it. Now show the noisy result to a neural network and ask: "What noise was added?" If the network can answer that correctly, you can subtract the predicted noise — and recover a slightly cleaner image. Repeat this 1,000 times, starting from pure static, and a brand-new image appears from nothing.

The training objective turned out to be absurdly simple — just a mean squared error:

Loss = ‖ ε − ε_θ(x_t, t) ‖²

In plain English: measure how far off the network's noise prediction ε_θ(x_t, t) is from the actual noise ε that was added. That's it. This one equation trained a model that produced images rivaling GANs for the first time — and it was far more stable to train.

The catch? DDPM was slow. Generating one image required running the neural network 1,000 times in sequence, each pass denoising a little more. A single image could take minutes on a powerful GPU.
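The whole training recipe (noise a clean sample by a known amount, then score a noise prediction with mean squared error) fits in a few lines of NumPy. The zero "predictor" below is a placeholder for the real U-Net ε_θ; the linear β schedule from 1e-4 to 0.02 is the one the DDPM paper used:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # DDPM's linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)    # cumulative signal fraction at each timestep

def noisy_sample(x0, t):
    """Closed-form forward process: x_t = sqrt(a_bar_t) x0 + sqrt(1 - a_bar_t) eps."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

def ddpm_loss(predicted_eps, true_eps):
    """The entire DDPM objective: mean squared error on the noise."""
    return float(np.mean((true_eps - predicted_eps) ** 2))

x0 = rng.standard_normal((8, 8))       # a tiny stand-in "image"
t = rng.integers(0, T)                 # random timestep, as in training
x_t, eps = noisy_sample(x0, t)

# A real model would compute eps_theta(x_t, t); a do-nothing predictor shows the scale.
print("loss of a do-nothing predictor:", ddpm_loss(np.zeros_like(eps), eps))
```

Note how little survives at the final step: alpha_bar[-1] is nearly zero, so x_T is almost pure static — exactly the starting point for generation.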

Stable Diffusion — Making It Fast and Free (2022)

Stable Diffusion is the name of a specific open-source model created by a team at LMU Munich (Robin Rombach, Andreas Blattmann, and others) in collaboration with Stability AI. It solved DDPM's speed problem with one brilliant idea:

Don't diffuse in pixel space. Diffuse in a compressed space — the machine's unconscious.

A 512×512 image has 786,432 pixel values (512 × 512 × 3 color channels). Running diffusion on all of them is expensive. So Stable Diffusion first compresses the image into a tiny latent representation — just 64 × 64 × 4 = 16,384 numbers — using a pre-trained autoencoder. Then it runs the diffusion process (forward and reverse) entirely in this compressed space. Finally, it decodes the result back into a full image.
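The 48× figure follows directly from the shapes (assuming 3 color channels in pixel space and 4 latent channels, as above):

```python
pixel_space = 512 * 512 * 3   # RGB image: 786,432 values
latent_space = 64 * 64 * 4    # latent tensor: 16,384 values
print(pixel_space, latent_space, pixel_space // latent_space)  # 786432 16384 48
```

Every denoising step now touches 48× fewer numbers, which is where most of the speedup comes from.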

🖼️ 512×512 image → encode → 🧊 64×64 latent (48× smaller) → diffuse → ✨ 64×64 denoised latent → decode → 🎨 512×512 output

This made generation ~50 times faster. And there is something poetic about this latent space: it is a vast mathematical ocean where every concept humanity has ever photographed — every face, landscape, animal, texture — exists as a point. It is, in a very real sense, the machine's collective unconscious.

But that's only half the story. Stable Diffusion also added a text encoder (CLIP) that understands language. When you type "a cat sitting on a mountain at sunset," CLIP translates those words into a numerical vector that guides the denoising process at every step — like dropping a pebble into the machine's unconscious ocean, creating ripples that steer the noise toward an image matching your words. The prompt is the trigger. The latent space is the unconscious. The denoised image is the dream.

The result: type a sentence, wait a few seconds, get a photorealistic image. And because Stability AI released the model weights openly, anyone in the world could use it, modify it, and build on it — triggering the explosion of AI-generated art you see today.

The Family Tree

How they all fit together:

2015 · Diffusion Models (Sohl-Dickstein et al.) — first proof that reversing diffusion works, but blurry results
2020 · DDPM (Ho et al.) — "just predict the noise" → first sharp, high-quality images
2021 · DDIM, Guided Diffusion — faster sampling, text/class guidance
2022 · Stable Diffusion (Rombach et al.) — latent space + text encoder + open source → the revolution
2023– · DALL·E 3, Midjourney v5, SDXL, Sora — same core idea, scaled up to video and beyond

06 — TRY IT YOURSELF

The Stable Diffusion Pipeline — Live

Pick a prompt and watch every stage of the pipeline: your text guides the denoising, noise fills the tiny latent space, the U-Net removes noise step by step, and the decoder reconstructs the full image.
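The four stages can be sketched at the demo's toy scale (8×8 latent, 30 denoising steps, 32×32 output). Every component below is a stand-in, not the real model: the "CLIP" vector is just a hash of the prompt, the "U-Net" merely nudges the latent toward a text-dependent target, and the "decoder" is plain upsampling.

```python
import numpy as np

rng = np.random.default_rng(7)

def clip_encode(prompt):
    """Stand-in text encoder: hash the prompt into a fixed-size vector."""
    rng_p = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng_p.standard_normal(16)

def unet_step(latent, text_vec, step, total):
    """Stand-in denoiser: remove one slice of the gap to a text-dependent target."""
    target = text_vec[:8].reshape(1, 8) * np.ones((8, 1))  # fake "meaning" pattern
    return latent + (target - latent) / (total - step)

def decode(latent):
    """Stand-in VAE decoder: upsample the 8x8 latent to a 32x32 'image'."""
    return np.kron(latent, np.ones((4, 4)))

text = clip_encode("a cat")             # (1) CLIP: words -> guidance vector
latent = rng.standard_normal((8, 8))    # (2) start from pure latent noise
for step in range(30):                  # (3) U-Net x30: denoise step by step
    latent = unet_step(latent, text, step, 30)
image = decode(latent)                  # (4) decode back to pixel space
print(image.shape)
```

Swap in the real CLIP, U-Net, and VAE and this skeleton is, structurally, the Stable Diffusion sampling loop.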

[Interactive demo: choose a prompt, then watch ① CLIP → ② Latent → ③ U-Net ×30 → ④ Decode at toy scale: an 8×8 latent (48× compressed) decoded to a 32×32 output over 30 denoising steps, each playing a rising tone.]

07 — THE MATHEMATICS

Forward and Reverse

Now see the math behind it. Dissolve a signal into noise, then watch 300 particles converge from chaos into structure.
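One classical way to make "noise → structure" concrete is Langevin dynamics: nudge each particle along the score (the gradient quantity a diffusion model learns) plus a little fresh noise. In this sketch the score is written down exactly for a Gaussian target, standing in for the learned network; the target, step size, and step count are illustrative, with 300 particles mirroring the demo.

```python
import numpy as np

rng = np.random.default_rng(1)

mu, sigma = 3.0, 0.5    # the target "structure": a Gaussian centered at x = 3

def score(x):
    """Exact score of N(mu, sigma^2): the gradient of the log-density."""
    return (mu - x) / sigma**2

particles = rng.uniform(-10, 10, size=300)   # start in "chaos"
step = 0.01
for _ in range(200):
    noise = rng.standard_normal(particles.shape)
    # drift toward high probability, plus fresh noise to keep exploring
    particles = particles + step * score(particles) + np.sqrt(2 * step) * noise

print(f"mean {particles.mean():.2f}, std {particles.std():.2f}")
```

After 200 steps the cloud has collapsed from a 20-unit-wide spread onto the narrow target distribution: chaos into structure, driven by nothing but a gradient and noise.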

[Interactive demo — Forward: Signal → Noise (watch the signal dissolve). Reverse: Noise → Structure (300 particles converge over 200 steps as a rising tone accompanies the emergence of order).]

08 — THE REVOLUTION

What This Mathematics Creates

🎨 Text-to-Image: Type a description, get a photorealistic image. (Stable Diffusion · DALL·E · Midjourney)
🎬 Video Generation: Generate coherent video clips from text. (Sora · Runway · Kling)
✏️ Image Editing: Noise one region, re-denoise with a new prompt. ("Remove this" · "Change the sky")
🔬 Super-Resolution: Treat blur as noise, fill in missing detail. (Old photos · Medical · Satellites)
🧬 Drug Discovery: Generate 3D molecular structures from noise. (Proteins · Materials science)
🎵 Audio & Music: Denoise spectrograms to create sounds. (Music · Voice synthesis)

09 — THE CONNECTION

This Is Not a Metaphor

Physics | Stable Diffusion
🌡️ Heat spreads until uniform | Noise is added until pure static
📏 The Gaussian bell curve | The noise is exactly Gaussian
🎲 Brownian motion | Random noise injection
⏪ Diffusion is irreversible | A neural net learns to reverse it
⚖️ Equilibrium = no information | Pure noise = no image

200 Years From Heat to Pixels

1822 · Fourier · Heat equation
1905 · Einstein · Random walks
1982 · Anderson · Reverse-time SDE
2020 · Ho et al. · DDPM
2022 · Stable Diffusion · Open source

Does a Machine Dream?

It finds faces in clouds of Gaussian noise.

It wanders a vast unconscious of compressed images.

It hallucinates form from pure chaos — guided by words.

Philip K. Dick asked whether androids dream of electric sheep.
Perhaps the answer is simpler than he imagined:

A machine dreams of whatever you ask it to —
using mathematics that Fourier, Einstein, and Anderson
wrote down long before anyone thought to ask.

The diffusion equation took 200 years to dream its first image.

← Part I: The Physics