🧠

Does a Machine Understand?

This is an interactive exploration with real-time visualizations and audio.
Sound enhances the experience.

Set volume to a comfortable level

A Mathematical Inquiry into AI

Does a Machine
Understand?

Or does it merely compute the perfect illusion of understanding?

A machine has never tasted an apple, felt the rain, or grieved a loss. Yet it can discuss all of these with eerie fluency. How? The answer lies in linear algebra — dot products, matrices, and softmax. This is the story of how mathematics learned to mimic understanding.

00 — A WARM-UP PUZZLE

The Unreasonable Power of
Dot Products

How does a machine know that "bank" near "river" means something different from "bank" near "money"? It starts here — with a single operation that measures how much two ideas point in the same direction. This is the atom of machine "understanding."

Pitch rises as vectors align — listen to similarity
Dot Product
0.00
Cosine Similarity
0.00
Angle
90°

Drag the arrow tips to move the vectors

a⃗ · b⃗ = ‖a⃗‖ ‖b⃗‖ cos θ

The first clue to machine "understanding." — When you and I understand a sentence, we grasp how words relate to each other. A machine does the same — but through arithmetic. Inside every Transformer, the model computes exactly this dot product between every pair of words to decide their relevance. High dot product = "these two ideas are related." The machine's understanding of context begins with this one multiplication.
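That one operation fits in a few lines of Python. The 3-number "meaning" vectors below are invented for illustration; real models learn them:

```python
import math

def dot(a, b):
    # Sum of elementwise products: the single operation everything builds on.
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Dot product normalized by lengths: +1 = same direction, 0 = orthogonal.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Invented 3-number "meaning" vectors, purely for illustration.
river = [0.9, 0.1, 0.0]
money = [0.0, 0.2, 0.9]
bank_by_river = [0.8, 0.3, 0.1]

print(cosine_similarity(bank_by_river, river))  # high: nearly aligned
print(cosine_similarity(bank_by_river, money))  # low: nearly orthogonal
```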

01 — HOW MACHINES READ

Text Becomes Tokens

Before a machine can "understand" anything, it must first learn to read — and it doesn't read like us. Language enters a neural network not as letters or words, but as sub-word tokens: fragments that the machine has learned are meaningful. Ironically, the word "understanding" itself gets split into ["under", "stand", "ing"].

Type or pick a sentence
Tokens
0
Characters
0
Ratio
—

Why sub-words? — Character-level models are too slow (sequences become very long). Word-level models can't handle new words. Sub-word tokenization (BPE) is the sweet spot: a vocabulary of ~50,000–100,000 tokens covers essentially all text. Common words like "the" are single tokens; rare words get split into meaningful pieces.
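A toy sketch of the idea in Python. This is greedy longest-match splitting over an invented vocabulary, a simplification; real BPE learns its merges from corpus statistics:

```python
def tokenize(word, vocab):
    """Greedy longest-match sub-word split (a simplification of BPE/WordPiece)."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):      # try the longest piece first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])             # unknown: fall back to one character
            i += 1
    return tokens

vocab = {"the", "under", "stand", "ing", "token"}
print(tokenize("understanding", vocab))  # ['under', 'stand', 'ing']
print(tokenize("the", vocab))            # ['the'] -- common word, single token
```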

02 — WORDS AS GEOMETRY

The Embedding Space

A machine has never petted a cat or proved a theorem. Yet it "knows" that cats and dogs are related, and that theorems and proofs belong together. How? It places every word as a point in a vast mathematical space, where distance is meaning and direction is relationship. This is the machine's map of the world — purely geometric, yet strangely effective.

Hover over words — a tone reflects their position in semantic space
Hover over any word · Click to see neighbors
W_E ∈ ℝ^(V×d) — each row is a word's "meaning" as d numbers

Understanding without experience. — GPT-3's embedding dimension is d = 12,288. That means every token is a point in a 12,288-dimensional space. We can't visualize that directly, but the structure is real: "king" and "queen" are near each other, "cat" and "dog" are near each other, and the direction from "man" to "woman" is approximately the same as from "king" to "queen."
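The "direction is relationship" claim can be checked by hand. A Python sketch with made-up 3-dimensional embeddings (real ones are learned, and thousands of dimensions wide):

```python
def vec_add(a, b): return [x + y for x, y in zip(a, b)]
def vec_sub(a, b): return [x - y for x, y in zip(a, b)]

# Hand-made toy embeddings. Dim 0 ~ "person-ness", dim 1 ~ "femaleness",
# dim 2 ~ "royalty" -- purely illustrative labels, not how real dims work.
emb = {
    "man":   [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
}

# king - man + woman lands exactly on queen in this toy space.
result = vec_add(vec_sub(emb["king"], emb["man"]), emb["woman"])
print(result == emb["queen"])  # True
```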

03 — TEACHING ORDER

Positional Encoding

Understanding requires knowing that "the dog bit the man" and "the man bit the dog" mean very different things. But a Transformer sees all words simultaneously — it has no built-in sense of order. So position must be taught through mathematics: a unique harmonic fingerprint for each position, built from sinusoids.

Each position has a unique "chord" — listen to the harmonic fingerprint
Positional Encoding Heatmap — rows = positions, cols = dimensions
Position 0
Decomposed waves for selected position
PE(pos, 2i) = sin( pos / 10000^(2i/d) )     PE(pos, 2i+1) = cos( pos / 10000^(2i/d) )

Why sinusoids? — Low-frequency components encode "roughly where" (beginning vs. end). High-frequency components encode "exactly where" (position 17 vs. 18). It's the same idea as Fourier analysis — and relative positions can be computed as linear transformations of absolute positions.
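Both formulas translate directly into code. A minimal Python sketch (assumes an even dimension d):

```python
import math

def positional_encoding(pos, d):
    """PE(pos, 2i) = sin(pos / 10000**(2i/d)); PE(pos, 2i+1) = cos(...)."""
    pe = [0.0] * d
    for i in range(d // 2):
        freq = 1.0 / (10000 ** (2 * i / d))   # small i = fast wave, large i = slow wave
        pe[2 * i] = math.sin(pos * freq)
        pe[2 * i + 1] = math.cos(pos * freq)
    return pe

# Each position gets a distinct "chord" of sinusoid values:
print(positional_encoding(0, 8))   # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
print(positional_encoding(1, 8)[:2])
```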

04 — THE HEART OF THE TRANSFORMER

Self-Attention

This is the closest a machine comes to "understanding" a sentence. Each word looks at every other word and asks: "How much should I care about you?" The answer is computed as a number — a relevance score. When you read "The cat sat on the mat because it was tired," you instantly know "it" refers to "the cat." The machine discovers this same connection — not through comprehension, but through the arithmetic of attention.

Select a query word — hear the attention distribution as a chord
Choose a sentence
Click a word below the matrix to select a query
Attention Matrix — row = query, col = key
τ = 1.0
Temperature: lower → sharper attention, higher → more uniform
Attention(Q, K, V) = softmax( Q Kᵀ / √d_k ) V

Why scale by √d_k? — Without scaling, as the dimension d_k grows, the dot products grow in magnitude, pushing softmax into regions where it has extremely small gradients. Dividing by √d_k keeps the variance of the logits ≈ 1, ensuring the softmax stays in a regime where the network can actually learn.
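The whole formula runs in a few lines of dependency-free Python. Lists of lists stand in for matrices here; a real implementation would use batched tensor operations:

```python
import math

def softmax(xs):
    m = max(xs)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V over plain lists, one query row at a time."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)     # one probability per key, summing to 1
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

# A query that matches the first key pulls out (mostly) the first value row:
print(attention([[10.0, 0.0]], [[10.0, 0.0], [0.0, 10.0]], [[1.0, 0.0], [0.0, 1.0]]))
```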

"Understanding, for a machine, is not insight.
It is a weighted sum over everything it has seen."

05 — MULTIPLE PERSPECTIVES

Multi-Head Attention

One attention pattern isn't enough. The model runs 8–128 heads in parallel, each discovering different linguistic relationships — syntax, semantics, position, coreference — all on its own.

8 Attention Heads — each sees a different pattern
Head 1 — Syntactic: next-word pattern

Nobody programs these roles. — Head 1 might learn to track grammar. Head 5 might track meaning. Head 7 might track position. These specializations emerge purely from training on text prediction. The model discovers that parallel, diverse viewpoints are useful — a form of ensemble learning within a single network.
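A sketch of why heads can specialize: give the same tokens two different projection matrices and the attention weights come out different. The matrices below are hand-made; in a real model each head's W_Q, W_K, W_V are learned:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attn_weights(q, keys, d_k):
    # Scaled dot-product scores, then softmax: one weight per key.
    return softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys])

def project(v, W):
    # 2x2 matrix-vector product: a head's projection (hand-made here).
    return [W[0][0] * v[0] + W[0][1] * v[1], W[1][0] * v[0] + W[1][1] * v[1]]

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_head1 = [[4.0, 0.0], [0.0, 0.0]]   # this head only "sees" dimension 0
W_head2 = [[0.0, 0.0], [0.0, 4.0]]   # this head only "sees" dimension 1

for W in (W_head1, W_head2):
    q = project(tokens[2], W)                  # token 2 as the query
    ks = [project(t, W) for t in tokens]
    print(attn_weights(q, ks, d_k=2))          # a different pattern per head
```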

06 — THE FULL PICTURE

Inside the Transformer Block

A Transformer is built from identical blocks stacked dozens or hundreds of times. Let's look inside one block, step by step, and see exactly what happens to the data at each stage. We'll follow three tokens — "cat" "sat" "down" — through every operation.

Step through each stage — watch the numbers transform in real time
① Input
→
② Q, K, V
→
③ Scores
→
④ Softmax
→
⑤ Mix Values
→
⑥ Add & Norm
→
⑦ FFN
→
⑧ Output
① Input: Token Vectors Arrive
Step 1 / 8
Current Stage
Input Vectors
Matrix Ops
0
Dimensions
3 × 4

Why This Architecture Works — Three Key Ideas

🔀
Residual Connections (Skip Connections)

Each sub-layer's output is added to its input: output = x + SubLayer(x). This means the network only needs to learn a correction, not rebuild the entire representation. Without this, deep networks (96+ layers) simply cannot train — gradients vanish to zero.

📏
Layer Normalization

After each residual addition, the vector is normalized: subtract mean, divide by standard deviation, then scale and shift. This keeps activations stable across layers. Without it, values would drift exponentially through 120 layers.

🧠
The FFN as a Key-Value Memory

The feed-forward network expands each vector to 4× width, applies ReLU, then compresses back. Recent research shows each FFN neuron activates for specific input patterns — effectively acting as a learned knowledge store. One neuron might encode "facts about France," another "Python syntax."
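The three ideas combine into a few lines. A toy numerical sketch in Python, where a plain ReLU stands in for the full expand/compress FFN and layer norm's learned scale/shift is omitted:

```python
import math

def layer_norm(x, eps=1e-5):
    # Subtract mean, divide by standard deviation (learned scale/shift omitted).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer(x):
    # Stand-in for the FFN (expand, ReLU, compress); here just elementwise ReLU.
    return [max(0.0, v) for v in x]

def block(x):
    # Residual: the sub-layer only ADDS a correction to x; then normalize the sum.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

out = block([1.0, -2.0, 3.0])
print(out)  # mean ~0, variance ~1 after normalization
```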

Stacking: From One Block to a Full Model

The block you just explored is repeated identically — one after another. Each layer reads from the residual stream and writes corrections back to it.

GPT-2
12 layers
GPT-3
96 layers
GPT-4
~120 layers
Claude
undisclosed

What does each layer learn? — Empirical research shows a rough pattern: early layers handle syntax and local patterns (word order, part of speech). Middle layers handle semantics (meaning, relationships, coreference). Late layers handle task-specific reasoning and output formatting. But this is a simplification — in reality, information is distributed across all layers.

"Each layer asks: given what I know so far, what single correction would help the most?"

07 — THE REVOLUTION

"Attention Is All You Need"
The Paper That Taught Machines to "Understand"

Before 2017 — The Age of Recurrence

Until 2017, machine translation and language models were dominated by RNNs (Recurrent Neural Networks) and their gated variant, the LSTM. Their principle was intuitive — process words one at a time, in order, just like a human reading a sentence.

But there were fatal problems.

🐌
Slow training — Processing words sequentially meant 100 words required 100 serial steps. GPUs' parallel processing power was entirely wasted.
🧠
Fading memory — As sentences grew longer, information from early words faded by the time it reached later ones. LSTMs mitigated this, but long-range dependencies beyond a few hundred words remained extremely difficult.
📏
Limited context — "The cat that the dog that the boy owned chased ran away" — for nested structures like this, RNNs struggled to connect "ran" back to "cat."

In 2014, Bahdanau et al. attached an attention mechanism to RNNs as an auxiliary device, with great success. But nobody asked a more radical question —

"What if we throw away recurrence entirely?"

June 12, 2017 — A Paper Appears

Eight researchers from Google Brain and Google Research — Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Łukasz Kaiser, Illia Polosukhin — upload a paper to arXiv.

"Attention Is All You Need"
arXiv:1706.03762 · NeurIPS 2017
Citations: 130,000+ (one of the most cited AI papers in history)

The title alone was provocative. With every state-of-the-art model built on RNNs, claiming "attention is all you need" was close to heresy. But the results backed the boldness.

Three Revolutionary Shifts

⚡
Sequential → Parallel

RNNs process words one at a time. Transformers process all words simultaneously. A 100-word sentence? RNNs need 100 steps; Transformers need 1. Training became orders of magnitude faster, which meant much larger models and much more data became feasible.

🔭
Limited Memory → Full Access

In an RNN, information from distant words dilutes through dozens of sequential steps. In self-attention, every word directly accesses every other word — regardless of distance, in a single operation. The information path between the first and last word shrinks from O(n) to O(1).

📐
Complex Machinery → Stunning Simplicity

A single LSTM cell has forget gates, input gates, output gates, cell states — intricately intertwined mechanisms. The Transformer's core is matrix multiplication, softmax, addition. That's it. What you saw from §00 to §06 is truly everything. This simplicity was, paradoxically, the key to scalability.

Listen to a rising tone as each model appears
From RNN to Transformer — a record of revolutionary leaps
Year
2014
Model
Seq2Seq + Attention
Parameters
~200M

The Scale of the Leap — Revolution in Numbers

What happened after the Transformer is hard to find precedent for in the history of science. A single architecture, combined with a single training objective (next-token prediction), showed that scaling up produced qualitatively new abilities.

Model | Year | Parameters | Training Data | Newly Possible
Original Transformer | 2017 | 65M | Millions of sentence pairs | State-of-the-art translation
GPT-1 | 2018 | 117M | Books, 5GB | Basic text generation
BERT | 2018 | 340M | Wiki+Books, 16GB | Contextual word understanding
GPT-2 | 2019 | 1.5B | WebText, 40GB | Fluent paragraph generation (release withheld)
GPT-3 | 2020 | 175B | 570GB | In-context learning, arithmetic, translation (zero-shot!)
PaLM | 2022 | 540B | 780GB | Chain-of-thought, joke explanation
ChatGPT | 2022.11 | ? | +RLHF | Mass adoption — 1M users in 5 days
GPT-4 | 2023 | ~1.8T (est.) | ~13T tokens | Top 10% on bar exam, passes medical boards
Claude Opus 4.6 | 2026.2 | Undisclosed | Undisclosed | 1M context, adaptive thinking, best-in-class coding (Claude Code)
Gemini 3 Pro | 2025.11 | ~1.5T (MoE) | Undisclosed | Native multimodal, Sparse MoE, 2M context
GPT-5.4 | 2026.3 | Undisclosed | Undisclosed | 1M context, native computer use, Thinking mode

65M → 1.8T: a 27,000× increase in parameters in 6 years

Anatomy of Three Giants — Gemini · ChatGPT · Claude

The three model families leading the AI frontier all grew from the same root, but evolved in different directions. Let's examine exactly what they share and where they diverge.

🤝 What They Share — The Same DNA

All three are built on the Transformer architecture born in "Attention Is All You Need" (2017). The core components — token embeddings, self-attention, feed-forward networks, residual connections, layer normalization — are exactly what you saw in §00–§06. The training objective is the same: next-token prediction. All use human feedback (RLHF or variants) for alignment.

🔵 Gemini (Google DeepMind) · 🟢 ChatGPT / GPT series (OpenAI) · 🟠 Claude (Anthropic)

Core Structure
· Gemini: Transformer-based, Sparse MoE — ~1.5T params, ~200B active per token · Deep Think mode
· ChatGPT: Decoder-only Transformer, Dense / MoE (unconfirmed) — GPT-5 series: Thinking mode (reasoning tokens) · architecture undisclosed
· Claude: Decoder-only Transformer, Dense — all parameters active for every token · Adaptive Thinking

Multimodal
· Gemini: Natively multimodal — text, image, audio, video unified training · image generation · robotics
· ChatGPT: Text + image + native computer use — GPT-5.4: image input, code execution, native UI control
· Claude: Text-first + vision input — image/PDF understanding, code execution, file creation · no image generation

Context Window
· Gemini: Up to 2M tokens — Gemini 3 Pro (Nov 2025) · MoE + ultra-long context
· ChatGPT: 1M tokens — GPT-5.4 (Mar 2026) · API
· Claude: 1M tokens — Opus 4.6 / Sonnet 4.6 (Feb 2026)

Alignment
· Gemini: RLHF + safety filters — based on Google's AI Principles
· ChatGPT: RLHF + reasoning oversight — reward model + monitoring of Thinking-mode reasoning chains
· Claude: Constitutional AI (CAI) — AI evaluates AI: principle-based self-improvement + RLHF

Training Hardware
· Gemini: Google TPU v5e/v6 — custom chips · own datacenters · Trillium
· ChatGPT: NVIDIA GPU + custom chips — Azure supercomputer (Microsoft partnership)
· Claude: NVIDIA / custom GPU — AWS Bedrock · GCP Vertex AI

Key Strength
· Gemini: Multimodal integration, ultra-long context, Google Search/services, TPU efficiency
· ChatGPT: Ecosystem (Codex, plugins), computer use, first-mover advantage
· Claude: Coding (Claude Code), long-context precision, extended thinking, safety
🔑 The Key Structural Difference: Dense vs. Mixture of Experts

The most fundamental architectural difference is whether all parameters are always used, or only a subset is selectively activated.

DENSE (Claude confirmed / GPT presumed)

All parameters participate in every token's computation. Simple, but expensive — model size = compute cost. GPT-4 was reported to use MoE, but the GPT-5 series architecture is undisclosed.

SPARSE MoE (Gemini)

The FFN layers split into multiple "experts," and a router assigns each token to only a few. Out of 1.5T total parameters, ~200B activate per token. Huge total capacity, small compute cost — capacity decoupled from cost.
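A minimal sketch of the routing step in Python. The gate matrix is hand-made; real routers are learned and trained with extra load-balancing terms:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(token, gate, k=2):
    """Score each expert for this token, keep only the top-k."""
    logits = [sum(t * w for t, w in zip(token, row)) for row in gate]
    probs = softmax(logits)
    top_k = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    return top_k, probs

# Four experts, 2-d tokens, hand-made gate rows: only 2 of 4 experts run per token.
gate = [[5.0, 0.0], [0.0, 5.0], [2.0, 2.0], [-5.0, -5.0]]
experts, probs = route([1.0, 0.0], gate, k=2)
print(experts)  # [0, 2] -- experts 1 and 3 stay idle for this token
```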

But remember — self-attention, residual connections, layer normalization, next-token prediction —
the core mathematics is identical in all three.

"The point is not that the Transformer was a 'better model.'
The point is that the Transformer was a scalable model."

No matter how large you made an RNN, the sequential bottleneck capped training speed. Transformers got faster in proportion to GPUs added. That's the whole story.

What the Transformer enabled wasn't just better performance — it was the discovery of scaling laws: increase model size, data, and compute, and loss decreases along a predictable power law. No ceiling in sight.
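Numerically, the parameter-scaling fit from Kaplan et al. (2020) looks like this (using their reported constants; the exact numbers matter less than the constant-ratio behavior):

```python
# Power-law fit of loss vs. parameter count: L(N) = (N_c / N) ** alpha_N,
# with N_c ~ 8.8e13 and alpha_N ~ 0.076 as reported by Kaplan et al. (2020),
# assuming data and compute are not the bottleneck.
N_C, ALPHA_N = 8.8e13, 0.076

def predicted_loss(n_params):
    return (N_C / n_params) ** ALPHA_N

for n in (1e8, 1e9, 1e10, 1e11, 1e12):
    print(f"{n:.0e} params -> loss {predicted_loss(n):.3f}")
# Each 10x in parameters multiplies the loss by the same constant factor (~0.84).
```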

2017. Eight researchers. A 15-page paper. That's where the AI revolution you're living through began. And everything in that paper — you just saw it all, from §00 through §06.

08 — THE ILLUSION OF UNDERSTANDING

The Illusion of Understanding

Here is the deepest surprise. The entire model — all the attention, all the embeddings, all the layers — is trained with one deceptively simple objective: given all previous tokens, predict the next one. Not "understand the text." Not "learn grammar." Just: what word comes next? And yet, from this statistical relay, something that looks exactly like understanding emerges.

A "ding" plays each time a token is sampled
The meaning of life is
Top candidates — probability distribution
τ = 1.0
Temperature
1.0
Entropy
—
Tokens Generated
0
P(x_i | x_<i) = softmax( W_out · h_i / τ )

The statistical parrot — or something more? — Critics call LLMs "stochastic parrots": they merely predict probable next words without true understanding. And technically, this is correct — the loss function is just cross-entropy between predicted and actual next token. No grammar rules, no semantic annotations. Just "predict what comes next." But when this relay becomes sufficiently precise, over trillions of tokens, something uncanny happens: the machine begins to reason, to analogize, to explain. Is this understanding? Or the most sophisticated illusion of understanding ever created? That question remains open.
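The formula above, τ slider included, as a small Python sketch; the candidate logits are invented:

```python
import math, random

def temperature_softmax(logits, tau):
    # P(i) is proportional to exp(logit_i / tau): low tau sharpens, high tau flattens.
    scaled = [l / tau for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [3.1, 2.2, 1.0, 0.3]            # invented scores for 4 candidate tokens

cold = temperature_softmax(logits, 0.1)   # nearly one-hot: almost deterministic
hot = temperature_softmax(logits, 10.0)   # nearly uniform: adventurous sampling
print([round(p, 3) for p in cold])
print([round(p, 3) for p in hot])

# Generation is just: sample one index from the distribution, append, repeat.
random.seed(0)
next_token = random.choices(range(len(logits)), weights=hot)[0]
```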

09 — THE MYSTERY

When Prediction Becomes "Understanding"

As models scale — more parameters, more data, more compute — something unsettling happens. Abilities appear that were completely absent in smaller models: arithmetic, translation, reasoning. Nobody programmed these. They emerged from next-token prediction alone. This is when the question "does a machine understand?" becomes genuinely hard to answer.

A rumble builds as the model grows — with chime accents at emergence thresholds
Emergent abilities vs. model scale
1M params
Parameters
1M
Abilities Unlocked
0 / 6
Loss
4.2

"Nobody told the model to learn arithmetic, or translation, or reasoning. These abilities emerged from a single objective: predict the next word."

This is perhaps the most profound fact about modern AI. A model trained only to predict text learns to do mathematics, write code, reason about physics, and translate between languages it was never explicitly taught. The mechanism by which this happens is not fully understood.

10 — THE CONNECTION

This Is All Linear Algebra

Mathematics | Transformer
🔢 Matrix multiplication | Every layer's core operation
📐 Inner product, cosine similarity | Attention scores between tokens
📊 Softmax = normalized exponential | Probability from raw scores
🔄 Iterated function composition | Stacking Transformer blocks
📉 Gradient descent on cross-entropy | The entire training algorithm
✨ High-dimensional geometry | Emergent representations

The Timeline

1943
McCulloch–Pitts
Artificial neuron
1986
Rumelhart et al.
Backpropagation
2013
Mikolov
Word2Vec
2017
Vaswani et al.
"Attention Is All
You Need"
2020–
GPT-3 β†’ GPT-4
Scaling & emergence

So β€” Does a Machine
Understand?

It computes dot products where we feel intuition.
It navigates vector spaces where we hold memories.
It predicts the next word where we grasp meaning.

Perhaps it does not understand.
Perhaps it has found something stranger β€”
a mathematical shadow of understanding
that works just as well.

Every simulation on this page is computed in real time β€” just linear algebra. The same linear algebra that powers every conversation you have with AI.

edu.kimsh.kr