🧠

Does a Machine Understand?

This is an interactive exploration with real-time visualizations and audio.
Sound enhances the experience.

Set volume to a comfortable level

A Mathematical Inquiry into AI

Does a Machine
Understand?

Or does it merely compute the perfect illusion of understanding?

A machine has never tasted an apple, felt the rain, or grieved a loss. Yet it can discuss all of these with eerie fluency. How? The answer lies in linear algebra — dot products, matrices, and softmax. This is the story of how mathematics learned to mimic understanding.

00 — A WARM-UP PUZZLE

The Unreasonable Power of
Dot Products

How does a machine know that "bank" near "river" means something different from "bank" near "money"? It starts here — with a single operation that measures how much two ideas point in the same direction. This is the atom of machine "understanding."

Pitch rises as vectors align — listen to similarity
Dot Product
0.00
Cosine Similarity
0.00
Angle
90°

Drag the arrow tips to move the vectors

a⃗ · b⃗ = ‖a⃗‖ ‖b⃗‖ cos θ

The first clue to machine "understanding." — When you and I understand a sentence, we grasp how words relate to each other. A machine does the same — but through arithmetic. Inside every Transformer, the model computes exactly this dot product between every pair of words to decide their relevance. High dot product = "these two ideas are related." The machine's understanding of context begins with this one multiplication.
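That one operation fits in a few lines of Python. The 3-number "meaning" vectors below are invented for illustration; real models learn them:

```python
import math

def dot(a, b):
    # Sum of elementwise products: the single operation everything builds on.
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Dot product normalized by lengths: +1 = same direction, 0 = orthogonal.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Invented 3-number "meaning" vectors, purely for illustration.
river = [0.9, 0.1, 0.0]
money = [0.0, 0.2, 0.9]
bank_by_river = [0.8, 0.3, 0.1]

print(cosine_similarity(bank_by_river, river))  # high: nearly aligned
print(cosine_similarity(bank_by_river, money))  # low: nearly orthogonal
```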

01 — HOW MACHINES READ

Text Becomes Tokens

Before a machine can "understand" anything, it must first learn to read — and it doesn't read like us. Language enters a neural network not as letters or words, but as sub-word tokens: fragments that the machine has learned are meaningful. Ironically, the word "understanding" itself gets split into ["under", "stand", "ing"].

Type or pick a sentence
Tokens
0
Characters
0
Ratio
—

Why sub-words? — Character-level models are too slow (sequences become very long). Word-level models can't handle new words. Sub-word tokenization (BPE) is the sweet spot: a vocabulary of ~50,000–100,000 tokens covers essentially all text. Common words like "the" are single tokens; rare words get split into meaningful pieces.
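A toy sketch of the idea in Python. This is greedy longest-match splitting over an invented vocabulary, a simplification; real BPE learns its merges from corpus statistics:

```python
def tokenize(word, vocab):
    """Greedy longest-match sub-word split (a simplification of BPE/WordPiece)."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):      # try the longest piece first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])             # unknown: fall back to one character
            i += 1
    return tokens

vocab = {"the", "under", "stand", "ing", "token"}
print(tokenize("understanding", vocab))  # ['under', 'stand', 'ing']
print(tokenize("the", vocab))            # ['the'] -- common word, single token
```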

02 — WORDS AS GEOMETRY

The Embedding Space

A machine has never petted a cat or proved a theorem. Yet it "knows" that cats and dogs are related, and that theorems and proofs belong together. How? It places every word as a point in a vast mathematical space, where distance is meaning and direction is relationship. This is the machine's map of the world — purely geometric, yet strangely effective.

Hover over words — a tone reflects their position in semantic space
Hover over any word · Click to see neighbors
W_E ∈ ℝ^(V×d) — each row is a word's "meaning" as d numbers

Understanding without experience. — GPT-3's embedding dimension is d = 12,288. That means every token is a point in a 12,288-dimensional space. We can't visualize that directly, but the structure is real: "king" and "queen" are near each other, "cat" and "dog" are near each other, and the direction from "man" to "woman" is approximately the same as from "king" to "queen."
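The "direction is relationship" claim can be checked by hand. A Python sketch with made-up 3-dimensional embeddings (real ones are learned, and thousands of dimensions wide):

```python
def vec_add(a, b): return [x + y for x, y in zip(a, b)]
def vec_sub(a, b): return [x - y for x, y in zip(a, b)]

# Hand-made toy embeddings. Dim 0 ~ "person-ness", dim 1 ~ "femaleness",
# dim 2 ~ "royalty" -- purely illustrative labels, not how real dims work.
emb = {
    "man":   [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
}

# king - man + woman lands exactly on queen in this toy space.
result = vec_add(vec_sub(emb["king"], emb["man"]), emb["woman"])
print(result == emb["queen"])  # True
```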

03 — TEACHING ORDER

Positional Encoding

Understanding requires knowing that "the dog bit the man" and "the man bit the dog" mean very different things. But a Transformer sees all words simultaneously — it has no built-in sense of order. So position must be taught through mathematics: a unique harmonic fingerprint for each position, built from sinusoids.

Each position has a unique "chord" — listen to the harmonic fingerprint
Positional Encoding Heatmap — rows = positions, cols = dimensions
Position 0
Decomposed waves for selected position
PE(pos, 2i) = sin( pos / 10000^(2i/d) )     PE(pos, 2i+1) = cos( pos / 10000^(2i/d) )

Why sinusoids? — Low-frequency components encode "roughly where" (beginning vs. end). High-frequency components encode "exactly where" (position 17 vs. 18). It's the same idea as Fourier analysis — and relative positions can be computed as linear transformations of absolute positions.
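Both formulas translate directly into code. A minimal Python sketch (assumes an even dimension d):

```python
import math

def positional_encoding(pos, d):
    """PE(pos, 2i) = sin(pos / 10000**(2i/d)); PE(pos, 2i+1) = cos(...)."""
    pe = [0.0] * d
    for i in range(d // 2):
        freq = 1.0 / (10000 ** (2 * i / d))   # small i = fast wave, large i = slow wave
        pe[2 * i] = math.sin(pos * freq)
        pe[2 * i + 1] = math.cos(pos * freq)
    return pe

# Each position gets a distinct "chord" of sinusoid values:
print(positional_encoding(0, 8))   # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
print(positional_encoding(1, 8)[:2])
```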

04 — THE HEART OF THE TRANSFORMER

Self-Attention

This is the closest a machine comes to "understanding" a sentence. Each word looks at every other word and asks: "How much should I care about you?" The answer is computed as a number — a relevance score. When you read "The cat sat on the mat because it was tired," you instantly know "it" refers to "the cat." The machine discovers this same connection — not through comprehension, but through the arithmetic of attention.

Select a query word — hear the attention distribution as a chord
Choose a sentence
Click a word below the matrix to select a query
Attention Matrix — row = query, col = key
τ = 1.0
Temperature: lower → sharper attention, higher → more uniform
Attention(Q, K, V) = softmax( Q Kᵀ / √d_k ) V

Why scale by √d_k? — Without scaling, as the dimension d_k grows, the dot products grow in magnitude, pushing softmax into regions where it has extremely small gradients. Dividing by √d_k keeps the variance of the logits ≈ 1, ensuring the softmax stays in a regime where the network can actually learn.
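The whole formula runs in a few lines of dependency-free Python. Lists of lists stand in for matrices here; a real implementation would use batched tensor operations:

```python
import math

def softmax(xs):
    m = max(xs)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V over plain lists, one query row at a time."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)     # one probability per key, summing to 1
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

# A query that matches the first key pulls out (mostly) the first value row:
print(attention([[10.0, 0.0]], [[10.0, 0.0], [0.0, 10.0]], [[1.0, 0.0], [0.0, 1.0]]))
```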

"Understanding, for a machine, is not insight.
It is a weighted sum over everything it has seen."

05 — MULTIPLE PERSPECTIVES

Multi-Head Attention

One attention pattern isn't enough. The model runs 8–128 heads in parallel, each discovering different linguistic relationships — syntax, semantics, position, coreference — all on its own.

8 Attention Heads — each sees a different pattern
Head 1 — Syntactic: next-word pattern

Nobody programs these roles. — Head 1 might learn to track grammar. Head 5 might track meaning. Head 7 might track position. These specializations emerge purely from training on text prediction. The model discovers that parallel, diverse viewpoints are useful — a form of ensemble learning within a single network.
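A sketch of why heads can specialize: give the same tokens two different projection matrices and the attention weights come out different. The matrices below are hand-made; in a real model each head's W_Q, W_K, W_V are learned:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attn_weights(q, keys, d_k):
    # Scaled dot-product scores, then softmax: one weight per key.
    return softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys])

def project(v, W):
    # 2x2 matrix-vector product: a head's projection (hand-made here).
    return [W[0][0] * v[0] + W[0][1] * v[1], W[1][0] * v[0] + W[1][1] * v[1]]

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_head1 = [[4.0, 0.0], [0.0, 0.0]]   # this head only "sees" dimension 0
W_head2 = [[0.0, 0.0], [0.0, 4.0]]   # this head only "sees" dimension 1

for W in (W_head1, W_head2):
    q = project(tokens[2], W)                  # token 2 as the query
    ks = [project(t, W) for t in tokens]
    print(attn_weights(q, ks, d_k=2))          # a different pattern per head
```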

06 — THE FULL PICTURE

Inside the Transformer Block

A Transformer is built from identical blocks stacked dozens or hundreds of times. Let's look inside one block, step by step, and see exactly what happens to the data at each stage. We'll follow three tokens — "cat" "sat" "down" — through every operation.

Step through each stage — watch the numbers transform in real time
① Input
→
② Q, K, V
→
③ Scores
→
④ Softmax
→
⑤ Mix Values
→
⑥ Add & Norm
→
⑦ FFN
→
⑧ Output
① Input: Token Vectors Arrive
Step 1 / 8
Current Stage
Input Vectors
Matrix Ops
0
Dimensions
3 × 4

Why This Architecture Works — Three Key Ideas

🔀
Residual Connections (Skip Connections)

Each sub-layer's output is added to its input: output = x + SubLayer(x). This means the network only needs to learn a correction, not rebuild the entire representation. Without this, deep networks (96+ layers) simply cannot train — gradients vanish to zero.

📏
Layer Normalization

After each residual addition, the vector is normalized: subtract mean, divide by standard deviation, then scale and shift. This keeps activations stable across layers. Without it, values would drift exponentially through 120 layers.

🧠
The FFN as a Key-Value Memory

The feed-forward network expands each vector to 4× width, applies ReLU, then compresses back. Recent research shows each FFN neuron activates for specific input patterns — effectively acting as a learned knowledge store. One neuron might encode "facts about France," another "Python syntax."
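The three ideas combine into a few lines. A toy numerical sketch in Python, where a plain ReLU stands in for the full expand/compress FFN and layer norm's learned scale/shift is omitted:

```python
import math

def layer_norm(x, eps=1e-5):
    # Subtract mean, divide by standard deviation (learned scale/shift omitted).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer(x):
    # Stand-in for the FFN (expand, ReLU, compress); here just elementwise ReLU.
    return [max(0.0, v) for v in x]

def block(x):
    # Residual: the sub-layer only ADDS a correction to x; then normalize the sum.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

out = block([1.0, -2.0, 3.0])
print(out)  # mean ~0, variance ~1 after normalization
```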

Stacking: From One Block to a Full Model

The block you just explored is repeated identically — one after another. Each layer reads from the residual stream and writes corrections back to it.

GPT-2
12 layers
GPT-3
96 layers
GPT-4
~120 layers
Claude
undisclosed

What does each layer learn? — Empirical research shows a rough pattern: early layers handle syntax and local patterns (word order, part of speech). Middle layers handle semantics (meaning, relationships, coreference). Late layers handle task-specific reasoning and output formatting. But this is a simplification — in reality, information is distributed across all layers.

"Each layer asks: given what I know so far, what single correction would help the most?"

07 — THE REVOLUTION

"Attention Is All You Need"
The Paper That Taught Machines to "Understand"

Before 2017 — The Age of Recurrence

Until 2017, machine translation and language models were dominated by RNNs (Recurrent Neural Networks) and their gated variant, the LSTM. Their principle was intuitive — process words one at a time, in order, just like a human reading a sentence.

But there were fatal problems.

🐌
Slow training — Processing words sequentially meant 100 words required 100 serial steps. GPUs' parallel processing power was entirely wasted.
🧠
Fading memory — As sentences grew longer, information from early words faded by the time it reached later ones. LSTMs mitigated this, but long-range dependencies beyond a few hundred words remained extremely difficult.
📏
Limited context — "The cat that the dog that the boy owned chased ran away" — for nested structures like this, RNNs struggled to connect "ran" back to "cat."

In 2014, Bahdanau et al. attached an attention mechanism to RNNs as an auxiliary device, with great success. But nobody asked a more radical question —

"What if we throw away recurrence entirely?"

June 12, 2017 — A Paper Appears

Eight researchers from Google Brain and Google Research — Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Łukasz Kaiser, Illia Polosukhin — upload a paper to arXiv.

"Attention Is All You Need"
arXiv:1706.03762 · NeurIPS 2017
Citations: 130,000+ (one of the most cited AI papers in history)

The title alone was provocative. With every state-of-the-art model built on RNNs, claiming "attention is all you need" was close to heresy. But the results backed the boldness.

Three Revolutionary Shifts

⚡
Sequential → Parallel

RNNs process words one at a time. Transformers process all words simultaneously. A 100-word sentence? RNNs need 100 steps; Transformers need 1. Training became orders of magnitude faster, which meant much larger models and much more data became feasible.

🔭
Limited Memory → Full Access

In an RNN, information from distant words dilutes through dozens of sequential steps. In self-attention, every word directly accesses every other word — regardless of distance, in a single operation. The information path between the first and last word shrinks from O(n) to O(1).

📐
Complex Machinery → Stunning Simplicity

A single LSTM cell has forget gates, input gates, output gates, cell states — intricately intertwined mechanisms. The Transformer's core is matrix multiplication, softmax, addition. That's it. What you saw from §00 to §06 is truly everything. This simplicity was, paradoxically, the key to scalability.

Listen to a rising tone as each model appears
From RNN to Transformer — a record of revolutionary leaps
Year
2014
Model
Seq2Seq + Attention
Parameters
~200M

The Scale of the Leap — Revolution in Numbers

What happened after the Transformer is hard to find precedent for in the history of science. A single architecture, combined with a single training objective (next-token prediction), showed that scaling up produced qualitatively new abilities.

Model | Year | Parameters | Training Data | Newly Possible
Original Transformer | 2017 | 65M | Millions of sentence pairs | State-of-the-art translation
GPT-1 | 2018 | 117M | Books, 5GB | Basic text generation
BERT | 2018 | 340M | Wiki+Books, 16GB | Contextual word understanding
GPT-2 | 2019 | 1.5B | WebText, 40GB | Fluent paragraph generation (release withheld)
GPT-3 | 2020 | 175B | 570GB | In-context learning, arithmetic, translation (zero-shot!)
PaLM | 2022 | 540B | 780GB | Chain-of-thought, joke explanation
ChatGPT | 2022.11 | ? | +RLHF | Mass adoption — 1M users in 5 days
GPT-4 | 2023 | ~1.8T (est.) | ~13T tokens | Top 10% on bar exam, passes medical boards
Claude Opus 4.6 | 2026.2 | Undisclosed | Undisclosed | 1M context, adaptive thinking, best-in-class coding (Claude Code)
Gemini 3 Pro | 2025.11 | ~1.5T (MoE) | Undisclosed | Native multimodal, Sparse MoE, 2M context
GPT-5.4 | 2026.3 | Undisclosed | Undisclosed | 1M context, native computer use, Thinking mode

65M → 1.8T: a 27,000× increase in parameters in 6 years

Anatomy of Three Giants — Gemini · ChatGPT · Claude

The three model families leading the AI frontier all grew from the same root, but evolved in different directions. Let's examine exactly what they share and where they diverge.

🤝 What They Share — The Same DNA

All three are built on the Transformer architecture born in "Attention Is All You Need" (2017). The core components — token embeddings, self-attention, feed-forward networks, residual connections, layer normalization — are exactly what you saw in §00–§06. The training objective is the same: next-token prediction. All use human feedback (RLHF or variants) for alignment.

🔵 Gemini (Google DeepMind) · 🟢 ChatGPT / GPT series (OpenAI) · 🟠 Claude (Anthropic)

Core Structure
· Gemini: Transformer-based, Sparse MoE — ~1.5T params, ~200B active per token · Deep Think mode
· ChatGPT: Decoder-only Transformer, Dense / MoE (unconfirmed) — GPT-5 series: Thinking mode (reasoning tokens) · architecture undisclosed
· Claude: Decoder-only Transformer, Dense — all parameters active for every token · Adaptive Thinking

Multimodal
· Gemini: Natively multimodal — text, image, audio, video unified training · image generation · robotics
· ChatGPT: Text + image + native computer use — GPT-5.4: image input, code execution, native UI control
· Claude: Text-first + vision input — image/PDF understanding, code execution, file creation · no image generation

Context Window
· Gemini: Up to 2M tokens — Gemini 3 Pro (Nov 2025) · MoE + ultra-long context
· ChatGPT: 1M tokens — GPT-5.4 (Mar 2026) · API
· Claude: 1M tokens — Opus 4.6 / Sonnet 4.6 (Feb 2026)

Alignment
· Gemini: RLHF + safety filters — based on Google's AI Principles
· ChatGPT: RLHF + reasoning oversight — reward model + monitoring of Thinking-mode reasoning chains
· Claude: Constitutional AI (CAI) — AI evaluates AI: principle-based self-improvement + RLHF

Training Hardware
· Gemini: Google TPU v5e/v6 — custom chips · own datacenters · Trillium
· ChatGPT: NVIDIA GPU + custom chips — Azure supercomputer (Microsoft partnership)
· Claude: NVIDIA / custom GPU — AWS Bedrock · GCP Vertex AI

Key Strength
· Gemini: Multimodal integration, ultra-long context, Google Search/services, TPU efficiency
· ChatGPT: Ecosystem (Codex, plugins), computer use, first-mover advantage
· Claude: Coding (Claude Code), long-context precision, extended thinking, safety
🔑 The Key Structural Difference: Dense vs. Mixture of Experts

The most fundamental architectural difference is whether all parameters are always used, or only a subset is selectively activated.

DENSE (Claude confirmed / GPT presumed)

All parameters participate in every token's computation. Simple, but expensive — model size = compute cost. GPT-4 was reported to use MoE, but the GPT-5 series architecture is undisclosed.

SPARSE MoE (Gemini)

The FFN layers split into multiple "experts," and a router assigns each token to only a few. Out of 1.5T total parameters, ~200B activate per token. Huge total capacity, small compute cost — capacity decoupled from cost.
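A minimal sketch of the routing step in Python. The gate matrix is hand-made; real routers are learned and trained with extra load-balancing terms:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(token, gate, k=2):
    """Score each expert for this token, keep only the top-k."""
    logits = [sum(t * w for t, w in zip(token, row)) for row in gate]
    probs = softmax(logits)
    top_k = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    return top_k, probs

# Four experts, 2-d tokens, hand-made gate rows: only 2 of 4 experts run per token.
gate = [[5.0, 0.0], [0.0, 5.0], [2.0, 2.0], [-5.0, -5.0]]
experts, probs = route([1.0, 0.0], gate, k=2)
print(experts)  # [0, 2] -- experts 1 and 3 stay idle for this token
```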

But remember — self-attention, residual connections, layer normalization, next-token prediction —
the core mathematics is identical in all three.

"The point is not that the Transformer was a 'better model.'
The point is that the Transformer was a scalable model."

No matter how large you made an RNN, the sequential bottleneck capped training speed. Transformers got faster in proportion to GPUs added. That's the whole story.

What the Transformer enabled wasn't just better performance — it was the discovery of scaling laws: increase model size, data, and compute, and loss decreases along a predictable power law. No ceiling in sight.
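Numerically, the parameter-scaling fit from Kaplan et al. (2020) looks like this (using their reported constants; the exact numbers matter less than the constant-ratio behavior):

```python
# Power-law fit of loss vs. parameter count: L(N) = (N_c / N) ** alpha_N,
# with N_c ~ 8.8e13 and alpha_N ~ 0.076 as reported by Kaplan et al. (2020),
# assuming data and compute are not the bottleneck.
N_C, ALPHA_N = 8.8e13, 0.076

def predicted_loss(n_params):
    return (N_C / n_params) ** ALPHA_N

for n in (1e8, 1e9, 1e10, 1e11, 1e12):
    print(f"{n:.0e} params -> loss {predicted_loss(n):.3f}")
# Each 10x in parameters multiplies the loss by the same constant factor (~0.84).
```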

2017. Eight researchers. A 15-page paper. That's where the AI revolution you're living through began. And everything in that paper — you just saw it all, from §00 through §06.

08 — THE ILLUSION OF UNDERSTANDING

The Illusion of Understanding

Here is the deepest surprise. The entire model — all the attention, all the embeddings, all the layers — is trained with one deceptively simple objective: given all previous tokens, predict the next one. Not "understand the text." Not "learn grammar." Just: what word comes next? And yet, from this statistical relay, something that looks exactly like understanding emerges.

A "ding" plays each time a token is sampled
The meaning of life is
Top candidates — probability distribution
τ = 1.0
Temperature
1.0
Entropy
—
Tokens Generated
0
P(x_i | x_<i) = softmax( W_out · h_i / τ )

The statistical parrot — or something more? — Critics call LLMs "stochastic parrots": they merely predict probable next words without true understanding. And technically, this is correct — the loss function is just cross-entropy between predicted and actual next token. No grammar rules, no semantic annotations. Just "predict what comes next." But when this relay becomes sufficiently precise, over trillions of tokens, something uncanny happens: the machine begins to reason, to analogize, to explain. Is this understanding? Or the most sophisticated illusion of understanding ever created? That question remains open.
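The formula above, τ slider included, as a small Python sketch; the candidate logits are invented:

```python
import math, random

def temperature_softmax(logits, tau):
    # P(i) is proportional to exp(logit_i / tau): low tau sharpens, high tau flattens.
    scaled = [l / tau for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [3.1, 2.2, 1.0, 0.3]            # invented scores for 4 candidate tokens

cold = temperature_softmax(logits, 0.1)   # nearly one-hot: almost deterministic
hot = temperature_softmax(logits, 10.0)   # nearly uniform: adventurous sampling
print([round(p, 3) for p in cold])
print([round(p, 3) for p in hot])

# Generation is just: sample one index from the distribution, append, repeat.
random.seed(0)
next_token = random.choices(range(len(logits)), weights=hot)[0]
```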

09 — THE MYSTERY

When Prediction Becomes "Understanding"

As models scale — more parameters, more data, more compute — something unsettling happens. Abilities appear that were completely absent in smaller models: arithmetic, translation, reasoning. Nobody programmed these. They emerged from next-token prediction alone. This is when the question "does a machine understand?" becomes genuinely hard to answer.

A rumble builds as the model grows — with chime accents at emergence thresholds
Emergent abilities vs. model scale
1M params
Parameters
1M
Abilities Unlocked
0 / 6
Loss
4.2

"Nobody told the model to learn arithmetic, or translation, or reasoning. These abilities emerged from a single objective: predict the next word."

This is perhaps the most profound fact about modern AI. A model trained only to predict text learns to do mathematics, write code, reason about physics, and translate between languages it was never explicitly taught. The mechanism by which this happens is not fully understood.

10 — THE CONNECTION

This Is All Linear Algebra

Mathematics | Transformer
🔢 Matrix multiplication | Every layer's core operation
📐 Inner product, cosine similarity | Attention scores between tokens
📊 Softmax = normalized exponential | Probability from raw scores
🔄 Iterated function composition | Stacking Transformer blocks
📉 Gradient descent on cross-entropy | The entire training algorithm
✨ High-dimensional geometry | Emergent representations

The Timeline

1943
McCulloch–Pitts
Artificial neuron
1986
Rumelhart et al.
Backpropagation
2013
Mikolov
Word2Vec
2017
Vaswani et al.
"Attention Is All
You Need"
2020–
GPT-3 β†’ GPT-4
Scaling & emergence

So β€” Does a Machine
Understand?

It computes dot products where we feel intuition.
It navigates vector spaces where we hold memories.
It predicts the next word where we grasp meaning.

Perhaps it does not understand.
Perhaps it has found something stranger β€”
a mathematical shadow of understanding
that works just as well.

Every simulation on this page is computed in real time β€” just linear algebra. The same linear algebra that powers every conversation you have with AI.

edu.kimsh.kr