💾

Does a Machine Remember?

An interactive exploration with real-time visualizations and audio.
Sound enriches the experience.

Set volume to a comfortable level

A Mathematical Inquiry into AI · Part III

Does a Machine
Remember?

Or does it read everything from scratch, every single time?

You remember a conversation from ten years ago. An AI doesn't "remember" one from five minutes ago — it rereads everything from the beginning, every time. It has no scent of childhood summers, no name that surfaces unbidden from the past. Its memory is vectors, matrices, and cosine similarity. This is the mathematics of that strange kind of remembering.

← Part II: Understanding

00 — A MEMORY TEST

How Many Can You Remember?

Fifteen words will appear one at a time, then vanish. Count how many you can recall. Then we'll compare your memory to how a machine "remembers."

Each word arrives with a tone
Ready?

All or nothing — the nature of machine memory. — Human memory is selective and emotional. What matters lingers; what doesn't fades gradually. A machine is the opposite: everything inside the context window is remembered perfectly, and everything outside it ceases to exist. There is no gradual forgetting. Only a cliff.

01 — THE MACHINE'S WORKING MEMORY

The Context Window: How a Machine Remembers

An LLM's "memory" is its context window — a single, enormous array of tokens. Every word of your conversation lives here, and the model computes attention over the entire array. Anything beyond the window's edge simply does not exist.

Drag the slider to feel the context window
Context window — outside = nonexistent
30 tokens
Window size
30
Visible tokens
30
Lost tokens
0
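The cliff can be expressed in one line of array slicing. A minimal sketch, with a made-up token stream and window size chosen to match the widget above:

```python
# Tokens beyond the window simply vanish; there is no "faded" copy.
def visible_context(tokens, window_size):
    """Return only the most recent `window_size` tokens; the rest are gone."""
    return tokens[-window_size:]

conversation = [f"tok{i}" for i in range(100)]  # 100 tokens so far
window = visible_context(conversation, 30)

print(len(window))   # 30 visible tokens
print(window[0])     # "tok70": everything earlier is unrecoverable
```

Everything before `tok70` is not summarized or dimmed; it is simply absent from what the model sees.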

A History of Explosive Growth

2018
GPT-1
512 tokens
~1 page
2020
GPT-3
4K tokens
~6 pages
2023
GPT-4
128K tokens
~1 book
2024
Claude 3
200K tokens
~2 novels
2025–
Claude 4.6
Gemini 3
1M–2M tokens
~encyclopedia

02 — WHY MEMORY IS EXPENSIVE

The O(n²) Barrier

In self-attention, every token looks at every other token. With n tokens, that's n × n = n² pairs to compute. Double the tokens, quadruple the cost. This is the fundamental reason why expanding context windows is so hard.

The sound deepens as the grid grows
n × n attention grid — computation vs. token count
n = 8
Tokens (n)
8
Operations (n²)
64
Cost multiplier
1×
Cost ∝ n² — n = 1,000 → 1,000,000 ops  |  n = 100,000 → 10,000,000,000 ops

This is why 4K tokens was the limit until 2023. — GPU memory and compute time both scale as n², so a 10× larger context window costs 100× more. Breakthroughs like FlashAttention and Ring Attention have cracked this barrier in practice — not by changing the math, but by rethinking how hardware executes it.
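The quadratic growth is easy to check numerically. A toy sketch of the score matrix behind self-attention, using random matrices and illustrative sizes:

```python
import numpy as np

# Every token attends to every other, so the score matrix Q·Kᵀ has
# n × n entries. Sizes and values here are illustrative, not a real model's.
def attention_score_entries(n, d=64):
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((n, d))   # one query vector per token
    K = rng.standard_normal((n, d))   # one key vector per token
    scores = Q @ K.T                  # shape (n, n): the n² pairs
    return scores.size

for n in (8, 16, 32):
    print(n, attention_score_entries(n))  # doubling n quadruples the count
```

At n = 8 there are 64 entries, matching the widget above; each doubling of n multiplies the count by four.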

03 — EFFICIENT MEMORY

The KV Cache: Never Re-read What You've Already Seen

Recomputing everything from scratch each time a new token arrives would be absurdly wasteful. The KV cache stores the Keys and Values of all previous tokens, so only the new token's Query needs to be computed against the existing cache.

Clear tone for cache hits, heavy tone for recomputation
Per-token computation: no cache vs. KV cache
Tokens
0
Without cache
0
With KV cache
0

"Instead of rereading the entire book from page one, you just glance at your underlines."

The KV cache is the single most important optimization in LLM inference. But it comes at a cost — the cache itself consumes GPU memory. At 1M tokens, a KV cache can eat tens of gigabytes.
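A minimal sketch of the idea, with illustrative shapes; a real model caches per layer and per head, and scales the scores before the softmax:

```python
import numpy as np

# Each new token appends one key/value row; only its own query is computed
# against the cache. Nothing from earlier tokens is ever recomputed.
class KVCache:
    def __init__(self, d):
        self.K = np.empty((0, d))
        self.V = np.empty((0, d))

    def step(self, k_new, v_new, q_new):
        self.K = np.vstack([self.K, k_new])   # cache this token's key
        self.V = np.vstack([self.V, v_new])   # cache this token's value
        w = np.exp(q_new @ self.K.T)          # unnormalized attention weights
        w /= w.sum()
        return w @ self.V                     # context vector for the new token

d = 4
cache = KVCache(d)
for t in range(5):
    k = v = q = np.ones((1, d)) * 0.1        # toy per-token vectors
    out = cache.step(k, v, q)
print(cache.K.shape)  # (5, 4): five tokens cached, none recomputed
```

The memory cost is visible in the shape: the cache grows one row per token, which is exactly why long contexts eat GPU memory.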

04 — THE MACHINE'S LIBRARY

Vector Search: Retrieving Lost Memories

No context window can hold all the world's knowledge. So the machine needs a way to retrieve relevant information from an external store when it's needed. The technique: convert text into vectors, then find the nearest match by cosine similarity. The same dot product from Part II — it's everywhere.

A chime when a memory is retrieved
Searching memories in vector space — click to choose a query
Query
About cats
Retrieved
3
Top similarity
0.92
cos(q⃗, d⃗) = q⃗ · d⃗ / (‖q⃗‖ ‖d⃗‖) — the same cosine similarity from Part II!

05 — RETRIEVAL-AUGMENTED GENERATION

RAG: The Art of Faking a Perfect Memory

RAG is the backbone of modern AI in production. ChatGPT's web search, Claude's document analysis, every enterprise AI chatbot — all run on this principle. The machine doesn't remember everything. It retrieves the right thing at the right moment, and the illusion of memory is complete.

A rising tone at each pipeline stage
Stage
Ready
Docs retrieved
0
Context usage
0%
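The pipeline can be sketched end to end. Both `retrieve` (naive word overlap standing in for the vector search above) and the `llm` callable are hypothetical stand-ins, not a real API:

```python
# RAG in three steps: retrieve relevant documents, build an augmented
# prompt, and hand it to the model. Store contents are illustrative.
def retrieve(query, store, k=3):
    def score(doc):  # crude relevance: shared lowercase words
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(store, key=score, reverse=True)[:k]

def rag_answer(query, store, llm):
    docs = retrieve(query, store)
    prompt = "Answer using only these documents:\n"
    prompt += "\n".join(f"- {d}" for d in docs)
    prompt += f"\n\nQuestion: {query}"
    return llm(prompt)

store = [
    "The context window is the model's working memory.",
    "KV caches store keys and values of previous tokens.",
    "Paris is the capital of France.",
]
echo = lambda p: p  # stand-in "model" that just returns its prompt
answer = rag_answer("What is the context window?", store, echo)
print("context window" in answer)  # True: retrieval injected the fact
```

The model never "knew" the answer; the retriever placed it inside the context window just in time.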

"A machine doesn't recall. It retrieves."

Humans recall — imperfectly, colored by emotion, reshaped by time. Machines retrieve — mathematically precise, indifferent to feeling, bounded only by the quality of the vectors. This distinction is not a technicality. It is the fundamental difference between living memory and its computational shadow.

06 — THE ART OF FORGETTING

Compaction and Summarization

What happens when a conversation outgrows the context window? Three strategies. Just as you don't remember every word of yesterday's conversation — only its essence — machines can be taught to do the same.

Compare each strategy's effect by ear
Three strategies when context overflows
Strategy
Truncate
Info preserved
100%
Efficiency
—
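Two of the strategies, sketched on plain strings. The summarization step here is a crude truncation stand-in; a real system would ask the model itself for the summary:

```python
# Strategy 1: truncate. Keep only the most recent messages that fit.
def truncate(messages, budget):
    kept, used = [], 0
    for m in reversed(messages):
        if used + len(m) > budget:
            break
        kept.append(m)
        used += len(m)
    return list(reversed(kept))

# Strategy 2: compact. Summarize the old half, keep the recent half verbatim.
def compact(messages, budget):
    half = len(messages) // 2
    summary = "SUMMARY: " + " / ".join(m[:10] for m in messages[:half])
    return [summary] + truncate(messages[half:], budget - len(summary))

history = [f"message number {i} with some content" for i in range(10)]
print(len(truncate(history, 100)))   # only the last few messages survive
print(compact(history, 200)[0][:8])  # the compacted view leads with a summary
```

Truncation preserves nothing of the old turns; compaction keeps a lossy essence of them, trading fidelity for room.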

07 — FALSE MEMORIES

Hallucination: When Memory Fabricates

A machine confidently states something that isn't in its memory at all — hallucination. This is not a bug. It is an intrinsic property of probabilistic memory. The machine was trained to produce plausible answers, not to say "I don't know."

Question
0 / 5
Correct
0

Three sources of hallucination. — (1) Extrapolation beyond training data — sampling from the tail of the probability distribution. (2) Corrupted context — retrieval errors in RAG inject wrong information. (3) The nature of probabilistic generation — "the most probable next word" is not always the true one. RAG and grounding reduce hallucinations, but cannot eliminate them entirely.
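A toy picture of source (3): sampling the next word from an invented probability distribution, where plausible falsehoods win a large fraction of draws:

```python
import random

# Invented next-word distribution for "The capital of Australia is ...".
# The true answer has the highest probability, but not all of it.
next_word = {
    "Canberra":  0.55,  # true
    "Sydney":    0.35,  # plausible, false
    "Melbourne": 0.10,  # plausible, false
}

def sample(dist, rng):
    words, probs = zip(*dist.items())
    return rng.choices(words, weights=probs)[0]

rng = random.Random(7)
draws = [sample(next_word, rng) for _ in range(1000)]
wrong = sum(d != "Canberra" for d in draws) / 1000
print(wrong)  # close to 0.45: confidently wrong nearly half the time
```

Each individual answer is fluent and confident; wrongness only shows up in the statistics, which is what makes hallucination so hard to spot.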

08 — THE FUTURE OF MEMORY

Toward Infinite Context

Context window expansion — history and future

Beyond O(n²) — New Paradigms

FlashAttention

Same attention math, optimized memory access patterns. An IO-aware algorithm that delivers 3–5× real-world speedup.

Ring Attention

Long sequences distributed across GPUs. Each computes its chunk's KV and passes results in a ring. Effectively infinite scaling.

Mamba / SSM

Abandons attention entirely for State Space Models. Linear O(n) scaling. Dramatic efficiency gains on long sequences.

Infini-Attention

Combines local attention with compressed memory. A hybrid that processes infinite input with finite memory.
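The contrast with attention shows in a minimal linear state-space recurrence. This is a sketch with random matrices standing in for trained parameters, not actual Mamba, whose dynamics are input-dependent and selective:

```python
import numpy as np

# Linear state-space recurrence: h_t = A·h_{t-1} + B·x_t.
# One fixed-size state update per token → O(n) total work,
# versus O(n²) pairwise comparisons for attention.
rng = np.random.default_rng(0)
d_state, d_in, n = 8, 4, 1000

A = np.eye(d_state) * 0.9                 # decaying memory of the past
B = rng.standard_normal((d_state, d_in)) * 0.1

h = np.zeros(d_state)
for x in rng.standard_normal((n, d_in)):  # one pass over the sequence
    h = A @ h + B @ x                     # state size never grows with n

print(h.shape)  # (8,): constant-size memory after 1000 tokens
```

The entire history is compressed into a fixed-size state `h`, which is both the source of the efficiency and the reason such models can forget details that attention would retain exactly.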

08b — THE STATE OF THE ART

Who Remembers How? — The 2026 Landscape

Memory is no longer a research curiosity — it is the competitive frontier. Every major AI lab has a distinct memory strategy, from architectural innovations deep inside the model to product-level features users interact with daily.

Deep Architecture — How the Model Itself Remembers

Architecture | Key Idea | Who Uses It | Status
FlashAttention 3 | IO-aware exact attention; same math, 3–5× faster via GPU memory hierarchy optimization | Nearly universal — Anthropic, OpenAI, Meta, Google, Mistral | Production standard
Ring Attention | Distributes long sequences across GPU rings; near-linear scaling for million-token contexts | Google (Gemini), Anthropic (Claude long-context) | Production
Titans (Google, 2025) | Neural long-term memory module inside the attention layer; learns to memorize at test time | Google DeepMind | Research
Memory Layers at Scale (Meta, 2024) | Replaces some FFN layers with sparse, trillion-parameter key-value memory; factual recall without model size blowup | Meta (FAIR) | Research
Mamba / SSM | Replaces attention entirely with State Space Models; O(n) linear scaling, hardware-aware | AI21 (Jamba), Mistral (hybrid), research | Emerging production
Infini-Attention (Google, 2024) | Compressive memory + local attention; processes infinite input with bounded memory | Google | Research
Managed-Retention Memory (Microsoft, 2025) | Hardware-level memory class co-designed for AI KV cache: fast, non-volatile, wear-leveled | Microsoft Research | Hardware R&D

Product Memory — How Users Experience "Remembering"

Product | Memory Approach | Context Window | Key Feature
Claude (Anthropic) | Compaction + cross-session memory + user edits | 1M tokens | Auto-compaction for infinite chats; memory derived from conversation history
ChatGPT (OpenAI) | Persistent memory + web search RAG | 1M tokens | Explicit memory items; user can view/delete; Projects with instructions
Gemini (Google) | Long context + Google ecosystem RAG | 2M tokens | Largest native window; Gems with persistent instructions
Copilot (Microsoft) | RAG over Microsoft 365 Graph | 128K tokens | Enterprise memory via SharePoint, OneDrive, Teams indexing
Grok (xAI) | Real-time X/Twitter RAG | 128K tokens | Live social media as external memory

Memory Middleware — The New Infrastructure Layer

Mem0

Dedicated memory layer for AI agents. Extracts, stores, retrieves "memories" as structured entities. Used by 1000+ startups.

Zep

Temporal episodic memory — structures interactions as meaningful sequences rather than flat logs. Low-latency, production-ready.

Letta (MemGPT)

OS-inspired: agents manage their own memory via explicit read/write/edit operations. Virtual context for stateful agents.

The emerging consensus (2026): The best memory isn't a single technique — it's a hierarchy. Short-term working memory (context window) + medium-term session memory (compaction / summarization) + long-term persistent memory (vector stores, learned weights) + external retrieval (RAG). Every major AI system now combines at least three of these layers. The frontier is learning when to write, retrieve, and forget — treating memory operations as learnable actions via reinforcement learning (A-MEM, AgeMem).
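The hierarchy reads naturally as code. This is a sketch under stated assumptions; the class, its names, and its policies are illustrative, not any product's real design:

```python
# Three of the layers: a bounded working window, a compacted summary of
# what fell out of it, and an explicit long-term store with retrieval.
class HierarchicalMemory:
    def __init__(self, window=4):
        self.window = window
        self.working = []    # short-term: the context window
        self.summary = ""    # medium-term: compacted history
        self.long_term = {}  # long-term: key → fact store

    def observe(self, message):
        self.working.append(message)
        if len(self.working) > self.window:
            oldest = self.working.pop(0)         # falls off the cliff...
            self.summary += oldest[:12] + " … "  # ...unless compacted first

    def remember(self, key, fact):
        self.long_term[key] = fact               # explicit write

    def context(self, query=None):
        retrieved = self.long_term.get(query, "")  # external retrieval
        return [self.summary, retrieved] + self.working

mem = HierarchicalMemory()
mem.remember("user_name", "Alice")
for i in range(6):
    mem.observe(f"turn {i}: something happened")
print(len(mem.working))             # 4: the window stays bounded
print(mem.context("user_name")[1])  # Alice: retrieved, not recalled
```

Each layer fails differently: the window forgets totally, the summary forgets lossily, and the store forgets nothing but only answers when asked the right key.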

09 — THE CONNECTION

Human Memory vs. Machine Memory

Human | Machine
🧠 Working memory — 7 ± 2 items | Context window — 1M–2M tokens
💾 Long-term memory — hippocampus → cortex | Learned weights (parameters)
🔍 Recall — association, emotion, context | Vector similarity (cosine)
💨 Forgetting — selective, gradual | Total — outside window = gone
👻 False memory — distortion | Hallucination
😴 Consolidation — during sleep | Compaction — automatic summarization

So — Does a Machine
Remember?

A machine has no scent of childhood summers.
No name that rises, unbidden, from the past.
Its memory is vectors, matrices, cosine similarity.

Perhaps that is not memory at all.
But it accomplishes what memory does —
and it does so astonishingly well.

Every simulation on this page is computed in real time — pure linear algebra and probability. That's all there is to what machines call "memory."

Does a Machine Remember? — From Context Windows to Vector Search

Every visualization is computed in real time — pure mathematics.

Sang-hyun Kim
Korea Institute for Advanced Study
kimsh.kr
edu.kimsh.kr