💾

Does a Machine Remember?

An interactive exploration with real-time visualizations and audio.
Sound enriches the experience.

Set volume to a comfortable level

A Mathematical Inquiry into AI · Part III

Does a Machine
Remember?

Or does it read everything from scratch, every single time?

You remember a conversation from ten years ago. An AI doesn't "remember" one from five minutes ago — it rereads everything from the beginning, every time. It has no scent of childhood summers, no name that surfaces unbidden from the past. Its memory is vectors, matrices, and cosine similarity. This is the mathematics of that strange kind of remembering.

← Part II: Understanding

00 — A MEMORY TEST

How Many Can You Remember?

Fifteen words will appear one at a time, then vanish. Count how many you can recall. Then we'll compare your memory to how a machine "remembers."

Each word arrives with a tone
Ready?

All or nothing — the nature of machine memory. — Human memory is selective and emotional. What matters lingers; what doesn't fades gradually. A machine is the opposite: everything inside the context window is remembered perfectly, and everything outside it ceases to exist. There is no gradual forgetting. Only a cliff.

01 — THE MACHINE'S WORKING MEMORY

The Context Window: How a Machine Remembers

An LLM's "memory" is its context window — a single, enormous array of tokens. Every word of your conversation lives here, and the model computes attention over the entire array. Anything beyond the window's edge simply does not exist.

Drag the slider to feel the context window
Context window — outside = nonexistent
30 tokens
Window size
30
Visible tokens
30
Lost tokens
0
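The cliff can be expressed in one line of array slicing. A minimal sketch, with a made-up token stream and window size chosen to match the widget above:

```python
# Tokens beyond the window simply vanish; there is no "faded" copy.
def visible_context(tokens, window_size):
    """Return only the most recent `window_size` tokens; the rest are gone."""
    return tokens[-window_size:]

conversation = [f"tok{i}" for i in range(100)]  # 100 tokens so far
window = visible_context(conversation, 30)

print(len(window))   # 30 visible tokens
print(window[0])     # "tok70": everything earlier is unrecoverable
```

Everything before `tok70` is not summarized or dimmed; it is simply absent from what the model sees.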

A History of Explosive Growth

2018
GPT-1
512 tokens
~1 page
2020
GPT-3
4K tokens
~6 pages
2023
GPT-4
128K tokens
~1 book
2024
Claude 3
200K tokens
~2 novels
2025–
Claude 4.6
Gemini 3
1M–2M tokens
~encyclopedia

02 — WHY MEMORY IS EXPENSIVE

The O(n²) Barrier

In self-attention, every token looks at every other token. With n tokens, that's n × n = n² pairs to compute. Double the tokens, quadruple the cost. This is the fundamental reason why expanding context windows is so hard.

The sound deepens as the grid grows
n × n attention grid — computation vs. token count
n = 8
Tokens (n)
8
Operations (n²)
64
Cost multiplier
1×
Cost ∝ n² — n = 1,000 → 1,000,000 ops  |  n = 100,000 → 10,000,000,000 ops

This is why 4K tokens was the limit until 2023. — GPU memory and compute time both scale as n², so a 10× larger context window costs 100× more. Breakthroughs like FlashAttention and Ring Attention have cracked this barrier in practice — not by changing the math, but by rethinking how hardware executes it.
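The quadratic growth is easy to check numerically. A toy sketch of the score matrix behind self-attention, using random matrices and illustrative sizes:

```python
import numpy as np

# Every token attends to every other, so the score matrix Q·Kᵀ has
# n × n entries. Sizes and values here are illustrative, not a real model's.
def attention_score_entries(n, d=64):
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((n, d))   # one query vector per token
    K = rng.standard_normal((n, d))   # one key vector per token
    scores = Q @ K.T                  # shape (n, n): the n² pairs
    return scores.size

for n in (8, 16, 32):
    print(n, attention_score_entries(n))  # doubling n quadruples the count
```

At n = 8 there are 64 entries, matching the widget above; each doubling of n multiplies the count by four.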

03 — EFFICIENT MEMORY

The KV Cache: Never Re-read What You've Already Seen

Recomputing everything from scratch each time a new token arrives would be absurdly wasteful. The KV cache stores the Keys and Values of all previous tokens, so only the new token's Query needs to be computed against the existing cache.

Clear tone for cache hits, heavy tone for recomputation
Per-token computation: no cache vs. KV cache
Tokens
0
Without cache
0
With KV cache
0

"Instead of rereading the entire book from page one, you just glance at your underlines."

The KV cache is the single most important optimization in LLM inference. But it comes at a cost — the cache itself consumes GPU memory. At 1M tokens, a KV cache can eat tens of gigabytes.
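A minimal sketch of the idea, with illustrative shapes; a real model caches per layer and per head, and scales the scores before the softmax:

```python
import numpy as np

# Each new token appends one key/value row; only its own query is computed
# against the cache. Nothing from earlier tokens is ever recomputed.
class KVCache:
    def __init__(self, d):
        self.K = np.empty((0, d))
        self.V = np.empty((0, d))

    def step(self, k_new, v_new, q_new):
        self.K = np.vstack([self.K, k_new])   # cache this token's key
        self.V = np.vstack([self.V, v_new])   # cache this token's value
        w = np.exp(q_new @ self.K.T)          # unnormalized attention weights
        w /= w.sum()
        return w @ self.V                     # context vector for the new token

d = 4
cache = KVCache(d)
for t in range(5):
    k = v = q = np.ones((1, d)) * 0.1        # toy per-token vectors
    out = cache.step(k, v, q)
print(cache.K.shape)  # (5, 4): five tokens cached, none recomputed
```

The memory cost is visible in the shape: the cache grows one row per token, which is exactly why long contexts eat GPU memory.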

04 — THE MACHINE'S LIBRARY

Vector Search: Retrieving Lost Memories

No context window can hold all the world's knowledge. So the machine needs a way to retrieve relevant information from an external store when it's needed. The technique: convert text into vectors, then find the nearest match by cosine similarity. The same dot product from Part II — it's everywhere.

A chime when a memory is retrieved
Searching memories in vector space — click to choose a query
Query
About cats
Retrieved
3
Top similarity
0.92
cos(q⃗, d⃗) = q⃗ · d⃗ / (‖q⃗‖ ‖d⃗‖) — the same cosine similarity from Part II!

05 — RETRIEVAL-AUGMENTED GENERATION

RAG: The Art of Faking a Perfect Memory

RAG is the backbone of modern AI in production. ChatGPT's web search, Claude's document analysis, every enterprise AI chatbot — all run on this principle. The machine doesn't remember everything. It retrieves the right thing at the right moment, and the illusion of memory is complete.

A rising tone at each pipeline stage
Stage
Ready
Docs retrieved
0
Context usage
0%
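The pipeline can be sketched end to end. Both `retrieve` (naive word overlap standing in for the vector search above) and the `llm` callable are hypothetical stand-ins, not a real API:

```python
# RAG in three steps: retrieve relevant documents, build an augmented
# prompt, and hand it to the model. Store contents are illustrative.
def retrieve(query, store, k=3):
    def score(doc):  # crude relevance: shared lowercase words
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(store, key=score, reverse=True)[:k]

def rag_answer(query, store, llm):
    docs = retrieve(query, store)
    prompt = "Answer using only these documents:\n"
    prompt += "\n".join(f"- {d}" for d in docs)
    prompt += f"\n\nQuestion: {query}"
    return llm(prompt)

store = [
    "The context window is the model's working memory.",
    "KV caches store keys and values of previous tokens.",
    "Paris is the capital of France.",
]
echo = lambda p: p  # stand-in "model" that just returns its prompt
answer = rag_answer("What is the context window?", store, echo)
print("context window" in answer)  # True: retrieval injected the fact
```

The model never "knew" the answer; the retriever placed it inside the context window just in time.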

"A machine doesn't recall. It retrieves."

Humans recall — imperfectly, colored by emotion, reshaped by time. Machines retrieve — mathematically precise, indifferent to feeling, bounded only by the quality of the vectors. This distinction is not a technicality. It is the fundamental difference between living memory and its computational shadow.

06 — THE ART OF FORGETTING

Compaction and Summarization

What happens when a conversation outgrows the context window? Three strategies. Just as you don't remember every word of yesterday's conversation — only its essence — machines can be taught to do the same.

Compare each strategy's effect by ear
Three strategies when context overflows
Strategy
Truncate
Info preserved
100%
Efficiency
—
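Two of the strategies, sketched on plain strings. The summarization step here is a crude truncation stand-in; a real system would ask the model itself for the summary:

```python
# Strategy 1: truncate. Keep only the most recent messages that fit.
def truncate(messages, budget):
    kept, used = [], 0
    for m in reversed(messages):
        if used + len(m) > budget:
            break
        kept.append(m)
        used += len(m)
    return list(reversed(kept))

# Strategy 2: compact. Summarize the old half, keep the recent half verbatim.
def compact(messages, budget):
    half = len(messages) // 2
    summary = "SUMMARY: " + " / ".join(m[:10] for m in messages[:half])
    return [summary] + truncate(messages[half:], budget - len(summary))

history = [f"message number {i} with some content" for i in range(10)]
print(len(truncate(history, 100)))   # only the last few messages survive
print(compact(history, 200)[0][:8])  # the compacted view leads with a summary
```

Truncation preserves nothing of the old turns; compaction keeps a lossy essence of them, trading fidelity for room.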

07 — FALSE MEMORIES

Hallucination: When Memory Fabricates

A machine confidently states something that isn't in its memory at all — hallucination. This is not a bug. It is an intrinsic property of probabilistic memory. The machine was trained to produce plausible answers, not to say "I don't know."

Question
0 / 5
Correct
0

Three sources of hallucination. — (1) Extrapolation beyond training data — sampling from the tail of the probability distribution. (2) Corrupted context — retrieval errors in RAG inject wrong information. (3) The nature of probabilistic generation — "the most probable next word" is not always the true one. RAG and grounding reduce hallucinations, but cannot eliminate them entirely.
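A toy picture of source (3): sampling the next word from an invented probability distribution, where plausible falsehoods win a large fraction of draws:

```python
import random

# Invented next-word distribution for "The capital of Australia is ...".
# The true answer has the highest probability, but not all of it.
next_word = {
    "Canberra":  0.55,  # true
    "Sydney":    0.35,  # plausible, false
    "Melbourne": 0.10,  # plausible, false
}

def sample(dist, rng):
    words, probs = zip(*dist.items())
    return rng.choices(words, weights=probs)[0]

rng = random.Random(7)
draws = [sample(next_word, rng) for _ in range(1000)]
wrong = sum(d != "Canberra" for d in draws) / 1000
print(wrong)  # close to 0.45: confidently wrong nearly half the time
```

Each individual answer is fluent and confident; wrongness only shows up in the statistics, which is what makes hallucination so hard to spot.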

08 — THE FUTURE OF MEMORY

Toward Infinite Context

Context window expansion — history and future

Beyond O(n²) — New Paradigms

FlashAttention

Same attention math, optimized memory access patterns. An IO-aware algorithm that delivers 3–5× real-world speedup.

Ring Attention

Long sequences distributed across GPUs. Each computes its chunk's KV and passes results in a ring. Effectively infinite scaling.

Mamba / SSM

Abandons attention entirely for State Space Models. Linear O(n) scaling. Dramatic efficiency gains on long sequences.

Infini-Attention

Combines local attention with compressed memory. A hybrid that processes infinite input with finite memory.
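The contrast with attention shows in a minimal linear state-space recurrence. This is a sketch with random matrices standing in for trained parameters, not actual Mamba, whose dynamics are input-dependent and selective:

```python
import numpy as np

# Linear state-space recurrence: h_t = A·h_{t-1} + B·x_t.
# One fixed-size state update per token → O(n) total work,
# versus O(n²) pairwise comparisons for attention.
rng = np.random.default_rng(0)
d_state, d_in, n = 8, 4, 1000

A = np.eye(d_state) * 0.9                 # decaying memory of the past
B = rng.standard_normal((d_state, d_in)) * 0.1

h = np.zeros(d_state)
for x in rng.standard_normal((n, d_in)):  # one pass over the sequence
    h = A @ h + B @ x                     # state size never grows with n

print(h.shape)  # (8,): constant-size memory after 1000 tokens
```

The entire history is compressed into a fixed-size state `h`, which is both the source of the efficiency and the reason such models can forget details that attention would retain exactly.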

08b — THE STATE OF THE ART

Who Remembers How? — The 2026 Landscape

Memory is no longer a research curiosity — it is the competitive frontier. Every major AI lab has a distinct memory strategy, from architectural innovations deep inside the model to product-level features users interact with daily.

Deep Architecture — How the Model Itself Remembers

Architecture | Key Idea | Who Uses It | Status
FlashAttention 3 | IO-aware exact attention; same math, 3–5× faster via GPU memory hierarchy optimization | Nearly universal — Anthropic, OpenAI, Meta, Google, Mistral | Production standard
Ring Attention | Distributes long sequences across GPU rings; near-linear scaling for million-token contexts | Google (Gemini), Anthropic (Claude long-context) | Production
Titans (Google, 2025) | Neural long-term memory module inside the attention layer; learns to memorize at test time | Google DeepMind | Research
Memory Layers at Scale (Meta, 2024) | Replaces some FFN layers with sparse, trillion-parameter key-value memory; factual recall without model size blowup | Meta (FAIR) | Research
Mamba / SSM | Replaces attention entirely with State Space Models; O(n) linear scaling, hardware-aware | AI21 (Jamba), Mistral (hybrid), research | Emerging production
Infini-Attention (Google, 2024) | Compressive memory + local attention; processes infinite input with bounded memory | Google | Research
Managed-Retention Memory (Microsoft, 2025) | Hardware-level memory class co-designed for AI KV cache: fast, non-volatile, wear-leveled | Microsoft Research | Hardware R&D

Product Memory — How Users Experience "Remembering"

Product | Memory Approach | Context Window | Key Feature
Claude (Anthropic) | Compaction + cross-session memory + user edits | 1M tokens | Auto-compaction for infinite chats; memory derived from conversation history
ChatGPT (OpenAI) | Persistent memory + web search RAG | 1M tokens | Explicit memory items; user can view/delete; Projects with instructions
Gemini (Google) | Long context + Google ecosystem RAG | 2M tokens | Largest native window; Gems with persistent instructions
Copilot (Microsoft) | RAG over Microsoft 365 Graph | 128K tokens | Enterprise memory via SharePoint, OneDrive, Teams indexing
Grok (xAI) | Real-time X/Twitter RAG | 128K tokens | Live social media as external memory

Memory Middleware — The New Infrastructure Layer

Mem0

Dedicated memory layer for AI agents. Extracts, stores, retrieves "memories" as structured entities. Used by 1000+ startups.

Zep

Temporal episodic memory — structures interactions as meaningful sequences rather than flat logs. Low-latency, production-ready.

Letta (MemGPT)

OS-inspired: agents manage their own memory via explicit read/write/edit operations. Virtual context for stateful agents.

The emerging consensus (2026): The best memory isn't a single technique — it's a hierarchy. Short-term working memory (context window) + medium-term session memory (compaction / summarization) + long-term persistent memory (vector stores, learned weights) + external retrieval (RAG). Every major AI system now combines at least three of these layers. The frontier is learning when to write, retrieve, and forget — treating memory operations as learnable actions via reinforcement learning (A-MEM, AgeMem).
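The hierarchy reads naturally as code. This is a sketch under stated assumptions; the class, its names, and its policies are illustrative, not any product's real design:

```python
# Three of the layers: a bounded working window, a compacted summary of
# what fell out of it, and an explicit long-term store with retrieval.
class HierarchicalMemory:
    def __init__(self, window=4):
        self.window = window
        self.working = []    # short-term: the context window
        self.summary = ""    # medium-term: compacted history
        self.long_term = {}  # long-term: key → fact store

    def observe(self, message):
        self.working.append(message)
        if len(self.working) > self.window:
            oldest = self.working.pop(0)         # falls off the cliff...
            self.summary += oldest[:12] + " … "  # ...unless compacted first

    def remember(self, key, fact):
        self.long_term[key] = fact               # explicit write

    def context(self, query=None):
        retrieved = self.long_term.get(query, "")  # external retrieval
        return [self.summary, retrieved] + self.working

mem = HierarchicalMemory()
mem.remember("user_name", "Alice")
for i in range(6):
    mem.observe(f"turn {i}: something happened")
print(len(mem.working))             # 4: the window stays bounded
print(mem.context("user_name")[1])  # Alice: retrieved, not recalled
```

Each layer fails differently: the window forgets totally, the summary forgets lossily, and the store forgets nothing but only answers when asked the right key.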

09 — THE CONNECTION

Human Memory vs. Machine Memory

Human | Machine
🧠 Working memory — 7 ± 2 items | Context window — 1M–2M tokens
💾 Long-term memory — hippocampus → cortex | Learned weights (parameters)
🔍 Recall — association, emotion, context | Vector similarity (cosine)
💨 Forgetting — selective, gradual | Total — outside window = gone
👻 False memory — distortion | Hallucination
😴 Consolidation — during sleep | Compaction — automatic summarization

So — Does a Machine
Remember?

A machine has no scent of childhood summers.
No name that rises, unbidden, from the past.
Its memory is vectors, matrices, cosine similarity.

Perhaps that is not memory at all.
But it accomplishes what memory does —
and it does so astonishingly well.

Every simulation on this page is computed in real time — pure linear algebra and probability. That's all there is to what machines call "memory."

Does a Machine Remember? — From Context Windows to Vector Search

Every visualization is computed in real time — pure mathematics.

Sang-hyun Kim
Korea Institute for Advanced Study
kimsh.kr
edu.kimsh.kr