๐ŸŽต

Does a Machine Speak?

This page is an interactive exploration where sound is essential.
Please turn your volume on!

Set volume to a comfortable level

A Mathematical Inquiry into AI ยท Part V

Does a Machine
Speak?

Or does it merely compute the patterns of vibration and fake a voice?

Your voice is a miracle of vibrating vocal cords, a dancing tongue, and the breath of your lungs. A machine's "voice" is a sum of sine waves, a spectrogram pattern, and a neural network's prediction. Same sound, entirely different origins. Let's listen to the mathematics.

00 โ€” WHAT IS SOUND

Vibrations in Air

Sound is pressure changes in air โ€” fast vibrations make high pitch, slow vibrations make low pitch. All sound is ultimately a wave. Adjust the sliders below and hear it for yourself.

Move the sliders to hear sound in real time
Sine wave โ€” the purest possible sound
[Sliders — Frequency: 440 Hz · Amplitude: 50% · Readout — Note: A4 · Wavelength: 0.78 m]
f(t) = A ยท sin(2ฯ€ ยท freq ยท t) โ€” This single equation is a pure tone

01 โ€” THE ANATOMY OF SOUND

The Fourier Transform: Decomposing Sound

In 1822, Joseph Fourier showed something revolutionary: any periodic waveform can be decomposed into a sum of simple sine waves. A piano's middle C is not one frequency — it's the sum of a fundamental frequency and dozens of harmonics.

Add or remove harmonics and hear the timbre change
Harmonic synthesis โ€” adding sine waves to build complex sound
[Fundamental: 220 Hz (A3) · Active harmonics: 1 · Timbre: pure]
f(t) = ฮฃ aโ‚™ ยท sin(2ฯ€ ยท n ยท fโ‚€ ยท t) โ€” Every sound = a sum of sine waves

Why the Fourier Transform matters for AI. — Speech recognition, music generation, TTS — the first step of every audio AI is decomposing sound into frequency components, transforming a complex time-domain waveform into a clean frequency-domain spectrum. This is how a machine "sees" sound.
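The decomposition also runs in reverse: NumPy's FFT recovers frequency components from a mixed signal. A sketch, assuming a signal built from two known tones:

```python
import numpy as np

sr = 8_000              # phone-call sampling rate
t = np.arange(sr) / sr  # exactly one second of time points
# A "complex waveform": two known sine components mixed together
signal = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)

spectrum = np.abs(np.fft.rfft(signal))        # magnitude per frequency bin
freqs = np.fft.rfftfreq(len(signal), 1 / sr)  # bin index -> frequency in Hz

# The two strongest bins recover the hidden components: 440 Hz and 880 Hz
top_two = sorted(freqs[np.argsort(spectrum)[-2:]].tolist())
```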

02 โ€” THE FINGERPRINT OF SOUND

The Spectrogram: Seeing Sound

A spectrogram is a 3D map: time (horizontal) ร— frequency (vertical) ร— intensity (brightness). Even the same syllable looks different for every speaker and every emotion. AI reads these "photographs of sound."

Press piano keys to draw a real-time spectrogram
Real-time spectrogram โ€” play the keys

"Humans hear sound. Machines see it."

The mel spectrogram scales the frequency axis nonlinearly to match human hearing: narrow bins for low frequencies, wide bins for high — fine resolution where our ears discriminate pitch best. This is the standard input format for speech AI.
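The mel mapping itself is a single formula; the HTK-style convention mel = 2595 · log10(1 + f/700) is one common choice. A small Python/NumPy sketch showing the nonlinearity: the same 100 Hz span covers many mels at the low end but few at the high end.

```python
import numpy as np

def hz_to_mel(f_hz):
    """HTK-style mel scale: mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

# 100 Hz near the bottom of the range spans far more mels (finer
# resolution) than 100 Hz near the top (coarser resolution):
low_span = hz_to_mel(200.0) - hz_to_mel(100.0)     # ~133 mel
high_span = hz_to_mel(8100.0) - hz_to_mel(8000.0)  # ~13 mel
```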

03 โ€” VOICE AS NUMBERS

From Analog to Digital

To handle continuous air vibrations, a computer must sample them โ€” measuring amplitude tens of thousands of times per second and converting it to an array of numbers. CD quality is 44,100 Hz. Phone calls: 8,000 Hz.

Lower the sampling rate to hear quality degrade
Sampling: continuous wave โ†’ discrete points
[Samples: 60 · Info lost: 4%]
Nyquist theorem: f_s โ‰ฅ 2ยทf_max โ€” sample at 2ร— the highest frequency to reconstruct the original
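Violating the Nyquist bound produces aliasing, and it can be shown exactly. A Python/NumPy sketch: a 3 kHz tone sampled at only 4 kHz (below 2 × 3 kHz) yields precisely the same samples as a phase-flipped 1 kHz tone, so the two are indistinguishable after sampling.

```python
import numpy as np

def sample_tone(freq_hz, sr, duration_s=1.0):
    """Measure sin(2*pi*freq*t) at sr evenly spaced points per second."""
    t = np.arange(int(sr * duration_s)) / sr
    return np.sin(2 * np.pi * freq_hz * t)

# At t = k/4000, sin(2*pi*3000*t) equals -sin(2*pi*1000*t) exactly,
# so the undersampled 3 kHz tone collapses onto a 1 kHz alias.
aliased = sample_tone(3000, 4000)
alias_target = -sample_tone(1000, 4000)
```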

04 โ€” HOW A MACHINE SPEAKS

TTS: From Text to Voice

Modern TTS works in three stages: text analysis โ†’ spectrogram generation โ†’ waveform reconstruction. Each stage uses a neural network.


The Evolution of TTS

Gen 1: Concatenation

Pre-recorded phoneme fragments stitched together. Robotic voice. Used in GPS navigation.

Gen 2: WaveNet (2016)

Google DeepMind. Predicts audio samples one at a time. First natural-sounding AI voice. But extremely slow.

Gen 3: Transformer TTS

VALL-E, Bark, ElevenLabs. Clone anyone with 3 seconds of audio. Real-time synthesis. Nearly indistinguishable.

05 โ€” VOICE CLONING

Voice Cloning: Three Seconds Is Enough

Modern voice cloning captures a speaker's characteristics from just 3 seconds of audio โ€” timbre, pace, intonation, even emotion. Technically, the speaker's voice is compressed into a speaker embedding vector and injected as a condition into the TTS model.

Each "speaker" produces a different waveform
Speaker embeddings: same text, different voices
[Fundamental: 110 Hz · Timbre: deep · Embedding dim: 256-d]
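A speaker embedding is, concretely, a fixed-length vector. A sketch with random stand-in vectors (no real voice encoder involved, the numbers are illustrative only): the same voice re-embedded from a new clip stays close in cosine similarity, while an unrelated voice does not.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(42)
speaker_a = rng.standard_normal(256)                       # "voice A" embedding
speaker_b = rng.standard_normal(256)                       # a different voice
rerecorded_a = speaker_a + 0.1 * rng.standard_normal(256)  # voice A, new clip

same = cosine(speaker_a, rerecorded_a)    # close to 1
different = cosine(speaker_a, speaker_b)  # close to 0
```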

06 โ€” MACHINE COMPOSITION

Generating Music

Music AI โ€” Suno, Udio, MusicGen โ€” creates music from text prompts. Internally, it converts audio to tokens (via codecs) and predicts the next token with a Transformer. Exactly the same principle from Part II, applied to sound.

Press a genre button to hear algorithmic composition
Algorithmic composition โ€” probabilistic melody

The key insight of music AI. โ€” Next-token prediction for text (Part II) and next-token prediction for music are mathematically identical. The only difference is what the tokens represent โ€” text AI predicts words; music AI predicts audio codec tokens. Same Transformer, different language.
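The shared core is one sampling loop. A minimal Python/NumPy sketch of next-token prediction: softmax over the model's logits, then draw a token id. Whether the ids name words or audio codec tokens never enters the loop.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """One step of next-token prediction: softmax over logits, draw an id.
    Identical whether the ids stand for words or audio codec tokens."""
    rng = rng if rng is not None else np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Toy 4-token codec vocabulary; the model strongly prefers token 2.
token = sample_next_token([0.1, 0.2, 3.0, 0.1], temperature=0.5,
                          rng=np.random.default_rng(0))
```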

07 โ€” THE CONNECTION

Human Voice vs. Machine Voice

🫁 Source — Human: vocal cord vibration + lung air · Machine: sine-wave synthesis / neural vocoder
👅 Articulation — Human: tongue, lips, oral cavity · Machine: filter coefficients / mel spectrogram
🧠 Language — Human: Broca's and Wernicke's areas · Machine: Transformer encoder
🎭 Emotion — Human: autonomic nervous system, hormones · Machine: prosody embedding vector
👶 Learning — Human: years of imitation and feedback · Machine: tens of thousands of hours of speech data
🎵 Singing — Human: emotion + musical training · Machine: audio codec token prediction

So โ€” Does a Machine
Speak?

A machine's voice has no breath.
No tremor, no crack, no laughter woven into words.
Only sums of sine waves, spectrogram patterns,
and astonishingly precise probabilistic predictions.

Perhaps that is not speaking.
But your ears can no longer
tell the difference.

โ† Part IV: Imagine
edu.kimsh.kr