๐ŸŽต

Does a Machine Speak?

This page is an interactive exploration where sound is essential.
Please turn your volume on!

Set volume to a comfortable level

A Mathematical Inquiry into AI ยท Part V

Does a Machine
Speak?

Or does it merely compute the patterns of vibration and fake a voice?

Your voice is a miracle of vibrating vocal cords, a dancing tongue, and the breath of your lungs. A machine's "voice" is a sum of sine waves, a spectrogram pattern, and a neural network's prediction. Same sound, entirely different origins. Let's listen to the mathematics.

00 โ€” WHAT IS SOUND

Vibrations in Air

Sound is pressure changes in air โ€” fast vibrations make high pitch, slow vibrations make low pitch. All sound is ultimately a wave. Adjust the sliders below and hear it for yourself.

Move the sliders to hear sound in real time
Sine wave โ€” the purest possible sound
[Sliders — Frequency: 440 Hz · Amplitude: 50% · Readout — Note: A4 · Wavelength: 0.78 m]
f(t) = A ยท sin(2ฯ€ ยท freq ยท t) โ€” This single equation is a pure tone

01 โ€” THE ANATOMY OF SOUND

The Fourier Transform: Decomposing Sound

In 1822, Joseph Fourier showed something revolutionary: any periodic waveform can be decomposed into a sum of simple sine waves. A piano's middle C is not one frequency — it's the sum of a fundamental frequency and dozens of harmonics.

Add or remove harmonics and hear the timbre change
Harmonic synthesis โ€” adding sine waves to build complex sound
[Fundamental: 220 Hz (A3) · Active harmonics: 1 · Timbre: pure]
f(t) = ฮฃ aโ‚™ ยท sin(2ฯ€ ยท n ยท fโ‚€ ยท t) โ€” Every sound = a sum of sine waves

Why the Fourier Transform matters for AI. — Speech recognition, music generation, TTS — the first step of every audio AI is decomposing sound into frequency components, transforming a complex time-domain waveform into a clean frequency-domain spectrum. This is how a machine "sees" sound.
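The decomposition also runs in reverse: NumPy's FFT recovers frequency components from a mixed signal. A sketch, assuming a signal built from two known tones:

```python
import numpy as np

sr = 8_000              # phone-call sampling rate
t = np.arange(sr) / sr  # exactly one second of time points
# A "complex waveform": two known sine components mixed together
signal = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)

spectrum = np.abs(np.fft.rfft(signal))        # magnitude per frequency bin
freqs = np.fft.rfftfreq(len(signal), 1 / sr)  # bin index -> frequency in Hz

# The two strongest bins recover the hidden components: 440 Hz and 880 Hz
top_two = sorted(freqs[np.argsort(spectrum)[-2:]].tolist())
```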

02 โ€” THE FINGERPRINT OF SOUND

The Spectrogram: Seeing Sound

A spectrogram is a 3D map: time (horizontal) ร— frequency (vertical) ร— intensity (brightness). Even the same syllable looks different for every speaker and every emotion. AI reads these "photographs of sound."

Press piano keys to draw a real-time spectrogram
Real-time spectrogram โ€” play the keys

"Humans hear sound. Machines see it."

The mel spectrogram scales the frequency axis nonlinearly to match human hearing: narrow bins for low frequencies, wide bins for high — fine resolution where our ears discriminate pitch best. This is the standard input format for speech AI.
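The mel mapping itself is a single formula; the HTK-style convention mel = 2595 · log10(1 + f/700) is one common choice. A small Python/NumPy sketch showing the nonlinearity: the same 100 Hz span covers many mels at the low end but few at the high end.

```python
import numpy as np

def hz_to_mel(f_hz):
    """HTK-style mel scale: mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

# 100 Hz near the bottom of the range spans far more mels (finer
# resolution) than 100 Hz near the top (coarser resolution):
low_span = hz_to_mel(200.0) - hz_to_mel(100.0)     # ~133 mel
high_span = hz_to_mel(8100.0) - hz_to_mel(8000.0)  # ~13 mel
```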

03 โ€” VOICE AS NUMBERS

From Analog to Digital

To handle continuous air vibrations, a computer must sample them โ€” measuring amplitude tens of thousands of times per second and converting it to an array of numbers. CD quality is 44,100 Hz. Phone calls: 8,000 Hz.

Lower the sampling rate to hear quality degrade
Sampling: continuous wave โ†’ discrete points
[Samples: 60 · Info lost: 4%]
Nyquist theorem: f_s โ‰ฅ 2ยทf_max โ€” sample at 2ร— the highest frequency to reconstruct the original
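Violating the Nyquist bound produces aliasing, and it can be shown exactly. A Python/NumPy sketch: a 3 kHz tone sampled at only 4 kHz (below 2 × 3 kHz) yields precisely the same samples as a phase-flipped 1 kHz tone, so the two are indistinguishable after sampling.

```python
import numpy as np

def sample_tone(freq_hz, sr, duration_s=1.0):
    """Measure sin(2*pi*freq*t) at sr evenly spaced points per second."""
    t = np.arange(int(sr * duration_s)) / sr
    return np.sin(2 * np.pi * freq_hz * t)

# At t = k/4000, sin(2*pi*3000*t) equals -sin(2*pi*1000*t) exactly,
# so the undersampled 3 kHz tone collapses onto a 1 kHz alias.
aliased = sample_tone(3000, 4000)
alias_target = -sample_tone(1000, 4000)
```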

04 โ€” HOW A MACHINE SPEAKS

TTS: From Text to Voice

Modern TTS works in three stages: text analysis โ†’ spectrogram generation โ†’ waveform reconstruction. Each stage uses a neural network.


The Evolution of TTS

Gen 1: Concatenation

Pre-recorded phoneme fragments stitched together. Robotic voice. Used in GPS navigation.

Gen 2: WaveNet (2016)

Google DeepMind. Predicts audio samples one at a time. First natural-sounding AI voice. But extremely slow.

Gen 3: Transformer TTS

VALL-E, Bark, ElevenLabs. Clone anyone with 3 seconds of audio. Real-time synthesis. Nearly indistinguishable.

05 โ€” VOICE CLONING

Voice Cloning: Three Seconds Is Enough

Modern voice cloning captures a speaker's characteristics from just 3 seconds of audio โ€” timbre, pace, intonation, even emotion. Technically, the speaker's voice is compressed into a speaker embedding vector and injected as a condition into the TTS model.

Each "speaker" produces a different waveform
Speaker embeddings: same text, different voices
[Fundamental: 110 Hz · Timbre: deep · Embedding dim: 256-d]
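A speaker embedding is, concretely, a fixed-length vector. A sketch with random stand-in vectors (no real voice encoder involved, the numbers are illustrative only): the same voice re-embedded from a new clip stays close in cosine similarity, while an unrelated voice does not.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(42)
speaker_a = rng.standard_normal(256)                       # "voice A" embedding
speaker_b = rng.standard_normal(256)                       # a different voice
rerecorded_a = speaker_a + 0.1 * rng.standard_normal(256)  # voice A, new clip

same = cosine(speaker_a, rerecorded_a)    # close to 1
different = cosine(speaker_a, speaker_b)  # close to 0
```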

06 โ€” MACHINE COMPOSITION

Generating Music

Music AI โ€” Suno, Udio, MusicGen โ€” creates music from text prompts. Internally, it converts audio to tokens (via codecs) and predicts the next token with a Transformer. Exactly the same principle from Part II, applied to sound.

Press a genre button to hear algorithmic composition
Algorithmic composition โ€” probabilistic melody

The key insight of music AI. โ€” Next-token prediction for text (Part II) and next-token prediction for music are mathematically identical. The only difference is what the tokens represent โ€” text AI predicts words; music AI predicts audio codec tokens. Same Transformer, different language.
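The shared core is one sampling loop. A minimal Python/NumPy sketch of next-token prediction: softmax over the model's logits, then draw a token id. Whether the ids name words or audio codec tokens never enters the loop.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """One step of next-token prediction: softmax over logits, draw an id.
    Identical whether the ids stand for words or audio codec tokens."""
    rng = rng if rng is not None else np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Toy 4-token codec vocabulary; the model strongly prefers token 2.
token = sample_next_token([0.1, 0.2, 3.0, 0.1], temperature=0.5,
                          rng=np.random.default_rng(0))
```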

07 โ€” THE CONNECTION

Human Voice vs. Machine Voice

🫁 Source — Human: vocal cord vibration + lung air · Machine: sine-wave synthesis / neural vocoder
👅 Articulation — Human: tongue, lips, oral cavity · Machine: filter coefficients / mel spectrogram
🧠 Language — Human: Broca's and Wernicke's areas · Machine: Transformer encoder
🎭 Emotion — Human: autonomic nervous system, hormones · Machine: prosody embedding vector
👶 Learning — Human: years of imitation and feedback · Machine: tens of thousands of hours of speech data
🎵 Singing — Human: emotion + musical training · Machine: audio codec token prediction

So โ€” Does a Machine
Speak?

A machine's voice has no breath.
No tremor, no crack, no laughter woven into words.
Only sums of sine waves, spectrogram patterns,
and astonishingly precise probabilistic predictions.

Perhaps that is not speaking.
But your ears can no longer
tell the difference.

โ† Part IV: Imagine
edu.kimsh.kr