πŸ‘οΈ

Does a Machine See?

An interactive exploration of images and filters in real time.

A Mathematical Inquiry into AI Β· Part VI

Does a Machine
See?

Or does it merely compute patterns in a grid of numbers?

You open your eyes and see the world β€” instantly, effortlessly, without conscious thought. To a machine, an image is a grid of numbers. Each pixel is a value from 0 to 255. Finding a "cat" in that array of numbers β€” that is the mathematics of Convolutional Neural Networks.

00 β€” WHAT A MACHINE SEES

An Image Is Numbers

When you see a photo, your brain interprets patterns of light. When a machine sees the same photo, all it sees is a matrix of numbers β€” each cell a brightness value from 0 (black) to 255 (white).

Interactive demo (16 Γ— 16 image): hover over the image to compare what you see with what the machine sees β€” each pixel's position and brightness value.
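The grid-of-numbers idea can be sketched in a few lines of NumPy (the 4Γ—4 "diagonal" image below is a made-up example, not the one on this page):

```python
import numpy as np

# A grayscale image is just a 2D array of brightness values,
# from 0 (black) to 255 (white).
image = np.array([
    [255,  20,  20,  20],
    [ 20, 255,  20,  20],
    [ 20,  20, 255,  20],
    [ 20,  20,  20, 255],
], dtype=np.uint8)

print(image.shape)   # (4, 4): rows x columns of pixels
print(image[0, 0])   # 255: brightness of the top-left pixel
```

That is the machine's entire visual experience: a shape and a range of integers.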

The meaning of "seeing" is entirely different. β€” Human vision grasps meaning intuitively. To a machine, everything is a 2D array of numbers.

01 β€” THE CORE OPERATION

Convolution: How a Machine's Eye Moves

A small filter (kernel) slides over the image. At each position, it computes the sum of element-wise products between the filter and the image patch.

Interactive demo: a 3Γ—3 kernel (horizontal edge detector) slides over the input β€” input β†’ kernel β†’ output β€” and the tone changes with activation intensity at each position.
(f * g)[i, j] = Ξ£β‚˜ Ξ£β‚™ f[m, n] Β· g[iβˆ’m, jβˆ’n] β€” The dot product extended to 2D!

Convolution is the 2D extension of the dot product. β€” Just as the dot product in Part II measured similarity between two vectors, convolution measures similarity between an image patch and a filter template.
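The sliding-window computation can be sketched directly in NumPy. One hedge: like most deep-learning libraries, this sketch skips the kernel flip of textbook convolution, so strictly speaking it computes cross-correlation β€” for learned filters the distinction doesn't matter.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode sliding window: at each position, the sum of
    element-wise products between the kernel and the image patch
    (a 2D dot product)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# Horizontal-edge kernel: responds where brightness changes top to bottom.
kernel = np.array([[ 1,  1,  1],
                   [ 0,  0,  0],
                   [-1, -1, -1]], dtype=float)

# Toy image: bright top half, dark bottom half.
image = np.vstack([np.full((3, 5), 255.0), np.full((3, 5), 0.0)])
out = conv2d(image, kernel)
print(out)  # large values only in the rows straddling the boundary
```

Where the patch matches the filter's template (bright above, dark below), the response is large; where the patch is uniform, it is zero β€” exactly the "similarity" reading of the dot product.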

02 β€” THE MACHINE'S VISUAL CORTEX

Feature Maps: Deeper and Deeper

First layers find edges, next layers textures, deeper layers object parts, final layers whole concepts. This hierarchy is strikingly similar to the human visual cortex (V1β†’V2β†’V4β†’IT).

Interactive demo: explore the CNN visual hierarchy (edges β†’ textures β†’ parts β†’ objects) β€” each layer shows what it detects and how many filters it uses.
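One reason deeper layers can detect larger structures is the growing receptive field: each stacked 3Γ—3 convolution lets an output pixel "see" two more input pixels per side. A minimal sketch, assuming stride-1 layers throughout:

```python
def receptive_field(num_layers, kernel=3):
    """Receptive field of an output pixel after stacking
    stride-1 convolutions with a square kernel."""
    rf = 1
    for _ in range(num_layers):
        rf += kernel - 1  # each layer widens the view by (kernel - 1)
    return rf

# One layer sees a 3x3 patch (an edge); ten layers see 21x21 (an object part).
print([receptive_field(n) for n in (1, 2, 5, 10)])  # [3, 5, 11, 21]
```

Pooling layers accelerate this growth further, which is how the final layers come to respond to whole objects.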

03 β€” THE ART OF COMPRESSION

Pooling: Keeping Only What Matters

Take only the maximum from each 2Γ—2 region, halving the size. This is the secret to recognizing a cat wherever it is β€” translation invariance.

Interactive demo: max pooling takes the maximum of each 2Γ—2 region, so an 8Γ—8 input becomes a 4Γ—4 output β€” a 4Γ— compression.
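Max pooling is a one-liner with a reshape trick β€” a minimal NumPy sketch for inputs with even side lengths:

```python
import numpy as np

def max_pool_2x2(x):
    """Maximum of each non-overlapping 2x2 region, halving each side."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.arange(16).reshape(4, 4)
print(max_pool_2x2(x))  # 4x4 input -> 2x2 output: [[5, 7], [13, 15]]
```

Because only the strongest response in each region survives, shifting a feature by a pixel often leaves the pooled output unchanged β€” the source of the translation invariance described above.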

04 β€” DRAW IT YOURSELF

What Does a CNN See in Your Drawing?

Draw freely below. On the right, a CNN edge detection filter runs in real time.

Draw here
Edges the CNN sees
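The page doesn't specify which edge filter it runs; a standard choice is the pair of Sobel kernels, combined into a gradient magnitude. A sketch, using a bright square on a dark background as the "drawing":

```python
import numpy as np

def cross_correlate(img, k):
    """Valid-mode sliding-window filter response."""
    kh, kw = k.shape
    H, W = img.shape
    return np.array([[np.sum(img[i:i + kh, j:j + kw] * k)
                      for j in range(W - kw + 1)]
                     for i in range(H - kh + 1)])

# Sobel kernels: horizontal and vertical brightness gradients.
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

# "Drawing": a bright square on a dark background.
img = np.zeros((8, 8))
img[2:6, 2:6] = 255.0

gx = cross_correlate(img, sobel_x)
gy = cross_correlate(img, sobel_y)
edges = np.hypot(gx, gy)  # gradient magnitude
print((edges > 0).astype(int))  # the uniform interior of the square stays 0
```

The filter only fires where brightness changes β€” the square's border β€” which is all "edges" means to the machine.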

05 β€” THE REVOLUTION

The ImageNet Moment

In 2012, AlexNet won ImageNet by a crushing margin. Then VGGNet, GoogLeNet, and ResNet followed β€” eventually surpassing even the estimated human error rate (β‰ˆ5%) on the benchmark.

CNN β†’ Vision Transformer

LeNet-5 β†’ AlexNet β†’ ResNet

1998–2015. The convolution era. Deeper, wider β€” until residual connections made 152-layer networks trainable.

ViT (2020) β†’ Present

Attention only, no convolutions. The Transformer from Part II applied to vision. Challenging CNN's throne.

06 β€” THE CONNECTION

Human Vision vs. Machine Vision

Human vs. CNN
πŸ‘οΈ Sensor β€” Human: 130 million photoreceptors Β· CNN: pixels (RGB values)
🧠 Processing β€” Human: V1 β†’ V2 β†’ V4 β†’ IT Β· CNN: Conv β†’ Pool β†’ Conv β†’ ...
πŸ” Attention β€” Human: fovea + peripheral vision Β· CNN: kernel sliding
πŸ’« Illusions β€” Human: optical illusions, pareidolia Β· CNN: adversarial patches, DeepDream
πŸ‘Ά Learning β€” Human: generalizes from few exposures Β· CNN: needs millions of images

So β€” Does a Machine
See?

A machine's eye holds no sense of beauty.
No awe at a sunset, no warmth in recognizing a loved one's face.
Only grids of numbers, sliding kernels, and stacked feature maps.

Perhaps that is not seeing.
But machines find tumors, classify stars, and recognize your face β€”
sometimes more accurately than we can.

← Part V: Speak
edu.kimsh.kr