👁️

Does a Machine See?

An interactive exploration of images and filters in real time.

A Mathematical Inquiry into AI · Part VI

Does a Machine See?

Or does it merely compute patterns in a grid of numbers?

You open your eyes and see the world: instantly, effortlessly, without conscious thought. To a machine, an image is a grid of numbers. In an 8-bit grayscale image, each pixel is a value from 0 to 255. Finding a "cat" in that array of numbers is the mathematics of Convolutional Neural Networks (CNNs).

00 — WHAT A MACHINE SEES

An Image Is Numbers

When you see a photo, your brain interprets patterns of light. When a machine sees the same photo, all it sees is a matrix of numbers — each cell a brightness value from 0 (black) to 255 (white).

[Interactive demo: hover over a 16 × 16 image to compare what you see with what the machine sees, with a readout of pixel position and brightness.]

The meaning of "seeing" is entirely different. Human vision grasps meaning intuitively; to a machine, everything is a 2D array of numbers.
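To make the "grid of numbers" concrete, here is a minimal Python sketch. The 4×4 pixel values and the helper names `brightness_at` and `render_ascii` are made up for illustration and are not taken from the page's demo.

```python
# A tiny 4x4 grayscale "image": each entry is a brightness
# from 0 (black) to 255 (white). The values are illustrative only.
image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
]


def brightness_at(img, row, col):
    """Return the number stored at (row, col), like hovering over a pixel."""
    return img[row][col]


def render_ascii(img):
    """Map each brightness to a character so a human can 'see' the numbers."""
    shades = " .:-=+*#%@"  # darkest to brightest, 10 levels
    lines = []
    for row in img:
        chars = [shades[value * (len(shades) - 1) // 255] for value in row]
        lines.append("".join(chars))
    return "\n".join(lines)


height, width = len(image), len(image[0])
print(f"Resolution: {width} x {height}")
print("Brightness at (0, 2):", brightness_at(image, 0, 2))
print(render_ascii(image))
```

The machine only ever works with the nested list; the ASCII rendering exists purely for the human reader.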

01 — THE CORE OPERATION

Convolution: How a Machine's Eye Moves

A small filter (kernel) slides over the image. At each position, it computes the sum of element-wise products between the filter and the image patch.

[Interactive demo: a 3×3 kernel (horizontal edge) slides across the input and produces one output value per position; a tone changes with activation intensity as the kernel moves.]
(f * g)[i, j] = Σ_m Σ_n f[m, n] · g[i - m, j - n]
The dot product, extended to 2D!

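Convolution is the 2D extension of the dot product. Just as the dot product in Part II measured similarity between two vectors, convolution measures similarity between an image patch and a filter template: the output is large where the patch resembles the filter. (Strictly, the formula flips the kernel before multiplying; most CNN implementations skip the flip and compute cross-correlation, which makes no practical difference for learned filters.)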
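The sliding sum of element-wise products can be sketched in a few lines of plain Python. This is a minimal illustration, not the page's implementation: the function names, the 5×5 test image, and the specific horizontal-edge kernel values are assumptions, and the sketch uses no padding and a stride of 1.

```python
def correlate2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    Each output value is the sum of element-wise products between the
    kernel and the image patch under it. The kernel is not flipped,
    so strictly speaking this is cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for m in range(kh):
                for n in range(kw):
                    total += kernel[m][n] * image[i + m][j + n]
            row.append(total)
        output.append(row)
    return output


def convolve2d(image, kernel):
    """Textbook convolution: flip the kernel in both axes, then correlate."""
    flipped = [row[::-1] for row in kernel[::-1]]
    return correlate2d(image, flipped)


# Assumed horizontal-edge kernel: responds to dark-above, bright-below.
horizontal_edge = [
    [-1, -1, -1],
    [0, 0, 0],
    [1, 1, 1],
]

# Illustrative 5x5 image: two dark rows on top, three bright rows below.
image = [[0] * 5, [0] * 5, [255] * 5, [255] * 5, [255] * 5]

for row in correlate2d(image, horizontal_edge):
    print(row)
```

The output is large where a patch straddles the dark-to-bright boundary and zero where the patch is uniformly bright, which is the "similarity to a template" idea in action. With `convolve2d` the kernel is flipped, so the same edge produces values of the opposite sign.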

02 — THE MACHINE'S VISUAL CORTEX

Feature Maps: Deeper and Deeper

First layers find edges, next layers textures, deeper layers object parts, final layers whole concepts. This hierarchy is strikingly similar to the human visual cortex (V1→V2→V4→IT).

[Interactive demo: the CNN visual hierarchy, from edges to textures to parts to objects, showing what the current layer detects and its filters.]

03 — THE ART OF COMPRESSION

Pooling: Keeping Only What Matters

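Take only the maximum from each 2×2 region, halving the width and height. Because a small shift in the input often leaves each region's maximum unchanged, pooling makes the network tolerant of exactly where a feature appears. Combined with sliding the same kernel over every position, this is the secret to recognizing a cat wherever it is: translation invariance.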

[Interactive demo: 2×2 max pooling keeps the maximum of each region, compressing an 8×8 input to a 4×4 output.]
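A minimal sketch of 2×2 max pooling in plain Python follows. The function name `max_pool_2x2` and the sample feature maps are illustrative assumptions; the sketch uses non-overlapping blocks (stride 2) and drops a trailing odd row or column.

```python
def max_pool_2x2(grid):
    """Keep the maximum of each non-overlapping 2x2 block (stride 2)."""
    pooled = []
    for i in range(0, len(grid) - 1, 2):
        row = []
        for j in range(0, len(grid[0]) - 1, 2):
            block = (grid[i][j], grid[i][j + 1],
                     grid[i + 1][j], grid[i + 1][j + 1])
            row.append(max(block))
        pooled.append(row)
    return pooled


# Illustrative 4x4 feature map with one strong activation (9).
feature_map = [
    [0, 0, 0, 0],
    [0, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 3],
]

# The same map with the strong activation shifted by one pixel,
# still inside the same 2x2 block.
shifted_map = [
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 3],
]

print(max_pool_2x2(feature_map))
print(max_pool_2x2(shifted_map))

# An 8x8 input shrinks to 4x4: a quarter of the original values remain.
grid8 = [[r * 8 + c for c in range(8)] for r in range(8)]
pooled8 = max_pool_2x2(grid8)
print(len(pooled8), "x", len(pooled8[0]))
```

Both feature maps pool to the same result, which shows the tolerance to small shifts. A shift that moves the activation into a neighboring block would change the output, so the invariance is only local.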

04 — DRAW IT YOURSELF

What Does a CNN See in Your Drawing?

Draw freely below. On the right, a CNN edge detection filter runs in real time.

[Interactive demo: a drawing canvas ("Draw here") beside the filter output ("Edges the CNN sees").]
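The page does not say which filter the demo runs, so the sketch below assumes a common choice: the two Sobel kernels, combined into a gradient magnitude and clamped to the 0 to 255 range. The function names and the 5×5 "drawing" are hypothetical.

```python
import math

# Sobel kernels (an assumption; the demo's actual filter is not stated).
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # responds to vertical edges
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # responds to horizontal edges


def apply_3x3(image, kernel, i, j):
    """Sum of element-wise products for the 3x3 patch whose corner is (i, j)."""
    return sum(kernel[m][n] * image[i + m][j + n]
               for m in range(3) for n in range(3))


def edge_map(image):
    """Return the gradient magnitude at each valid position, clamped to 255."""
    result = []
    for i in range(len(image) - 2):
        row = []
        for j in range(len(image[0]) - 2):
            gx = apply_3x3(image, SOBEL_X, i, j)
            gy = apply_3x3(image, SOBEL_Y, i, j)
            row.append(min(255, int(math.hypot(gx, gy))))
        result.append(row)
    return result


# Hypothetical 5x5 canvas with a single vertical stroke down the middle.
canvas = [[0, 0, 255, 0, 0] for _ in range(5)]

for row in edge_map(canvas):
    print(row)
```

The filter responds on both sides of the stroke, where brightness changes, and gives zero when centered on the stroke itself, because the brightness to its left and right is the same there.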

05 — THE REVOLUTION

The ImageNet Moment

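In 2012, AlexNet won the ImageNet challenge (ILSVRC) by a crushing margin, with a top-5 error of roughly 15% against roughly 26% for the runner-up. VGGNet, GoogLeNet, and ResNet followed, and by 2015 ResNet's top-5 error (about 3.6%) had fallen below the estimated human error rate of about 5%.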

CNN → Vision Transformer

LeNet-5 → AlexNet → ResNet

1998–2015. The convolution era. Deeper, wider. Residual connections made networks as deep as 152 layers trainable.

ViT (2020) → Present

Attention only, no convolutions. The Transformer from Part II applied to vision. Challenging CNN's throne.

06 — THE CONNECTION

Human Vision vs. Machine Vision

Aspect | Human | CNN
👁️ Sensor | 130 million photoreceptors | Pixels (RGB values)
🧠 Processing | V1→V2→V4→IT | Conv→Pool→Conv→...
🔍 Attention | Fovea + peripheral | Kernel sliding
💫 Illusions | Optical illusions, pareidolia | Adversarial patches, DeepDream
👶 Learning | Generalizes from few exposures | Needs millions of images

So, Does a Machine See?

A machine's eye holds no sense of beauty.
No awe at a sunset, no warmth in recognizing a loved one's face.
Only grids of numbers, sliding kernels, and stacked feature maps.

Perhaps that is not seeing.
But machines find tumors, classify stars, and recognize your face —
sometimes more accurately than we can.

edu.kimsh.kr