πŸ‘οΈ

Does a Machine See?

An interactive exploration of images and filters in real time.

A Mathematical Inquiry into AI Β· Part VI

Does a Machine
See?

Or does it merely compute patterns in a grid of numbers?

You open your eyes and see the world β€” instantly, effortlessly, without conscious thought. To a machine, an image is a grid of numbers. Each pixel is a value from 0 to 255. Finding a "cat" in that array of numbers β€” that is the mathematics of Convolutional Neural Networks.

00 β€” WHAT A MACHINE SEES

An Image Is Numbers

When you see a photo, your brain interprets patterns of light. When a machine sees the same photo, all it sees is a matrix of numbers β€” each cell a brightness value from 0 (black) to 255 (white).

Interactive demo (16 Γ— 16 image): hover over the image to compare what you see with what the machine sees β€” each pixel's position and brightness value.
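The grid-of-numbers idea can be sketched in a few lines of NumPy (the 4Γ—4 "diagonal" image below is a made-up example, not the one on this page):

```python
import numpy as np

# A grayscale image is just a 2D array of brightness values,
# from 0 (black) to 255 (white).
image = np.array([
    [255,  20,  20,  20],
    [ 20, 255,  20,  20],
    [ 20,  20, 255,  20],
    [ 20,  20,  20, 255],
], dtype=np.uint8)

print(image.shape)   # (4, 4): rows x columns of pixels
print(image[0, 0])   # 255: brightness of the top-left pixel
```

That is the machine's entire visual experience: a shape and a range of integers.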

The meaning of "seeing" is entirely different. β€” Human vision grasps meaning intuitively. To a machine, everything is a 2D array of numbers.

01 β€” THE CORE OPERATION

Convolution: How a Machine's Eye Moves

A small filter (kernel) slides over the image. At each position, it computes the sum of element-wise products between the filter and the image patch.

Interactive demo: a 3Γ—3 kernel (horizontal edge detector) slides over the input β€” input β†’ kernel β†’ output β€” and the tone changes with activation intensity at each position.
(f * g)[i, j] = Ξ£β‚˜ Ξ£β‚™ f[m, n] Β· g[iβˆ’m, jβˆ’n] β€” The dot product extended to 2D!

Convolution is the 2D extension of the dot product. β€” Just as the dot product in Part II measured similarity between two vectors, convolution measures similarity between an image patch and a filter template.
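The sliding-window computation can be sketched directly in NumPy. One hedge: like most deep-learning libraries, this sketch skips the kernel flip of textbook convolution, so strictly speaking it computes cross-correlation β€” for learned filters the distinction doesn't matter.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode sliding window: at each position, the sum of
    element-wise products between the kernel and the image patch
    (a 2D dot product)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# Horizontal-edge kernel: responds where brightness changes top to bottom.
kernel = np.array([[ 1,  1,  1],
                   [ 0,  0,  0],
                   [-1, -1, -1]], dtype=float)

# Toy image: bright top half, dark bottom half.
image = np.vstack([np.full((3, 5), 255.0), np.full((3, 5), 0.0)])
out = conv2d(image, kernel)
print(out)  # large values only in the rows straddling the boundary
```

Where the patch matches the filter's template (bright above, dark below), the response is large; where the patch is uniform, it is zero β€” exactly the "similarity" reading of the dot product.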

02 β€” THE MACHINE'S VISUAL CORTEX

Feature Maps: Deeper and Deeper

First layers find edges, next layers textures, deeper layers object parts, final layers whole concepts. This hierarchy is strikingly similar to the human visual cortex (V1β†’V2β†’V4β†’IT).

Interactive demo: explore the CNN visual hierarchy (edges β†’ textures β†’ parts β†’ objects) β€” each layer shows what it detects and how many filters it uses.
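One reason deeper layers can detect larger structures is the growing receptive field: each stacked 3Γ—3 convolution lets an output pixel "see" two more input pixels per side. A minimal sketch, assuming stride-1 layers throughout:

```python
def receptive_field(num_layers, kernel=3):
    """Receptive field of an output pixel after stacking
    stride-1 convolutions with a square kernel."""
    rf = 1
    for _ in range(num_layers):
        rf += kernel - 1  # each layer widens the view by (kernel - 1)
    return rf

# One layer sees a 3x3 patch (an edge); ten layers see 21x21 (an object part).
print([receptive_field(n) for n in (1, 2, 5, 10)])  # [3, 5, 11, 21]
```

Pooling layers accelerate this growth further, which is how the final layers come to respond to whole objects.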

03 β€” THE ART OF COMPRESSION

Pooling: Keeping Only What Matters

Take only the maximum from each 2Γ—2 region, halving the size. This is the secret to recognizing a cat wherever it is β€” translation invariance.

Interactive demo: max pooling takes the maximum of each 2Γ—2 region, so an 8Γ—8 input becomes a 4Γ—4 output β€” a 4Γ— compression.
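Max pooling is a one-liner with a reshape trick β€” a minimal NumPy sketch for inputs with even side lengths:

```python
import numpy as np

def max_pool_2x2(x):
    """Maximum of each non-overlapping 2x2 region, halving each side."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.arange(16).reshape(4, 4)
print(max_pool_2x2(x))  # 4x4 input -> 2x2 output: [[5, 7], [13, 15]]
```

Because only the strongest response in each region survives, shifting a feature by a pixel often leaves the pooled output unchanged β€” the source of the translation invariance described above.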

04 β€” DRAW IT YOURSELF

What Does a CNN See in Your Drawing?

Draw freely below. On the right, a CNN edge detection filter runs in real time.

Draw here
Edges the CNN sees
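The page doesn't specify which edge filter it runs; a standard choice is the pair of Sobel kernels, combined into a gradient magnitude. A sketch, using a bright square on a dark background as the "drawing":

```python
import numpy as np

def cross_correlate(img, k):
    """Valid-mode sliding-window filter response."""
    kh, kw = k.shape
    H, W = img.shape
    return np.array([[np.sum(img[i:i + kh, j:j + kw] * k)
                      for j in range(W - kw + 1)]
                     for i in range(H - kh + 1)])

# Sobel kernels: horizontal and vertical brightness gradients.
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

# "Drawing": a bright square on a dark background.
img = np.zeros((8, 8))
img[2:6, 2:6] = 255.0

gx = cross_correlate(img, sobel_x)
gy = cross_correlate(img, sobel_y)
edges = np.hypot(gx, gy)  # gradient magnitude
print((edges > 0).astype(int))  # the uniform interior of the square stays 0
```

The filter only fires where brightness changes β€” the square's border β€” which is all "edges" means to the machine.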

05 β€” THE REVOLUTION

The ImageNet Moment

In 2012, AlexNet won ImageNet by a crushing margin. Then VGGNet, GoogLeNet, and ResNet followed β€” eventually surpassing even the estimated human error rate (β‰ˆ5%) on the benchmark.

CNN β†’ Vision Transformer

LeNet-5 β†’ AlexNet β†’ ResNet

1998–2015. The convolution era. Deeper, wider β€” until residual connections made 152-layer networks trainable.

ViT (2020) β†’ Present

Attention only, no convolutions. The Transformer from Part II applied to vision. Challenging CNN's throne.

06 β€” THE CONNECTION

Human Vision vs. Machine Vision

Human vs. CNN
πŸ‘οΈ Sensor β€” Human: 130 million photoreceptors Β· CNN: pixels (RGB values)
🧠 Processing β€” Human: V1 β†’ V2 β†’ V4 β†’ IT Β· CNN: Conv β†’ Pool β†’ Conv β†’ ...
πŸ” Attention β€” Human: fovea + peripheral vision Β· CNN: kernel sliding
πŸ’« Illusions β€” Human: optical illusions, pareidolia Β· CNN: adversarial patches, DeepDream
πŸ‘Ά Learning β€” Human: generalizes from few exposures Β· CNN: needs millions of images

So β€” Does a Machine
See?

A machine's eye holds no sense of beauty.
No awe at a sunset, no warmth in recognizing a loved one's face.
Only grids of numbers, sliding kernels, and stacked feature maps.

Perhaps that is not seeing.
But machines find tumors, classify stars, and recognize your face β€”
sometimes more accurately than we can.

← Part V: Speak
edu.kimsh.kr