Neural Network Architecture Evolution

IME 775 — Mathematical Foundations of Deep Learning · Chapter 7

Stage 1: The Single Perceptron

The simplest neural network: a single computational unit that computes a weighted sum of inputs, adds a bias, and applies a step activation function.

y = θ(w₀x₀ + w₁x₁ + b)

where θ(z) = 1 if z ≥ 0, else 0 (Heaviside step function)
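As a sketch, the unit above can be written directly in Python (the function names heaviside and perceptron are ours, not from any library):

```python
def heaviside(z):
    """theta(z) = 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def perceptron(x0, x1, w0, w1, b):
    """y = theta(w0*x0 + w1*x1 + b)."""
    return heaviside(w0 * x0 + w1 * x1 + b)

# With w0 = w1 = 1 and b = -1.5 the unit fires only when both
# inputs are 1 (the AND gate, previewed in Stage 2):
print(perceptron(1, 1, 1.0, 1.0, -1.5))  # 1
print(perceptron(0, 1, 1.0, 1.0, -1.5))  # 0
```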

Perceptron Diagram

Decision Boundary

Stage 2: Logic Gates with Perceptrons

A single perceptron can implement linearly separable Boolean functions.

Perceptron Weights

Decision Boundary
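One set of weights that realizes these separable gates is sketched below (the values are a common illustrative choice and may differ from those in the figure):

```python
def heaviside(z):
    return 1 if z >= 0 else 0

def perceptron(x, w, b):
    return heaviside(sum(wi * xi for wi, xi in zip(w, x)) + b)

# One common choice of weights (illustrative):
AND  = dict(w=(1.0, 1.0), b=-1.5)    # fires only when x0 = x1 = 1
OR   = dict(w=(1.0, 1.0), b=-0.5)    # fires when at least one input is 1
NAND = dict(w=(-1.0, -1.0), b=1.5)   # negation of AND

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(x, **AND), perceptron(x, **OR), perceptron(x, **NAND))
```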

Stage 3: The XOR Problem

Minsky & Papert (1969) proved that XOR cannot be solved by a single perceptron. This limitation nearly killed the field of neural networks for over a decade.

XOR Truth Table

x₀  x₁ | XOR
 0   0 |  0
 0   1 |  1
 1   0 |  1
 1   1 |  0

XOR is NOT linearly separable.
No single hyperplane (line in 2D) can separate the class-0 and class-1 points. Any line will misclassify at least one point.

Impossible Separation

The dashed lines show attempted decision boundaries — every attempt fails to correctly separate all four points.
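A brute-force sanity check makes the failure concrete: scanning a finite grid of weights and biases finds no single perceptron that reproduces XOR. (This is only a spot check on a grid; the actual impossibility proof is Minsky & Papert's.)

```python
import itertools

def heaviside(z):
    return 1 if z >= 0 else 0

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

# Try every (w0, w1, b) on a coarse grid from -4.0 to 4.0 in steps of 0.5.
grid = [i / 2 for i in range(-8, 9)]
solutions = [
    (w0, w1, b)
    for w0, w1, b in itertools.product(grid, repeat=3)
    if all(heaviside(w0 * x0 + w1 * x1 + b) == y
           for (x0, x1), y in XOR.items())
]
print(len(solutions))  # 0: no single perceptron reproduces XOR
```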

Stage 4: Multi-Layer Perceptron Solves XOR

By adding a hidden layer with 2 neurons, we can solve XOR. The hidden layer transforms the input space so that the classes become linearly separable.

MLP Architecture (2→2→1)

Step-Through Computation

The forward pass can be traced by hand for each of the four input pairs; the network reproduces the XOR truth table.
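One concrete 2→2→1 solution, with hand-set weights (one valid choice of many, not necessarily the one in the figure): the hidden units compute OR and NAND, and the output unit ANDs them.

```python
def heaviside(z):
    return 1 if z >= 0 else 0

def xor_mlp(x0, x1):
    """2->2->1 network with hand-set weights (one valid choice of many)."""
    h0 = heaviside(x0 + x1 - 0.5)    # hidden unit 0: OR
    h1 = heaviside(-x0 - x1 + 1.5)   # hidden unit 1: NAND
    return heaviside(h0 + h1 - 1.5)  # output unit: AND of h0, h1

for x0, x1 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x0, x1), "->", xor_mlp(x0, x1))  # 0, 1, 1, 0
```

The hidden layer maps the four inputs to points ((h0, h1)) that a single line can separate, which is exactly the transformation described above.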

Stage 5: Cybenko's Universal Approximation

Cybenko (1989) proved that a neural network with a single hidden layer and sigmoid activations can approximate any continuous function on a compact set to arbitrary precision.

Building a Tower from Step Functions

θ(x + s) — step up (jumps from 0 to 1 at x = −s)

θ(−x + s) — step down (jumps from 1 to 0 at x = s)

Tower = step up + step down − 1 (equals 1 on [−s, s] and 0 elsewhere)

Approximating a Function with Towers

More towers → better approximation of the target function (shown in orange).
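The construction can be sketched in code. The tower helper below generalizes the symmetric θ(x + s) + θ(−x + s) − 1 form to an arbitrary interval [left, right], and approx sums scaled towers to approximate a target function (function names are ours, for illustration):

```python
import math

def heaviside(z):
    return 1 if z >= 0 else 0

def tower(x, left, right):
    """1 on [left, right], 0 elsewhere: a step up at `left`
    plus a step down at `right`, minus 1 (as in the text)."""
    return heaviside(x - left) + heaviside(right - x) - 1

def approx(f, x, n=50, lo=0.0, hi=1.0):
    """Approximate f(x) on [lo, hi] by a sum of n scaled towers,
    each sampling f at its bin midpoint."""
    width = (hi - lo) / n
    total = 0.0
    for i in range(n):
        left = lo + i * width
        total += f(left + width / 2) * tower(x, left, left + width)
    return total

# More towers -> closer to the target (here, sin on [0, 2]);
# the two printed values agree to about three decimal places:
print(approx(math.sin, 1.234, n=200, lo=0.0, hi=2.0), math.sin(1.234))
```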

Stage 6: The Rise of Deep Networks

While a single hidden layer is theoretically sufficient (Cybenko), depth-separation results show that for certain function families a deep network can represent, with modestly many parameters, functions that any shallow network needs exponentially many units to match. Depth also enables hierarchical feature learning.

Deep Network Architecture
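A quick parameter count makes the budget comparison concrete (the layer sizes are illustrative; the exponential advantage itself is a depth-separation theorem, not something this count proves):

```python
def param_count(layer_sizes):
    """Total weights + biases in a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

deep = param_count([64, 128, 128, 128, 10])   # three hidden layers of 128
wide = param_count([64, 568, 10])             # one hidden layer of 568
print(deep, wide)  # 42634 42610 -- roughly equal budgets
```

At a roughly equal parameter budget, the deep network composes three stages of features where the shallow one gets a single, very wide stage.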

Historical Timeline

1958 · Perceptron (Rosenblatt)
1969 · Perceptrons (Minsky & Papert)
1986 · Backpropagation (Rumelhart et al.)
1989 · Universal Approximation (Cybenko)
2012 · AlexNet (Krizhevsky et al.)
2017 · Transformers (Vaswani et al.)
Key Insight: Depth beats width. Each layer learns increasingly abstract representations, enabling hierarchical feature extraction that shallow networks cannot efficiently replicate.