Neural Network Architecture Evolution

IME 775 — Mathematical Foundations of Deep Learning · Chapter 7

Stage 1: The Single Perceptron

The simplest neural network: a single computational unit that computes a weighted sum of inputs, adds a bias, and applies a step activation function.

y = θ(w₀x₀ + w₁x₁ + b)

where θ(z) = 1 if z ≥ 0, else 0 (Heaviside step function)
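As a sketch, the unit above can be written directly in Python (the function names heaviside and perceptron are ours, not from any library):

```python
def heaviside(z):
    """theta(z) = 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def perceptron(x0, x1, w0, w1, b):
    """y = theta(w0*x0 + w1*x1 + b)."""
    return heaviside(w0 * x0 + w1 * x1 + b)

# With w0 = w1 = 1 and b = -1.5 the unit fires only when both
# inputs are 1 (the AND gate, previewed in Stage 2):
print(perceptron(1, 1, 1.0, 1.0, -1.5))  # 1
print(perceptron(0, 1, 1.0, 1.0, -1.5))  # 0
```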

Perceptron Diagram

Decision Boundary

Stage 2: Logic Gates with Perceptrons

A single perceptron can implement linearly separable Boolean functions.

Perceptron Weights

Decision Boundary
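One set of weights that realizes these separable gates is sketched below (the values are a common illustrative choice and may differ from those in the figure):

```python
def heaviside(z):
    return 1 if z >= 0 else 0

def perceptron(x, w, b):
    return heaviside(sum(wi * xi for wi, xi in zip(w, x)) + b)

# One common choice of weights (illustrative):
AND  = dict(w=(1.0, 1.0), b=-1.5)    # fires only when x0 = x1 = 1
OR   = dict(w=(1.0, 1.0), b=-0.5)    # fires when at least one input is 1
NAND = dict(w=(-1.0, -1.0), b=1.5)   # negation of AND

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(x, **AND), perceptron(x, **OR), perceptron(x, **NAND))
```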

Stage 3: The XOR Problem

Minsky & Papert (1969) proved that XOR cannot be solved by a single perceptron. This limitation nearly killed the field of neural networks for over a decade.

XOR Truth Table

x₀  x₁ | XOR
 0   0 |  0
 0   1 |  1
 1   0 |  1
 1   1 |  0

XOR is NOT linearly separable.
No single hyperplane (line in 2D) can separate the class-0 and class-1 points. Any line will misclassify at least one point.

Impossible Separation

The dashed lines show attempted decision boundaries — every attempt fails to correctly separate all four points.
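A brute-force sanity check makes the failure concrete: scanning a finite grid of weights and biases finds no single perceptron that reproduces XOR. (This is only a spot check on a grid; the actual impossibility proof is Minsky & Papert's.)

```python
import itertools

def heaviside(z):
    return 1 if z >= 0 else 0

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

# Try every (w0, w1, b) on a coarse grid from -4.0 to 4.0 in steps of 0.5.
grid = [i / 2 for i in range(-8, 9)]
solutions = [
    (w0, w1, b)
    for w0, w1, b in itertools.product(grid, repeat=3)
    if all(heaviside(w0 * x0 + w1 * x1 + b) == y
           for (x0, x1), y in XOR.items())
]
print(len(solutions))  # 0: no single perceptron reproduces XOR
```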

Stage 4: Multi-Layer Perceptron Solves XOR

By adding a hidden layer with 2 neurons, we can solve XOR. The hidden layer transforms the input space so that the classes become linearly separable.

MLP Architecture (2→2→1)

Step-Through Computation

The forward pass can be traced by hand for each of the four input pairs; the network reproduces the XOR truth table.
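One concrete 2→2→1 solution, with hand-set weights (one valid choice of many, not necessarily the one in the figure): the hidden units compute OR and NAND, and the output unit ANDs them.

```python
def heaviside(z):
    return 1 if z >= 0 else 0

def xor_mlp(x0, x1):
    """2->2->1 network with hand-set weights (one valid choice of many)."""
    h0 = heaviside(x0 + x1 - 0.5)    # hidden unit 0: OR
    h1 = heaviside(-x0 - x1 + 1.5)   # hidden unit 1: NAND
    return heaviside(h0 + h1 - 1.5)  # output unit: AND of h0, h1

for x0, x1 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x0, x1), "->", xor_mlp(x0, x1))  # 0, 1, 1, 0
```

The hidden layer maps the four inputs to points ((h0, h1)) that a single line can separate, which is exactly the transformation described above.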

Stage 5: Cybenko's Universal Approximation

Cybenko (1989) proved that a neural network with a single hidden layer and sigmoid activations can approximate any continuous function on a compact set to arbitrary precision.

Building a Tower from Step Functions

θ(x + s) — step up (jumps from 0 to 1 at x = −s)

θ(−x + s) — step down (jumps from 1 to 0 at x = s)

Tower = step up + step down − 1 (equals 1 on [−s, s] and 0 elsewhere)

Approximating a Function with Towers

More towers → better approximation of the target function (shown in orange).
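The construction can be sketched in code. The tower helper below generalizes the symmetric θ(x + s) + θ(−x + s) − 1 form to an arbitrary interval [left, right], and approx sums scaled towers to approximate a target function (function names are ours, for illustration):

```python
import math

def heaviside(z):
    return 1 if z >= 0 else 0

def tower(x, left, right):
    """1 on [left, right], 0 elsewhere: a step up at `left`
    plus a step down at `right`, minus 1 (as in the text)."""
    return heaviside(x - left) + heaviside(right - x) - 1

def approx(f, x, n=50, lo=0.0, hi=1.0):
    """Approximate f(x) on [lo, hi] by a sum of n scaled towers,
    each sampling f at its bin midpoint."""
    width = (hi - lo) / n
    total = 0.0
    for i in range(n):
        left = lo + i * width
        total += f(left + width / 2) * tower(x, left, left + width)
    return total

# More towers -> closer to the target (here, sin on [0, 2]);
# the two printed values agree to about three decimal places:
print(approx(math.sin, 1.234, n=200, lo=0.0, hi=2.0), math.sin(1.234))
```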

Stage 6: The Rise of Deep Networks

While a single hidden layer is theoretically sufficient (Cybenko), depth-separation results show that for certain function families a deep network can represent, with modestly many parameters, functions that any shallow network needs exponentially many units to match. Depth also enables hierarchical feature learning.

Deep Network Architecture
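A quick parameter count makes the budget comparison concrete (the layer sizes are illustrative; the exponential advantage itself is a depth-separation theorem, not something this count proves):

```python
def param_count(layer_sizes):
    """Total weights + biases in a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

deep = param_count([64, 128, 128, 128, 10])   # three hidden layers of 128
wide = param_count([64, 568, 10])             # one hidden layer of 568
print(deep, wide)  # 42634 42610 -- roughly equal budgets
```

At a roughly equal parameter budget, the deep network composes three stages of features where the shallow one gets a single, very wide stage.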

Historical Timeline

1958 · Perceptron (Rosenblatt)
1969 · Perceptrons (Minsky & Papert)
1986 · Backpropagation (Rumelhart et al.)
1989 · Universal Approximation (Cybenko)
2012 · AlexNet (Krizhevsky et al.)
2017 · Transformers (Vaswani et al.)
Key Insight: Depth beats width. Each layer learns increasingly abstract representations, enabling hierarchical feature extraction that shallow networks cannot efficiently replicate.