Neural Networks from Scratch, Part 2: Multilayer Perceptrons

In Part 1 we built a scalar autograd engine — a little Value class that records every operation and can backpropagate gradients through the whole graph — and we used it to train a single neuron. It worked, but a single neuron has a hard ceiling on what it can learn, and in this part we’re going to crash straight into that ceiling, then climb over it by stacking neurons into layers. The reward is the first network that deserves the name, and the realization that the backward() we already wrote needs no changes at all to train it.

(This is Part 2 of a series that builds a neural network from scratch, no frameworks, up to a small GPT. If you haven’t read Part 1, start there — we build directly on its Value engine.)

The wall: one neuron can’t learn XOR

Here is the smallest problem in the world that a single neuron cannot solve. XOR — “exclusive or” — is true when exactly one of its two inputs is true:

  x1   x2  | XOR
  -------- | ----
   0    0  |  -1
   0    1  |  +1
   1    0  |  +1
   1    1  |  -1

(I’ve written the labels as −1 and +1 so they live in the range of tanh.) The trouble is that no single straight line can separate the +1 cases from the −1 cases — the two positives sit on opposite corners, with the two negatives on the other diagonal. A neuron computes a weighted sum and squashes it, which geometrically is exactly one straight dividing line. XOR needs more than a line.
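
The geometric claim can be checked by brute force. The sketch below is an illustration, not part of the series’ code: it tries every (w1, w2, b) on a coarse grid (the range and step size are arbitrary choices) and counts how many lines put all four points on the correct side. AND is included only as a sanity check that the search does find a separating line when one exists.

```python
import itertools

POINTS = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
XOR = [-1.0, 1.0, 1.0, -1.0]
AND = [-1.0, -1.0, -1.0, 1.0]

def separates(w1, w2, b, labels):
    """True if the line w1*x1 + w2*x2 + b = 0 puts every point on its label's side."""
    return all((w1 * x1 + w2 * x2 + b) * y > 0
               for (x1, x2), y in zip(POINTS, labels))

# Arbitrary coarse grid: -5.0 to 5.0 in steps of 0.25 for each of w1, w2, b.
grid = [i / 4 for i in range(-20, 21)]

def count_separating_lines(labels):
    """Number of grid points (w1, w2, b) that classify all four inputs correctly."""
    return sum(separates(w1, w2, b, labels)
               for w1, w2, b in itertools.product(grid, repeat=3))

xor_lines = count_separating_lines(XOR)
and_lines = count_separating_lines(AND)
print("lines that separate XOR:", xor_lines)
print("lines that separate AND:", and_lines)
```

The search finds no line for XOR, and the grid is not to blame. A separating line would need b < 0, w2 + b > 0, w1 + b > 0, and w1 + w2 + b < 0. Adding the middle two gives w1 + w2 + 2b > 0, and since b < 0 that forces w1 + w2 + b > 0, which contradicts the last condition.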

Watch it fail. If we take the single-neuron model from Part 1 and train it on XOR, the loss flatlines and never recovers:

single neuron on XOR after 2000 steps: loss = 1.0
preds: [-0.0, 0.0, 0.0, 0.0]   targets: [-1.0, 1.0, 1.0, -1.0]

It doesn’t just do poorly. It gives up and outputs roughly zero for every input, which is the average of the targets, because under squared error that’s the best a single line can do here. The neuron isn’t broken; the model is too simple. We need to compose several lines into a curved boundary, and that means more than one layer.
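
The loss of 1.0 is not a coincidence. As a small check (a sketch, with an arbitrary grid of candidate outputs), the snippet below asks which constant prediction minimizes mean squared error on the XOR targets. It only covers the "same output for every input" case, which is where the single neuron ends up.

```python
ys = [-1.0, 1.0, 1.0, -1.0]  # XOR targets

def mse_constant(c):
    """Mean squared error if the model outputs c for every input."""
    return sum((c - y) ** 2 for y in ys) / len(ys)

# Arbitrary grid of candidate constants from -1.0 to 1.0 in steps of 0.1.
candidates = [i / 10 for i in range(-10, 11)]
best = min(candidates, key=mse_constant)
print(best, mse_constant(best))
```

The best constant is the mean of the targets, 0.0, and its loss is exactly 1.0, the value the single neuron is stuck at.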

The idea: layers compose simple boundaries

The fix is a hidden layer. Instead of one neuron staring at the raw inputs, we put a row of neurons in the middle. Each hidden neuron still draws its own straight line, but the output neuron no longer sees x1 and x2 — it sees the outputs of those hidden neurons, and learns to combine them. Two or three lines, combined, can fence off the region XOR needs. Stacking is how simple parts become a complex whole, and it’s the entire reason “deep” learning is deep.
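
To make "two lines, combined" concrete, here is a hand-wired network with two hidden neurons, written with plain floats instead of the Value engine. The weights are picked by hand for illustration (they are an assumption of this sketch, not what training finds): one hidden neuron draws an OR-like line, the other an AND-like line, and the output neuron computes "OR but not AND".

```python
import math

def xor_net(x1, x2):
    # Hidden neuron 1: close to +1 when at least one input is on (an OR-like line).
    h1 = math.tanh(10.0 * x1 + 10.0 * x2 - 5.0)
    # Hidden neuron 2: close to +1 only when both inputs are on (an AND-like line).
    h2 = math.tanh(10.0 * x1 + 10.0 * x2 - 15.0)
    # Output neuron: a weighted sum of the hidden outputs, "OR but not AND".
    return math.tanh(5.0 * h1 - 5.0 * h2 - 5.0)

xs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
ys = [-1.0, 1.0, 1.0, -1.0]
for (x1, x2), y in zip(xs, ys):
    print((x1, x2), round(xor_net(x1, x2), 3), "target", y)
```

Neither hidden line solves XOR on its own; the output neuron only has to separate the hidden outputs, and in that space a single line is enough.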

Neurons, layers, and a network

None of this requires touching the engine. We just need three small classes that build expressions out of Values. A Neuron owns a weight per input and a bias, and when you call it on an input it produces the familiar squashed weighted sum:

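import random

# Value is the scalar autograd class built in Part 1.

class Neuron:
    def __init__(self, nin):
        # one weight per input, initialized at random; the bias starts at zero
        self.w = [Value(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Value(0.0)

    def __call__(self, x):
        # weighted sum w·x + b; starting the sum at self.b keeps the result a Value
        act = sum((wi*xi for wi, xi in zip(self.w, x)), self.b)
        return act.tanh()

    def parameters(self):
        return self.w + [self.b]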

A Layer is just a list of neurons, each fed the same input; calling it returns one Value per neuron:

class Layer:
    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]

    def __call__(self, x):
        outs = [n(x) for n in self.neurons]
        # a one-neuron layer returns a bare Value, so the final output needs no indexing
        return outs[0] if len(outs) == 1 else outs

    def parameters(self):
        return [p for n in self.neurons for p in n.parameters()]

And an MLP — a multilayer perceptron — is just a list of layers, each feeding the next:

class MLP:
    def __init__(self, nin, nouts):
        sizes = [nin] + nouts
        self.layers = [Layer(sizes[i], sizes[i+1]) for i in range(len(nouts))]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)        # the output of one layer is the input to the next
        return x

    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]

That __call__ is the forward pass of the entire network, and here is the quietly profound part: calling model(x) just builds one big Value expression — inputs flowing through dozens of multiplies, adds, and tanhs. The engine from Part 1 sees no difference between that and the single-neuron expression. Backpropagation through a deep network is the exact same backward() we already wrote; the graph is simply larger. We didn’t teach the engine about layers — we didn’t have to.
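
How large is "larger"? A rough count, assuming the Neuron.__call__ shown above (one multiply and one add per input, then one tanh): the helper below is a hypothetical illustration, and it counts only the forward pass for one input, not the extra subtract and square nodes a loss adds on top.

```python
def count_ops(nin, nouts):
    """Count multiplies, adds, and tanhs in one forward pass of MLP(nin, nouts)."""
    sizes = [nin] + nouts
    total = 0
    for fan_in, width in zip(sizes, sizes[1:]):
        per_neuron = fan_in + fan_in + 1   # fan_in multiplies, fan_in adds, one tanh
        total += per_neuron * width
    return total

single_neuron_ops = count_ops(2, [1])
mlp_ops = count_ops(2, [8, 1])
print(single_neuron_ops, mlp_ops)
```

For a 2-input network with a hidden layer of 8 and one output, that is 57 operations per input against 5 for a lone neuron: the same kinds of nodes, just more of them.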

The training loop (and the bug everyone hits)

Training is almost identical to Part 1, now iterating over model.parameters(). But there’s one step that, if you forget it, will quietly ruin everything, and it’s a direct consequence of how we wrote the engine:

random.seed(42)
model = MLP(2, [8, 1])      # 2 inputs -> hidden layer of 8 -> 1 output
xs = [(0.0,0.0), (0.0,1.0), (1.0,0.0), (1.0,1.0)]
ys = [-1.0, 1.0, 1.0, -1.0]                       # XOR

lr = 0.1
for step in range(300):
    # forward: mean squared error over the four examples
    loss = Value(0.0)
    for x, y in zip(xs, ys):
        diff = model(x) - y
        loss = loss + diff*diff
    loss = loss * (1.0/len(xs))

    for p in model.parameters():   # <-- ZERO THE GRADIENTS FIRST
        p.grad = 0.0
    loss.backward()

    for p in model.parameters():   # gradient-descent step
        p.data += -lr * p.grad

    if step % 50 == 0:
        print(f"step {step:3d}  loss {loss.data:.4f}")

That p.grad = 0.0 loop matters because of a decision we made back in Part 1: every operation accumulates into .grad with +=, not =. That was deliberate, since a value used in several places must sum the gradients from all of them. But it means gradients don’t reset on their own. If you skip the zeroing, each step’s gradient piles on top of every previous step’s, so each update is driven by the running sum of all past gradients rather than the current one; the steps overshoot, and training typically oscillates or diverges. “Forgot to zero the gradients” is a rite of passage; now you know why it happens.
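
The accumulation is easy to see in isolation. The Value class below is a stripped-down stand-in written for this example only: it assumes Part 1’s interface (.data, .grad, backward(), and += accumulation) but supports nothing except multiplication, so treat it as a sketch rather than the real engine.

```python
class Value:
    """Minimal stand-in for Part 1's Value: multiplication only (an assumption)."""
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad    # accumulate with +=, never =
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topological order, then apply the chain rule from the output backwards.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

w = Value(3.0)              # plays the role of a parameter that persists across steps

(w * 2.0).backward()
first = w.grad              # d(2w)/dw = 2

(w * 2.0).backward()        # a fresh graph, the same parameter, no zeroing
stale = w.grad              # 2 left over from before, plus 2 from this pass

w.grad = 0.0                # the fix: reset before each backward pass
(w * 2.0).backward()
fresh = w.grad

print(first, stale, fresh)
```

The true gradient is 2 every time, but without the reset the second pass reports 4. In a training loop that error grows with every step.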

Eight neurons do what one couldn’t

Run it, and the wall we hit earlier simply isn’t there anymore:

step   0  loss 1.2434
step  50  loss 0.2526
step 100  loss 0.0686
step 150  loss 0.0326
step 200  loss 0.0201
step 250  loss 0.0142

preds: [-0.922, 0.896, 0.891, -0.88]   targets: [-1.0, 1.0, 1.0, -1.0]

The loss falls from 1.24 to about 0.01, and the predictions land firmly on the right side of zero for all four cases — the exact problem that pinned a single neuron at a loss of 1.0. With one hidden layer of eight neurons (thirty-three parameters in total), the network learned to bend a straight boundary into one that wraps around XOR’s diagonal. The hidden neurons discovered useful intermediate lines; the output neuron learned how to combine them.
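
The parameter count follows directly from the layer sizes: each neuron has one weight per input plus a bias. A small helper (hypothetical, not part of the classes above) does the arithmetic:

```python
def count_params(nin, nouts):
    """Parameters in MLP(nin, nouts): (inputs + 1 bias) per neuron, summed over layers."""
    sizes = [nin] + nouts
    return sum((fan_in + 1) * width for fan_in, width in zip(sizes, sizes[1:]))

hidden_params = count_params(2, [8])      # 8 neurons x (2 weights + 1 bias)
total_params = count_params(2, [8, 1])    # plus 1 output neuron x (8 weights + 1 bias)
print(hidden_params, total_params)
```

That gives 24 parameters in the hidden layer and 9 in the output neuron, 33 in total, which should match len(model.parameters()) for the model trained above.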

What actually changed

Almost nothing, and that’s the lesson. We added no new math and not a single line to the autograd engine. We wrote three tiny container classes, pointed the same training loop at a bigger pile of parameters, and a problem that was previously impossible became routine. This is the whole shape of deep learning in miniature: the learning algorithm doesn’t get smarter as networks get deeper; it just gets more knobs to turn, and composition does the rest. The same forward-then-backward()-then-nudge loop, scaled up, is in essence what trains a model with billions of parameters.

There’s a soft spot in what we have, though. We’re using tanh outputs and squared-error loss to do what is really a classification task, and that pairing is clumsy — it learns slowly and says nothing about confidence. In Part 3 we’ll fix the output end of the network: meet the softmax function and the cross-entropy loss, the natural language for “which class, and how sure?” That upgrade is also exactly what we’ll need to start predicting the next character in a sequence — the first real step toward a language model.