Neural Networks from Scratch, Part 3: Softmax and Cross-Entropy

By the end of Part 2 we had a multilayer perceptron that could learn XOR, trained with tanh outputs and a squared-error loss. That pairing got us off the ground, but it’s the wrong tool for the job we’re actually heading toward. XOR was really a classification question — “which class does this belong to?” — and squared error answers a different question, “how close is this number?” In this part we fix the output end of the network properly, with the two ideas every classifier (and every language model) is built on: softmax and cross-entropy. Along the way we’ll teach our autograd engine three new tricks.

(Part 3 of a from-scratch, no-frameworks series. We build on the Value engine from Part 1 and the MLP from Part 2.)

Why squared error is the wrong loss

Suppose a network must sort inputs into three classes. With squared error you’d force it to regress toward numeric labels — 0, 1, 2 — which quietly tells the model that class 2 is “twice” class 1 and that 1 sits “between” 0 and 2. That ordering is a fiction; the classes are just names. What we really want is for the network to output a probability distribution over the classes — “70% class A, 20% B, 10% C” — and a loss that rewards putting probability on the right answer. Squared error gives us neither. Softmax and cross-entropy give us both.

Softmax: scores into a distribution

We let the network’s final layer output raw, unbounded scores — one per class, called logits — and then convert them into probabilities with the softmax function: exponentiate each logit (making everything positive) and divide by the total (making it sum to one).

def softmax(logits):
    m = max(l.data for l in logits)        # subtract the max for numeric stability
    exps = [(l - m).exp() for l in logits]
    s = sum(exps, Value(0.0))
    return [e / s for e in exps]

The result is a genuine probability distribution: every entry is positive, and they sum to one. Softmax is a “soft” version of argmax: the largest logit gets the most probability, but the others keep a share that shrinks exponentially with how far they trail it. Subtracting the maximum changes nothing mathematically (the common factor cancels in the ratio) but keeps exp from overflowing on large logits, a small trick worth building in from the start.
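To see both claims in numbers, here is a small sketch on plain Python floats rather than Value objects. The name softmax_floats and the example logits are made up for illustration; the logic mirrors the softmax above.

```python
import math

def softmax_floats(logits):
    """Plain-float softmax using the same max-subtraction trick."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax_floats([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])    # [0.659, 0.242, 0.099]

# Shifting every logit by the same amount leaves the distribution unchanged
# (up to floating-point rounding), and the large values no longer overflow.
shifted = softmax_floats([1002.0, 1001.0, 1000.1])
print([round(p, 3) for p in shifted])  # [0.659, 0.242, 0.099]
```

Without the subtraction, math.exp(1002.0) would raise an OverflowError before we ever got to the division.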

Cross-entropy: the loss that fits

If the network outputs a distribution, the natural loss asks a simple question: what probability did you assign to the correct answer? Cross-entropy is the negative logarithm of that probability:

def cross_entropy(logits, target):
    probs = softmax(logits)
    return -probs[target].log()

When the model is confident and right — probability near 1 — the log is near 0 and the loss is tiny. When it is confident and wrong — probability near 0 for the true class — the log dives toward negative infinity and the loss explodes. That asymmetry is exactly what you want: it punishes confident mistakes far more than hesitant ones, which pushes the network toward being calibrated, not just correct. Minimizing cross-entropy is the same as maximizing the likelihood of the training labels — it is maximum-likelihood estimation wearing a different hat.
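A quick way to feel that asymmetry is to tabulate the loss for a few probabilities on the true class. This is only an illustration on plain floats; the helper name nll and the probabilities chosen are arbitrary.

```python
import math

def nll(p_true):
    """Cross-entropy for one example: minus the log of the probability on the true class."""
    return -math.log(p_true)

for p in (0.99, 0.9, 0.5, 1 / 3, 0.1, 0.01):
    print(f"p(true) = {p:.2f}   loss = {nll(p):.3f}")
```

Going from 0.99 to 0.9 costs about a tenth of a unit of loss, while going from 0.1 to 0.01 costs more than two full units. The 1/3 row is the score of a uniform guess over three classes, ln(3) ≈ 1.10.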

Teaching the engine to exp, log, and divide

Softmax needs exp and division; cross-entropy needs log. Our Part 1 engine only knew +, *, and tanh, so we extend Value with each new operation and its one-line local derivative — the same recipe as before: compute the result, and close over a _backward that knows the rule from calculus.

    # These methods go inside class Value; they need `import math` at the top of the file.
    def exp(self):
        e = math.exp(self.data)
        out = Value(e, (self,), 'exp')
        def _backward():
            self.grad += e * out.grad           # d(e^x)/dx = e^x
        out._backward = _backward
        return out

    def log(self):
        out = Value(math.log(self.data), (self,), 'log')
        def _backward():
            self.grad += (1.0/self.data) * out.grad   # d(ln x)/dx = 1/x
        out._backward = _backward
        return out

    def __pow__(self, k):                       # k is a constant exponent
        out = Value(self.data ** k, (self,), f'**{k}')
        def _backward():
            self.grad += k * self.data**(k-1) * out.grad   # d(x^k)/dx = k*x^(k-1)
        out._backward = _backward
        return out

    def __truediv__(self, other):               # a / b  ==  a * b^-1
        return self * (other ** -1)

That’s the whole extension: four short methods. Division is the slickest: rather than write a new rule, we define a / b as a * b⁻¹ and let the existing * and the new ** handle the gradient by composition. Because every piece is built from differentiable primitives, softmax and cross-entropy are differentiable end to end the moment we write them; we never differentiate them by hand. (I checked all four new operations against finite-difference gradients before trusting them; they match to one part in a billion.)
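For anyone who wants to repeat a finite-difference check like that, here is a minimal sketch. The Value class below is a stripped-down stand-in written just for this example (only *, **, /, exp, log, and backward), not the full engine from Part 1; the test function, the input values, and the step h are likewise illustrative choices.

```python
import math

class Value:
    """Stand-in for the series' Value class (assumption: simplified for this check)."""
    def __init__(self, data, children=(), op=''):
        self.data, self.grad = data, 0.0
        self._prev, self._op = children, op
        self._backward = lambda: None

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def __pow__(self, k):                      # k is a constant exponent
        out = Value(self.data ** k, (self,), f'**{k}')
        def _backward():
            self.grad += k * self.data ** (k - 1) * out.grad
        out._backward = _backward
        return out

    def __truediv__(self, other):              # a / b == a * b^-1
        return self * (other ** -1)

    def exp(self):
        e = math.exp(self.data)
        out = Value(e, (self,), 'exp')
        def _backward():
            self.grad += e * out.grad
        out._backward = _backward
        return out

    def log(self):
        out = Value(math.log(self.data), (self,), 'log')
        def _backward():
            self.grad += (1.0 / self.data) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, seen = [], set()
        def build(v):                          # topological order of the graph
            if v not in seen:
                seen.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

def f_value(a, b):
    return (a.exp() / b).log()                 # touches exp, /, ** and log at once

def f_float(a, b):
    return math.log(math.exp(a) / b)           # same function on plain floats

a0, b0, h = 0.7, 1.3, 1e-6
a, b = Value(a0), Value(b0)
f_value(a, b).backward()

# Central differences: (f(x + h) - f(x - h)) / 2h
num_da = (f_float(a0 + h, b0) - f_float(a0 - h, b0)) / (2 * h)
num_db = (f_float(a0, b0 + h) - f_float(a0, b0 - h)) / (2 * h)
print("d/da  autograd:", a.grad, " numeric:", num_da)
print("d/db  autograd:", b.grad, " numeric:", num_db)
```

This particular function simplifies to a - ln(b), so the exact gradients are 1 and -1/b, which gives a third number to compare against.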

A real classifier

One small change to the network: the output layer should emit raw logits, not squashed tanh values, so we let a Neuron skip its nonlinearity and have the MLP make only its last layer linear:

import random                    # needed for the weight initialization below

class Neuron:
    def __init__(self, nin, nonlin=True):
        self.w = [Value(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Value(0.0)
        self.nonlin = nonlin
    def __call__(self, x):
        act = sum((wi*xi for wi, xi in zip(self.w, x)), self.b)
        return act.tanh() if self.nonlin else act   # last layer: raw logits

# in MLP.__init__, make only the final layer linear:
#   nonlin = (i != len(nouts) - 1)

Now the training loop looks almost exactly like Part 2 (forward, zero the gradients, backward(), nudge), except the loss is cross-entropy averaged over the data. Here it is on three little clusters of points, one per class:

model = MLP(2, [16, 3])          # 2 inputs -> hidden 16 -> 3 logits
lr = 0.1
for step in range(120):
    loss = Value(0.0)
    for x, y in data:                       # y is the class index 0, 1, or 2
        loss = loss + cross_entropy(model(x), y)
    loss = loss * (1.0/len(data))

    for p in model.parameters():
        p.grad = 0.0
    loss.backward()
    for p in model.parameters():
        p.data += -lr * p.grad

    if step % 20 == 0:
        print(f"step {step:3d}  loss {loss.data:.4f}")

step   0  loss 2.7997
step  20  loss 0.0415
step  40  loss 0.0225
step  60  loss 0.0158
step  80  loss 0.0122
step 100  loss 0.0100

accuracy: 24/24 = 100%

The loss starts at 2.80, actually worse than the ln(3) ≈ 1.10 a uniform guesser would score, because the randomly initialized weights make confident wrong bets out of the gate. By step 100 it is down to about 0.01, and every point is classified correctly. To read off a prediction we just take the class with the highest logit; softmax preserves that order, so we don’t even need to normalize at inference time.
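Reading off predictions and counting accuracy can be sketched like this. It works on plain floats, so with our engine you would first pull out l.data from each logit the model returns. The helper names (argmax, accuracy, softmax_floats) and the sample numbers are invented for the example.

```python
import math

def argmax(scores):
    """Index of the largest score."""
    return max(range(len(scores)), key=lambda i: scores[i])

def softmax_floats(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def accuracy(all_logits, targets):
    """Fraction of examples whose highest logit is the true class."""
    hits = sum(argmax(row) == y for row, y in zip(all_logits, targets))
    return hits / len(targets)

logits = [1.2, -0.4, 3.1]
# exp is increasing and the divisor is shared, so the winner is the same either way
assert argmax(logits) == argmax(softmax_floats(logits)) == 2

batch = [[2.0, 0.1, -1.0], [0.3, 0.2, 1.5], [-0.5, 0.9, 0.1]]
print(accuracy(batch, [0, 2, 1]))   # 1.0
```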

Why this is the whole ballgame for LLMs

Here is the payoff, and the reason this part matters more than it looks. Predicting the next token in a sequence is just classification over the vocabulary. Show a language model the text so far, have it emit one logit per possible next token, softmax them into a distribution, and train it with cross-entropy to put probability on the token that actually came next. That is exactly the machinery we just built; the only differences are that the “classes” number in the tens of thousands or more and the inputs are sequences rather than points. Essentially every large language model is pretrained by minimizing the cross-entropy of next-token prediction.
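To make the correspondence concrete, here is the same loss computed over a toy four-symbol vocabulary. Everything in it is hypothetical: the vocabulary, the logits (which a real model would compute from the preceding text), and the "true" next token.

```python
import math

vocab = ['a', 'b', 'c', 'd']          # toy vocabulary (hypothetical)
logits = [0.5, 2.0, -1.0, 0.1]        # one made-up score per possible next token
next_token = 'b'                      # the token that "actually came next"

def softmax_floats(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax_floats(logits)
target = vocab.index(next_token)      # the class index, just like y in the classifier
loss = -math.log(probs[target])       # cross-entropy for this one prediction

uniform_loss = math.log(len(vocab))   # what a uniform guesser would score
print(f"loss {loss:.4f}  vs uniform {uniform_loss:.4f}")
```

Swap the four symbols for a real vocabulary and the made-up logits for a network’s output, and this is the per-token training loss.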

We now have the complete toolkit: an autograd engine, networks built from it, and the right loss for predicting symbols. In Part 4 we’ll point it at text for the first time and build the simplest possible language model — a character-level model that learns the statistics of one character following another, and samples brand-new (if slightly unhinged) text of its own. The road to a GPT runs straight through the loss we just wrote.