Neural Networks from Scratch, Part 1: The Neuron and the Chain Rule

Large language models can feel like magic, and magic is a terrible way to understand anything. So I want to do the opposite: build a neural network up from a single multiplication, with no frameworks — no PyTorch, no TensorFlow, not even NumPy at first — just plain Python and a willingness to compute every derivative ourselves. By the end of this series we will have written, by hand, the same machinery that powers a GPT: a small transformer that reads tinyshakespeare and dreams up more of it. But we earn that one multiply at a time.

This is Part 1. Its entire goal is to answer the question that every neural network is secretly built around: if I wiggle this number, which way does the answer move? Get that, and you understand backpropagation. Everything after is scale.

Where we’re headed

The series climbs a ladder, each rung a small, runnable program:

1. The neuron and the chain rule: a scalar autograd engine; train one neuron. (you are here)
2. Multilayer perceptrons: stack neurons into layers and backprop through them.
3. Making it learn well: softmax, cross-entropy, and a real training loop.
4. Language modeling, the simplest version: a character-level model that samples text.
5. Embeddings and context: turning symbols into vectors.
6. Graduating to tensors: the same ideas, vectorized with NumPy.
7. Attention from scratch: keys, queries, and values.
8. The transformer block: multi-head attention, residuals, layer norm.
9. A tiny GPT: assemble it all and generate Shakespeare.

A neuron is a very small function

Strip away the mystique and a neuron is just a weighted sum of its inputs, plus a bias, squashed through a nonlinearity. With two inputs:

import math

def neuron(x1, x2, w1, w2, b):
    act = x1*w1 + x2*w2 + b   # the weighted sum
    return math.tanh(act)     # the squashing nonlinearity

print(neuron(2.0, 1.0, w1=-0.6, w2=0.8, b=0.1))  # -> -0.2913...

That tanh is what makes it interesting: it bends the straight line of the weighted sum into an S-curve that saturates toward −1 and +1, which is what lets a network of these represent things a plain line never could. The weights w1, w2 and the bias b are the knobs. “Learning” is nothing more than turning those knobs until the neuron does what we want.
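To see the squashing for yourself, here is a small sketch that feeds a few weighted sums through tanh. The probe values and the helper name squash_table are mine, picked only to show the shape of the curve.

```python
import math

def squash_table(acts):
    """Pair each weighted sum with its tanh output."""
    return [(act, math.tanh(act)) for act in acts]

# Hypothetical probe values: small ones stay nearly linear, large ones saturate.
for act, out in squash_table([-5.0, -1.0, -0.1, 0.0, 0.1, 1.0, 5.0]):
    print(f"act {act:+.1f} -> tanh {out:+.4f}")
```

Near zero the output tracks the input almost exactly; far from zero it flattens out just short of -1 or +1.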

Learning is just reducing a loss

To turn the knobs in the right direction we need a number that says how wrong we currently are — a loss. For a handful of examples, each with an input and a desired output, the classic choice is the mean squared error:

def loss(params):
    w1, w2, b = params
    xs = [(2.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-2.0, -1.0)]
    ys = [1.0, -1.0, -1.0, -1.0]               # the targets
    total = 0.0
    for (x1, x2), y in zip(xs, ys):
        pred = neuron(x1, x2, w1, w2, b)
        total += (pred - y) ** 2               # squared error
    return total / len(xs)

When the loss is large the neuron is wrong; when it’s near zero it’s right. So the whole problem reduces to a single question: how should I change each weight to make this loss go down?
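As a quick illustration, the sketch below scores two settings of the knobs with the same loss. The hand_tuned values are an assumption of mine, chosen by eye so that only the first example lands on the positive side; nothing has learned them yet.

```python
import math

def neuron(x1, x2, w1, w2, b):
    return math.tanh(x1*w1 + x2*w2 + b)

def loss(params):
    w1, w2, b = params
    xs = [(2.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-2.0, -1.0)]
    ys = [1.0, -1.0, -1.0, -1.0]
    total = 0.0
    for (x1, x2), y in zip(xs, ys):
        total += (neuron(x1, x2, w1, w2, b) - y) ** 2
    return total / len(xs)

start = [-0.6, 0.8, 0.1]        # the knob settings used in the first example
hand_tuned = [1.0, 1.0, -1.5]   # hypothetical settings, picked by eye

print("start:     ", loss(start))
print("hand-tuned:", loss(hand_tuned))
```

The second number comes out far smaller than the first. The rest of this post is about finding settings like that automatically instead of by squinting.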

The derivative is the answer to “which way?”

The derivative of the loss with respect to a weight is exactly that: if I nudge this weight up a hair, does the loss go up or down, and how fast? If we know that for every weight, we just step each one a little in the direction that lowers the loss, and repeat. That’s gradient descent.
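Here is the whole idea in miniature, as a sketch on a toy one-knob loss rather than our neuron. The function f, its hand-derived slope df, the learning rate, and the step count are all illustrative choices of mine.

```python
def f(w):
    return (w - 3.0) ** 2      # toy loss with its minimum at w = 3

def df(w):
    return 2.0 * (w - 3.0)     # its derivative, worked out by hand

def descend(w, lr=0.1, steps=50):
    for _ in range(steps):
        w -= lr * df(w)        # step against the slope, so the loss goes down
    return w

w_final = descend(0.0)
print(w_final, f(w_final))
```

Starting from w = 0, each step closes a fixed fraction of the remaining gap to 3, so the knob creeps up on the minimum without ever being told where it is.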

Before we get clever, here’s the dumbest possible way to measure a derivative — literally wiggle the input and watch the output. It’s too slow for real use, but it’s an unbeatable sanity check, and it makes the idea concrete:

def numerical_gradient(params, i, h=1e-6):
    up = list(params); up[i] += h
    dn = list(params); dn[i] -= h
    return (loss(up) - loss(dn)) / (2*h)   # rise over run

params = [-0.6, 0.8, 0.1]
print([round(numerical_gradient(params, i), 4) for i in range(3)])
# -> [gradients for w1, w2, b]

This works, but imagine doing it for a network with a hundred million weights: at two loss evaluations per weight, you’d have to re-run the entire model two hundred million times per step. Useless. We need a way to get all the derivatives from a single backward pass. That way is the chain rule.

The chain rule, and why it’s the whole game

Our loss isn’t one operation; it’s a chain of them: multiply, add, tanh, subtract, square. The chain rule says that to find how the final output depends on some input deep inside, you multiply the local derivatives along the path connecting them. Each operation only needs to know how to differentiate itself — its tiny local rule — and the chain rule stitches those local rules into a global one.

That’s the entire trick behind backpropagation: build the computation as a graph, then walk it backward from the output, multiplying local derivatives as you go, handing each node its share of the gradient. Let’s make the computer do the bookkeeping.
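Before handing the bookkeeping over, it may help to do it once by hand. The sketch below assumes a simplified one-input neuron and a single example (my simplification, not the two-input neuron above), multiplies the local derivatives back along the chain, and compares the result with the wiggle method. All function names here are illustrative.

```python
import math

def forward(w, b, x, y):
    act = w * x + b            # multiply, add
    pred = math.tanh(act)      # tanh
    diff = pred - y            # subtract
    return diff * diff         # square

def grads_by_chain_rule(w, b, x, y):
    # Forward pass, keeping the intermediates we need on the way back.
    act = w * x + b
    pred = math.tanh(act)
    diff = pred - y
    # Backward pass: each line multiplies in one local derivative.
    d_diff = 2.0 * diff                  # d(diff^2)/d(diff)
    d_pred = d_diff * 1.0                # d(pred - y)/d(pred)
    d_act = d_pred * (1.0 - pred ** 2)   # d(tanh)/d(act)
    d_w = d_act * x                      # d(w*x + b)/dw
    d_b = d_act * 1.0                    # d(w*x + b)/db
    return d_w, d_b

def grads_by_wiggling(w, b, x, y, h=1e-6):
    d_w = (forward(w + h, b, x, y) - forward(w - h, b, x, y)) / (2 * h)
    d_b = (forward(w, b + h, x, y) - forward(w, b - h, x, y)) / (2 * h)
    return d_w, d_b

print(grads_by_chain_rule(-0.6, 0.1, 2.0, 1.0))
print(grads_by_wiggling(-0.6, 0.1, 2.0, 1.0))
```

The two answers should agree to several decimal places. The difference is that the chain-rule version got both gradients from one forward and one backward sweep, while the wiggle version needed four extra forward passes.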

Building a tiny autograd engine

We’ll wrap every number in a small object that remembers how it was produced (which numbers and which operation made it) and knows its own local derivative rule. This is the core idea behind the automatic differentiation in frameworks like PyTorch, in about forty lines.

class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0              # d(loss)/d(self), filled in by backward()
        self._backward = lambda: None
        self._prev = set(_children)  # the Values that produced this one
        self._op = _op

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad  += out.grad   # d(a+b)/da = 1
            other.grad += out.grad   # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad  += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data  * out.grad   # d(a*b)/db = a
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,), 'tanh')
        def _backward():
            self.grad += (1 - t**2) * out.grad    # d(tanh)/dx = 1 - tanh^2
        out._backward = _backward
        return out

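Notice the pattern: every operation computes its result and closes over a _backward function that knows how to push gradient from the output back to its inputs. Addition passes the gradient through unchanged; multiplication scales it by the other operand’s value; tanh scales it by 1 - tanh². Each is a one-line fact from calculus. The += matters too: a Value that feeds into more than one operation receives gradient along each path, and the chain rule says those contributions add up.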
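The accumulation with += is easy to overlook, so here is a sketch of the case where it matters: a value multiplied by itself. To stay short it uses a pared-down stand-in for the class above (multiplication only, no graph tracking), and it seeds and runs the one backward rule by hand.

```python
class Value:
    # Pared-down stand-in for the article's class: just enough for a * a.
    def __init__(self, data):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None

    def __mul__(self, other):
        out = Value(self.data * other.data)
        def _backward():
            self.grad  += other.data * out.grad
            other.grad += self.data  * out.grad
        out._backward = _backward
        return out

a = Value(3.0)
sq = a * a          # self and other are the same object here
sq.grad = 1.0       # seed: d(sq)/d(sq) = 1
sq._backward()
print(a.grad)       # -> 6.0, matching d(a^2)/da = 2a
```

Both lines of the multiply rule land on the same a, contributing 3.0 each. With = in place of +=, the second write would clobber the first and report 3.0, which is wrong.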

The only thing left is to run those local rules in the right order — every node must receive its full gradient from downstream before it passes any upstream. That ordering is a topological sort of the graph, and then a single reversed walk:

    def backward(self):
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)      # children come before parents
        build(self)

        self.grad = 1.0             # d(loss)/d(loss) = 1, the seed
        for v in reversed(topo):    # walk from output back to inputs
            v._backward()

A few Python conveniences (__radd__, __rmul__, __neg__, __sub__), each of which can be written as a one-liner in terms of + and *, let us write x*w + b and pred - y with ordinary operators, and that’s the engine. It can differentiate any expression you build out of +, *, and tanh, no matter how tangled, in one pass.
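Since those helpers aren't spelled out, here is one way they might look, shown on a complete copy of the class so the block runs on its own. The four one-liners at the bottom are my sketch, not necessarily the exact code behind this post; everything above them repeats the class as written.

```python
import math

class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad  += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad  += other.data * out.grad
            other.grad += self.data  * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,), 'tanh')
        def _backward():
            self.grad += (1 - t**2) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

    # The conveniences (a sketch): each reuses + and *, so no new
    # derivative rules are needed.
    def __neg__(self):           # -a
        return self * -1
    def __sub__(self, other):    # a - b
        return self + (-other)
    def __radd__(self, other):   # 2 + a, where the left side is a plain number
        return self + other
    def __rmul__(self, other):   # 2 * a, where the left side is a plain number
        return self * other

# Usage: one input, one example, plain floats mixed in freely.
w, b = Value(-0.6), Value(0.1)
diff = (2.0 * w + b).tanh() - 1.0
sq_err = diff * diff
sq_err.backward()
print(w.grad, b.grad)
```

Because subtraction and negation are expressed through __add__ and __mul__, their gradients come for free from the rules already written.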

Training one neuron

Now the payoff. We rebuild the neuron out of Values, compute the loss, call backward() once to fill in every gradient, and nudge each parameter downhill. Repeat a couple hundred times.

import random
random.seed(1)

xs = [(2.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-2.0, -1.0)]
ys = [1.0, -1.0, -1.0, -1.0]

w = [Value(random.uniform(-1, 1)), Value(random.uniform(-1, 1))]
b = Value(0.0)

def forward(x):
    return (w[0]*x[0] + w[1]*x[1] + b).tanh()

lr = 0.1
for step in range(200):
    # forward pass: mean squared error as one big expression
    loss = Value(0.0)
    for x, y in zip(xs, ys):
        diff = forward(x) - y
        loss = loss + diff*diff
    loss = loss * (1.0/len(xs))

    for p in w + [b]:        # reset gradients
        p.grad = 0.0
    loss.backward()          # fill them all in, in one pass

    for p in w + [b]:        # gradient-descent step
        p.data += -lr * p.grad

    if step % 40 == 0:
        print(f"step {step:3d}  loss {loss.data:.4f}")

Run it, and you watch a number learn:

step   0  loss 2.2508
step  40  loss 0.0389
step  80  loss 0.0186
step 120  loss 0.0120
step 160  loss 0.0088

The loss falls from 2.25 to under 0.01, and the neuron’s predictions snap to the targets. Nobody told it the weights; it found them by repeatedly asking “which way?” and taking a small step. That loop — forward, backward(), nudge — is, with more knobs and fancier pieces bolted on, exactly how a GPT is trained.

What we built, and what’s next

In one sitting we wrote a working automatic-differentiation engine and used it to train a neuron from nothing. The crucial idea — the one that scales all the way to billion-parameter models — is that backpropagation is just the chain rule, applied mechanically backward over a graph of tiny local derivatives. There is no magic in the gradient; there is only bookkeeping done carefully.

One neuron can only carve the world with a single line. In Part 2 we’ll wire many of them into layers, stack the layers into a multilayer perceptron, and watch the same backward() we just wrote train a network that can learn shapes a single neuron never could. Same engine, more knobs — that’s the whole road to the LLM.