I spent the last few days learning from The spelled-out intro to neural networks and backpropagation: building micrograd. First, the concept of a derivative was introduced, along with differentiation, which is the basis of backpropagation: to find out how the final result is influenced by a parameter, we need the derivative to determine the direction (positive or negative) and strength (absolute value) of that influence. And by the definition of the derivative, the value of a particular derivative can be estimated easily.
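For example, a tiny numerical check of that idea (my own sketch, not code from the lecture) might look like this:
# approximate the derivative of f at x by nudging x a little, straight from the definition
def numerical_derivative(f, x, h=1e-6):
    return (f(x + h) - f(x)) / h

# d/dx of 3*x**2 at x = 2.0 is 12; the approximation comes out very close
print(numerical_derivative(lambda x: 3 * x**2, 2.0))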
A Python class Value was created after this prerequisite information. It overloads some operators like *, /, power, etc., like this:
class Value:
    def __add__(self, other):
        return Value(self.data + other.data)

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data)
        return out
and attributes to keep the child nodes, the operator, the gradient, and the backward function were added to the constructor:
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self._children = set(_children)
        self._op = _op
        self.grad = 0
        self._backward = lambda: None
The gradient defaults to 0 and the backward function to a no-op, and all these details were then wired into the operator overloads:
class Value:
    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, _children=(self, other), _op="+")
        def _backward():
            # for addition, the output gradient flows straight through to both operands
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out
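The __mul__ overload gets the same treatment; by the chain rule, each operand's gradient picks up the other operand's data (a sketch following the same pattern as __add__ above):
    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, _children=(self, other), _op="*")
        def _backward():
            # chain rule: d(out)/d(self) = other.data and d(out)/d(other) = self.data
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out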
Such an __add__ can handle Value(2.0) + 1, but 1 + Value(2.0) won't work, because int has no operator overload that knows about Value. So an __radd__ was added; Python falls back to it with the operands swapped, and it simply calls __add__:
class Value:
    def __add__(self, other):
        ...
        return out

    def __radd__(self, other):
        return self + other
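A quick check of the fallback (my own sketch):
a = Value(2.0)
print((a + 1).data)   # 3.0, handled by Value.__add__
print((1 + a).data)   # 3.0, int + Value falls back to Value.__radd__, which reuses __add__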
To iterate over all nodes of the computation graph, a function that builds a topological ordering was implemented:
topo = []
visited = set()
def build_topo(v):
    if v not in visited:
        visited.add(v)
        for child in v._children:
            build_topo(child)
        # a node is appended only after all of its children
        topo.append(v)
This function visits every node exactly once and appends it to the topo list after all of its children, so walking topo in reverse order visits every node before its children, which is exactly the order backpropagation needs.
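In micrograd, this ordering is what drives the backward() call used later as loss.backward(): it seeds the output's gradient with 1 and then runs every node's _backward in reverse topological order. A sketch of how that method might look on the Value class above:
class Value:
    ...
    def backward(self):
        # build the topological order starting from this node
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)

        # d(self)/d(self) = 1, then propagate gradients backwards through the graph
        self.grad = 1
        for node in reversed(topo):
            node._backward()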
In the next step, a Multi-Layer Perceptron (MLP) was created like this:
import random

class Neuron:
    def __init__(self, nin):
        self.ws = [Value(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Value(random.uniform(-1, 1))
    def __call__(self, xs):
        out = sum((w*x for w, x in zip(self.ws, xs)), self.b)
        out = out.tanh()  # tanh is implemented as a method on Value in the lecture
        return out
    def parameters(self):
        return self.ws + [self.b]

class Layer:
    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]
    def __call__(self, xs):
        outs = []
        for neuron in self.neurons:
            outs.append(neuron(xs))
        # a layer with a single neuron returns a plain Value instead of a list
        return outs[0] if len(outs) == 1 else outs
    def parameters(self):
        params = []
        for neuron in self.neurons:
            params += neuron.parameters()
        return params

class MLP:
    def __init__(self, nin, layersz):
        sz = [nin] + layersz
        self.layers = [Layer(sz[i], sz[i+1]) for i in range(len(sz)-1)]
    def __call__(self, xs):
        for layer in self.layers:
            xs = layer(xs)
        return xs
    def parameters(self):
        params = []
        for layer in self.layers:
            params += layer.parameters()
        return params
In this section an MLP is built, which is a neural network with multiple layers of neurons. A neuron takes several inputs (nin in the code above), multiplies each by a weight, adds a bias, and finally applies an activation function to the sum to produce a single output. A layer consists of many such neurons (nout of them); since each neuron produces exactly one output, they all receive the same inputs but hold different weights. An MLP consists of many layers, where every neuron in a layer takes the previous layer's outputs as its inputs; each layer can have a different number of neurons, which is defined by the variable layersz in the code above.
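To make a single neuron concrete, here is a tiny worked example with made-up numbers (my own sketch, using plain floats instead of Value objects):
import math

x  = [1.0, 2.0]    # hypothetical inputs
ws = [0.5, -1.0]   # hypothetical weights
b  = 0.2           # hypothetical bias

pre = sum(w * xi for w, xi in zip(ws, x)) + b   # 0.5*1.0 + (-1.0)*2.0 + 0.2 = -1.3
out = math.tanh(pre)                            # the activation squashes it to about -0.86
print(out)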
Assume we want to create an MLP with 3 layers of 2, 4, and 1 neurons respectively, taking 4 inputs at the beginning; we can simply call
layersz = [2, 4, 1]
nin = 4
neural_network = MLP(nin, layersz)
to have a neural network ready. But currently it can barely do anything, because its parameters, the weights and bias of each neuron, are just random numbers between -1 and 1, from this code:
self.ws = [Value(random.uniform(-1, 1)) for _ in range(nin)]
self.b = Value(random.uniform(-1, 1))
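As a quick sanity check of the wiring (my own arithmetic, not from the lecture): the first layer has 2 neurons with 4 weights and a bias each, the second has 4 neurons with 2 weights and a bias each, and the last has 1 neuron with 4 weights and a bias, so the network holds 2*(4+1) + 4*(2+1) + 1*(4+1) = 27 parameters in total:
print(len(neural_network.parameters()))   # 27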
To make it useful we now have to change them. Let's create a dataset: xs stands for a group of inputs x, and yr stands for the real y corresponding to each x. For example, the input [1, 3, 4, 1] has y = 1 as its result.
xs = [
    [1, 3, 4, 1],
    [2, 3, 5, 7],
    [8, 6, 3, 2],
]
yr = [1, 2, 9]
To train our neural network, we feed in these xs and compare the outputs ys to the real values yr. To measure how close we are to the right answer, we use (yr - ys)^2 as the loss, measure each parameter's influence on the loss, and then minimize the loss.
ys = [neural_network(x) for x in xs]
ys
>>>
[Value(data=-0.5970930530845828),
Value(data=-0.5396885416549594),
Value(data=-0.6383425030522946)]
and the loss with the current parameters is
loss = sum((ye-y)**2 for ye, y in zip(ys, yr))
loss
>>>
Value(data=101.8983703149689)
To reduce it, we need to adjust each parameter. By how much is decided by its gradient: it tells us in which direction and how strongly the parameter influences the loss, so we compute the gradients via backpropagation, and to make the loss smaller we step against the direction of the gradient.
loss.backward()
Then we get all parameters and nudge each one against its gradient by a small step (the learning rate):
learning_rate = 0.001
for param in neural_network.parameters():
    param.data -= param.grad*learning_rate
recalculate loss:
ys = [neural_network(x) for x in xs]
loss = sum((ye-y)**2 for ye, y in zip(ys, yr))
loss
>>>
Value(data=101.15491729894036)
smaller than the previous one. If we repeat this process 20 times:
for _ in range(20):
    learning_rate = 0.001
    # update every parameter against its gradient
    for param in neural_network.parameters():
        param.data -= param.grad*learning_rate
    # forward pass and loss
    ys = [neural_network(x) for x in xs]
    loss = sum((ye-y)**2 for ye, y in zip(ys, yr))
    # since _backward uses += on grad, every parameter's gradient
    # needs to be reset to 0 before each backward pass
    for param in neural_network.parameters():
        param.grad = 0
    # recall the backward function, otherwise the gradients of the params won't change
    loss.backward()
loss
>>>
Value(data=65.16298906351358)
It has been reduced from 101 to 65. Now we have a neural network that does somewhat better on the given dataset than a random one.