I spent the last few days learning from The spelled-out intro to neural networks and backpropagation: building micrograd. First, the concept of a derivative was introduced, along with differentiation, which is the basis of backpropagation: to find out how the final result is influenced by a parameter, we need the derivative to determine the direction (positive or negative) and strength (absolute value) of that influence. And by the definition of the derivative, the value of a particular derivative can be estimated easily.
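For example, a tiny numerical check of that idea (my own sketch, not code from the lecture) might look like this:
# approximate the derivative of f at x by nudging x a little, straight from the definition
def numerical_derivative(f, x, h=1e-6):
    return (f(x + h) - f(x)) / h

# d/dx of 3*x**2 at x = 2.0 is 12; the approximation comes out very close
print(numerical_derivative(lambda x: 3 * x**2, 2.0))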
A Python class Value was created after this prerequisite information. It overloads some operators like *, /, power, etc., like this:
class Value:
    def __add__(self, other):
        return Value(self.data + other.data)

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data)
        return out
and attributes to keep the child nodes, the operator, the gradient, and the backward function were added to the constructor:
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self._children = set(_children)
        self._op = _op
        self.grad = 0
        self._backward = lambda: None
The gradient defaults to 0 and the backward function to a no-op, and all these details were then wired into the operator overloads:
class Value:
    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, _children=(self, other), _op="+")
        def _backward():
            # for addition, the output gradient flows straight through to both operands
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out
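The __mul__ overload gets the same treatment; by the chain rule, each operand's gradient picks up the other operand's data (a sketch following the same pattern as __add__ above):
    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, _children=(self, other), _op="*")
        def _backward():
            # chain rule: d(out)/d(self) = other.data and d(out)/d(other) = self.data
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out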
Such an __add__ can handle Value(2.0) + 1, but 1 + Value(2.0) won't work, because int has no operator overload that knows about Value. So an __radd__ was added; Python falls back to it with the operands swapped, and it simply calls __add__:
class Value:
    def __add__(self, other):
        ...
        return out

    def __radd__(self, other):
        return self + other
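A quick check of the fallback (my own sketch):
a = Value(2.0)
print((a + 1).data)   # 3.0, handled by Value.__add__
print((1 + a).data)   # 3.0, int + Value falls back to Value.__radd__, which reuses __add__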
To iterate over all nodes of the computation graph, a function that builds a topological ordering was implemented:
topo = []
visited = set()
def build_topo(v):
    if v not in visited:
        visited.add(v)
        for child in v._children:
            build_topo(child)
        # a node is appended only after all of its children
        topo.append(v)
This function visits every node exactly once and appends it to the topo list after all of its children, so walking topo in reverse order visits every node before its children, which is exactly the order backpropagation needs.
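In micrograd, this ordering is what drives the backward() call used later as loss.backward(): it seeds the output's gradient with 1 and then runs every node's _backward in reverse topological order. A sketch of how that method might look on the Value class above:
class Value:
    ...
    def backward(self):
        # build the topological order starting from this node
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)

        # d(self)/d(self) = 1, then propagate gradients backwards through the graph
        self.grad = 1
        for node in reversed(topo):
            node._backward()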
In the next step, a Multi-Layer Perceptron (MLP) was created like this:
import random

class Neuron:
    def __init__(self, nin):
        self.ws = [Value(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Value(random.uniform(-1, 1))
    def __call__(self, xs):
        out = sum((w*x for w, x in zip(self.ws, xs)), self.b)
        out = out.tanh()  # tanh is implemented as a method on Value in the lecture
        return out
    def parameters(self):
        return self.ws + [self.b]

class Layer:
    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]
    def __call__(self, xs):
        outs = []
        for neuron in self.neurons:
            outs.append(neuron(xs))
        # a layer with a single neuron returns a plain Value instead of a list
        return outs[0] if len(outs) == 1 else outs
    def parameters(self):
        params = []
        for neuron in self.neurons:
            params += neuron.parameters()
        return params

class MLP:
    def __init__(self, nin, layersz):
        sz = [nin] + layersz
        self.layers = [Layer(sz[i], sz[i+1]) for i in range(len(sz)-1)]
    def __call__(self, xs):
        for layer in self.layers:
            xs = layer(xs)
        return xs
    def parameters(self):
        params = []
        for layer in self.layers:
            params += layer.parameters()
        return params
In this section an MLP is built, which is a neural network with multiple layers of neurons. A neuron takes several inputs (nin in the code above), multiplies each by a weight, adds a bias, and finally applies an activation function to the sum to produce a single output. A layer consists of many such neurons (nout of them); since each neuron produces exactly one output, they all receive the same inputs but hold different weights. An MLP consists of many layers, where every neuron in a layer takes the previous layer's outputs as its inputs; each layer can have a different number of neurons, which is defined by the variable layersz in the code above.
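To make a single neuron concrete, here is a tiny worked example with made-up numbers (my own sketch, using plain floats instead of Value objects):
import math

x  = [1.0, 2.0]    # hypothetical inputs
ws = [0.5, -1.0]   # hypothetical weights
b  = 0.2           # hypothetical bias

pre = sum(w * xi for w, xi in zip(ws, x)) + b   # 0.5*1.0 + (-1.0)*2.0 + 0.2 = -1.3
out = math.tanh(pre)                            # the activation squashes it to about -0.86
print(out)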
Assume we want to create an MLP with 3 layers of 2, 4, and 1 neurons respectively, taking 4 inputs at the beginning; we can simply call
layersz = [2, 4, 1]
nin = 4
neural_network = MLP(nin, layersz)
to have a neural network ready. But currently it can barely do anything, because its parameters, the weights and bias of each neuron, are just random numbers between -1 and 1, from this code:
self.ws = [Value(random.uniform(-1, 1)) for _ in range(nin)]
self.b = Value(random.uniform(-1, 1))
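As a quick sanity check of the wiring (my own arithmetic, not from the lecture): the first layer has 2 neurons with 4 weights and a bias each, the second has 4 neurons with 2 weights and a bias each, and the last has 1 neuron with 4 weights and a bias, so the network holds 2*(4+1) + 4*(2+1) + 1*(4+1) = 27 parameters in total:
print(len(neural_network.parameters()))   # 27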
To make it useful we now have to change them. Let's create a dataset: xs stands for a group of inputs x, and yr stands for the real y corresponding to each x. For example, the input [1, 3, 4, 1] has y = 1 as its result.
xs = [
    [1, 3, 4, 1],
    [2, 3, 5, 7],
    [8, 6, 3, 2],
]
yr = [1, 2, 9]
To train our neural network, we feed in these xs and compare the outputs ys to the real values yr. To measure how close we are to the right answer, we use (yr - ys)^2 as the loss, measure each parameter's influence on the loss, and then minimize the loss.
ys = [neural_network(x) for x in xs]
ys
>>>
[Value(data=-0.5970930530845828),
Value(data=-0.5396885416549594),
Value(data=-0.6383425030522946)]
and the loss with the current parameters is
loss = sum((ye-y)**2 for ye, y in zip(ys, yr))
loss
>>>
Value(data=101.8983703149689)
To reduce it, we need to adjust each parameter. By how much is decided by its gradient: it tells us in which direction and how strongly the parameter influences the loss, so we compute the gradients via backpropagation, and to make the loss smaller we step against the direction of the gradient.
loss.backward()
Then we get all parameters and nudge each one against its gradient by a small step (the learning rate):
learning_rate = 0.001
for param in neural_network.parameters():
    param.data -= param.grad*learning_rate
recalculate loss:
ys = [neural_network(x) for x in xs]
loss = sum((ye-y)**2 for ye, y in zip(ys, yr))
loss
>>>
Value(data=101.15491729894036)
smaller than the previous one. If we repeat this process 20 times:
for _ in range(20):
    learning_rate = 0.001
    # update every parameter against its gradient
    for param in neural_network.parameters():
        param.data -= param.grad*learning_rate
    # forward pass and loss
    ys = [neural_network(x) for x in xs]
    loss = sum((ye-y)**2 for ye, y in zip(ys, yr))
    # since _backward uses += on grad, every parameter's gradient
    # needs to be reset to 0 before each backward pass
    for param in neural_network.parameters():
        param.grad = 0
    # recall the backward function, otherwise the gradients of the params won't change
    loss.backward()
loss
>>>
Value(data=65.16298906351358)
It has been reduced from 101 to 65. Now we have a neural network that does somewhat better on the given dataset than a random one.