Now that we know how neural networks function, and have built up some intuition about how to train them with backpropagation, it's time to look at how we can describe it mathematically.
The best way to learn how something works mathematically is to understand why it works, which is why I want you to go through the extra trouble of deriving the mathematics yourself.
Warning: This essay is going to be heavy on the math. If you're allergic to math, you might want to check out the more intuitive version (coming soon), but with that said, backpropagation is really not as hard as you might think.
Backpropagation is not as hard as you might think
This article assumes familiarity with forward propagation, and neural networks in general. If you haven't already, I recommend reading What is a Neural Network first.
Recall that the weights in a neural network are updated by minimizing an error function that describes how wrong the neural network's current hypothesis is.
If $x$ is the input for the first layer, and $L$ is the number of layers in the network, then $a^L$ is the network's hypothesis: $h(x) = a^L$. Finally, let $m$ be the number of examples, $n_l$ the number of neurons in layer $l$, and let $y$ be the correct answer given the input $x$.
In order for an error function to be suitable for backpropagation, the average error should be computable using:

$$E = \frac{1}{m} \sum_{x} E_x$$

Where $E_x$ is the error function for a specific example $x$.
This is necessary if the backpropagation procedure is to update the weights on the basis of more than one example, which generally gives a more direct route toward convergence and is usually preferred.
Furthermore, we assume that the error function can be written as a function of the network's hypothesis $a^L$, and the correct answer $y$.
One simple error function that satisfies these requirements, and which you probably already know, is the mean squared error (MSE), defined as:

$$E_x(a^L, y) = \frac{1}{2} \lVert a^L - y \rVert^2$$

For a single example, and for multiple examples:

$$E = \frac{1}{2m} \sum_{x} \lVert a^L - y \rVert^2$$

The factor $\frac{1}{2}$ is a common convention that cancels when differentiating.
We see that not only does it satisfy the averaging constraint, but it also only depends on the hypothesis (noted as $a^L$), and the actual answer (noted as $y$).
For notational simplicity, for the rest of the essay, we will omit the function variables, so $E(a^L, y)$ becomes $E$.
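To make the averaging requirement concrete, here is a minimal sketch in plain Python. It assumes the squared error is written with the conventional factor of one half, and the function names `example_error` and `average_error` as well as the sample numbers are made up for illustration.

```python
def example_error(hypothesis, target):
    """Error for one example: 0.5 * ||a^L - y||^2 (assumed convention)."""
    return 0.5 * sum((a - y) ** 2 for a, y in zip(hypothesis, target))


def average_error(hypotheses, targets):
    """Average of the per-example errors over all m examples."""
    m = len(hypotheses)
    return sum(example_error(h, y) for h, y in zip(hypotheses, targets)) / m


# Two made-up examples with two output neurons each.
hypotheses = [[0.8, 0.2], [0.4, 0.6]]
targets = [[1.0, 0.0], [0.0, 1.0]]

errors = [example_error(h, y) for h, y in zip(hypotheses, targets)]
total = average_error(hypotheses, targets)
print(errors, total)
```

Note that each per-example error only needs the hypothesis and the correct answer, which is exactly the second requirement.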
Recall that we use backpropagation to find each individual weight's contribution to the error function, which gradient descent then uses when updating the weights.
Backpropagation is just figuring out how awful each weight is
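In other words, backpropagation attempts to find:

$$\frac{\partial E}{\partial w_{jk}^l}$$

Where $w_{jk}^l$ is the weight connecting neuron $k$ in layer $l-1$ to neuron $j$ in layer $l$.
In order to find this, we introduce a new variable $\delta_j^l$, which is the error-sum of neuron $j$ in layer $l$; sometimes called the delta error. The delta error is defined as:

$$\delta_j^l = \frac{\partial E}{\partial z_j^l}$$

Recall that $z_j^l$ is the raw output signal of neuron $j$ in layer $l$ before the activation function has been applied, so $z_j^L$ is the raw output of a neuron in the last layer.
During backpropagation, we will find a way of computing the delta error, and translate it to $\frac{\partial E}{\partial w_{jk}^l}$.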
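We will derive $\frac{\partial E}{\partial w_{jk}^l}$ in three steps, each of which yields one of the three equations summarized below:
- Find a way of computing $\delta^L$ to initialize the process.
- Find a way of computing $\delta^l$ given $\delta^{l+1}$.
- Find a way of computing $\frac{\partial E}{\partial w_{jk}^l}$ given $\delta^l$.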
Recall that we can differentiate composite functions using the chain rule by:

$$\frac{d}{dx} f(g(x)) = f'(g(x)) \, g'(x)$$
The same principle holds for nested composite functions.
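If you want to convince yourself numerically, the following sketch compares the chain-rule derivative of a small composite function with a finite-difference estimate. The choice of a sigmoid wrapped around a linear function is only an illustrative assumption, as are all the names.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def g(x):
    """Inner function, chosen arbitrarily for the illustration."""
    return 3.0 * x + 1.0


def composite(x):
    """f(g(x)) with f being the sigmoid."""
    return sigmoid(g(x))


def composite_prime(x):
    """Chain rule: f'(g(x)) * g'(x), where g'(x) = 3."""
    return sigmoid_prime(g(x)) * 3.0


x, eps = 0.5, 1e-6
# Central finite difference as an independent estimate of the derivative.
numeric = (composite(x + eps) - composite(x - eps)) / (2 * eps)
analytic = composite_prime(x)
print(numeric, analytic)
```

The two numbers should agree to many decimal places; the same check works for more deeply nested functions.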
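Using the chain rule, we can reformulate $\delta_j^L$ in terms of the partial derivative of the activation function $\sigma$:

$$\delta_j^L = \frac{\partial E}{\partial z_j^L} = \sum_{k=1}^{n_L} \frac{\partial E}{\partial a_k^L} \frac{\partial a_k^L}{\partial z_j^L}$$

Where $n_L$ is the number of neurons in the last layer. Since $a_k^L = \sigma(z_k^L)$ depends on $z_j^L$ only when $k = j$, we can simplify the above to:

$$\delta_j^L = \frac{\partial E}{\partial a_j^L} \, \sigma'(z_j^L)$$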
We can vectorize the simplified equation by collecting $\frac{\partial E}{\partial a_j^L}$ into a gradient vector for the layer, $\nabla_a E$. Similarly, we can collect the raw outputs $z_j^L$ into a vector of all the raw outputs in the layer, $z^L$.
By doing so, we find the first equation:

$$\delta^L = \nabla_a E \odot \sigma'(z^L) \tag{1}$$

Where $\odot$ is the Hadamard product; elementwise multiplication.
While equation (1) describes the error in the last layer, $\delta^L$, equation (2) describes the error of a layer, $\delta^l$, in terms of the error of the layer in front of it, $\delta^{l+1}$.
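As a small illustration of the first equation, the sketch below computes the delta error of the last layer. It assumes a sigmoid activation and the squared error with a factor of one half, for which the gradient with respect to the activations is simply $a^L - y$; the function name `output_delta` and the numbers are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def output_delta(z_last, target):
    """Delta error of the last layer: grad_a(E) multiplied elementwise by sigma'(z^L)."""
    a_last = [sigmoid(z) for z in z_last]
    # Gradient of 0.5 * ||a - y||^2 with respect to a (assumed error function).
    grad_a = [a - y for a, y in zip(a_last, target)]
    # Hadamard product: multiply the two vectors element by element.
    return [g * sigmoid_prime(z) for g, z in zip(grad_a, z_last)]


delta = output_delta([0.0, 2.0], [1.0, 0.0])
print(delta)
```

For the first neuron the raw output is 0, so the activation is 0.5, the gradient is -0.5, and the sigmoid derivative is 0.25, giving a delta error of -0.125.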
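In order to achieve this, we rewrite $\delta_j^l$ in terms of the next layer, $\delta_k^{l+1}$:

$$\delta_j^l = \frac{\partial E}{\partial z_j^l}$$

Once again, we use the chain rule.

$$\delta_j^l = \sum_{k=1}^{n_{l+1}} \frac{\partial E}{\partial z_k^{l+1}} \frac{\partial z_k^{l+1}}{\partial z_j^l}$$

Where $n_{l+1}$ is the number of neurons in layer $l+1$.
Since $\delta_k^{l+1} = \frac{\partial E}{\partial z_k^{l+1}}$, we can rewrite the above as:

$$\delta_j^l = \sum_{k=1}^{n_{l+1}} \delta_k^{l+1} \frac{\partial z_k^{l+1}}{\partial z_j^l}$$

This works because the output of neuron $j$ in layer $l$ is carried forward to every neuron in layer $l+1$, so it affects the error through all of them. This is also why we sum over all the neurons. The equation above can be interpreted as the total error caused by $z_j^l$, expressed in terms of $\delta^{l+1}$.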
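We know from forward propagation that:

$$z_k^{l+1} = \sum_{i=1}^{n_l} w_{ki}^{l+1} a_i^l + b_k^{l+1}$$

And since $a_i^l = \sigma(z_i^l)$, we can rewrite the above as:

$$z_k^{l+1} = \sum_{i=1}^{n_l} w_{ki}^{l+1} \sigma(z_i^l) + b_k^{l+1}$$

By differentiating with respect to $z_j^l$, only the term with $i = j$ survives, and we find:

$$\frac{\partial z_k^{l+1}}{\partial z_j^l} = w_{kj}^{l+1} \, \sigma'(z_j^l)$$

By substituting this expression into the sum for $\delta_j^l$, we find:

$$\delta_j^l = \sum_{k=1}^{n_{l+1}} w_{kj}^{l+1} \, \delta_k^{l+1} \, \sigma'(z_j^l)$$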
If this is not obvious, I do encourage you to spend some time going through the equations in order to convince yourself that this is correct.
Finally, by vectorizing the above, we arrive at the final form for equation (2):

$$\delta^l = \left( (w^{l+1})^T \delta^{l+1} \right) \odot \sigma'(z^l) \tag{2}$$
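The backward step can be sketched in a few lines of plain Python. The sketch assumes a sigmoid activation and stores the weights of the next layer as a nested list where `weights_next[k][j]` connects neuron `j` in the current layer to neuron `k` in the next one; both that layout and the name `hidden_delta` are assumptions made for the illustration.

```python
import math


def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def hidden_delta(weights_next, delta_next, z):
    """Delta error of a layer, given the delta error of the layer after it."""
    delta = []
    for j, z_j in enumerate(z):
        # Row j of the transposed weight matrix times delta_next:
        # sum over every neuron k in the next layer.
        back = sum(weights_next[k][j] * delta_next[k]
                   for k in range(len(delta_next)))
        delta.append(back * sigmoid_prime(z_j))
    return delta


# Made-up numbers: 3 neurons in the current layer, 2 in the next.
weights_next = [[1.0, -2.0, 0.5],
                [0.0, 3.0, 1.0]]
delta_next = [0.1, -0.2]
z = [0.0, 0.0, 0.0]

delta = hidden_delta(weights_next, delta_next, z)
print(delta)
```

Indexing the weights as `[k][j]` while looping over `j` is what the transpose in the vectorized form amounts to.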
Equation (3) is derived in the exact same way as equation (2), using the chain rule, so I'll simply state the final form:

$$\frac{\partial E}{\partial w_{jk}^l} = a_k^{l-1} \, \delta_j^l \tag{3}$$

In equation (3), we see that an individual weight's contribution to the error function is equal to the scaled error it sends forward in the network.
Intuitively, this should make sense.
If we think of the error as throwing balls at a target, and if the percentage of balls missing the target is the delta error, and the rate of throwing is the activation, then the total number of balls missing the target is those two multiplied together - which is exactly what we do.
$a_k^{l-1}$ is how much a neuron is stimulated - how strong the output is, or the rate of throwing.
$\delta_j^l$ is our throwing accuracy, or rather, in a team of athletes trying to hit the target, how much an individual contributes to the overall number of balls that didn't hit the target.
Finally, we can confirm that this also works for the bias unit, where $a = 1$. It should just give us $\delta_j^l$ as there's no activation coefficient:

$$\frac{\partial E}{\partial b_j^l} = 1 \cdot \delta_j^l = \delta_j^l$$

Which we see it does.
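In code, turning the delta errors into gradients is just an outer product. The following sketch assumes the weight gradient is laid out so that `grad_w[j][k]` belongs to the weight from neuron `k` in the previous layer to neuron `j` in the current one; the name `layer_gradients` and the numbers are hypothetical.

```python
def layer_gradients(a_prev, delta):
    """Gradients of the error with respect to one layer's weights and biases."""
    # Each weight's gradient is the incoming activation times the delta error.
    grad_w = [[a_k * d_j for a_k in a_prev] for d_j in delta]
    # The bias behaves like a weight whose incoming activation is always 1.
    grad_b = list(delta)
    return grad_w, grad_b


a_prev = [0.5, 1.0, 0.0]   # activations of the previous layer (made up)
delta = [0.2, -0.4]        # delta errors of the current layer (made up)

grad_w, grad_b = layer_gradients(a_prev, delta)
print(grad_w, grad_b)
```

A neuron that was not stimulated at all (activation 0) gets a zero gradient on its outgoing weights, which matches the ball-throwing intuition: no throws, no misses.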
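Using these three equations, we can now describe the algorithm for backpropagation in a feed-forward network.
- We use equation (1) to calculate the delta error of the last layer.
- We use the delta error of the last layer to initialize a recursive process of calculating the delta error of all the previous layers using equation (2):

$$\delta^l = \left( (w^{l+1})^T \delta^{l+1} \right) \odot \sigma'(z^l)$$

- We use the delta errors with equation (3) to calculate the derivative of the error function with respect to each weight in the neural network, which can be used in gradient descent:

$$\frac{\partial E}{\partial w_{jk}^l} = a_k^{l-1} \, \delta_j^l$$

Finally, equations (2) and (3) can be combined into one recursive equation:

$$\frac{\partial E}{\partial w_{jk}^l} = a_k^{l-1} \left[ \left( (w^{l+1})^T \delta^{l+1} \right) \odot \sigma'(z^l) \right]_j$$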
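Putting the three steps together, here is a minimal sketch of the whole procedure for a single example, followed by a finite-difference check of one weight. It is an illustration under stated assumptions rather than a reference implementation: it assumes sigmoid activations in every layer, the squared error with a factor of one half, and weights stored as nested lists indexed `[layer][j][k]`. The network sizes, the input, and all names are made up.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def forward(weights, biases, x):
    """Return the raw outputs and the activations of every layer."""
    activations, raw = [x], []
    for w, b in zip(weights, biases):
        z = [sum(w_jk * a_k for w_jk, a_k in zip(row, activations[-1])) + b_j
             for row, b_j in zip(w, b)]
        raw.append(z)
        activations.append([sigmoid(z_j) for z_j in z])
    return raw, activations


def error(weights, biases, x, y):
    """Assumed error function: 0.5 * ||a^L - y||^2."""
    _, activations = forward(weights, biases, x)
    return 0.5 * sum((a - t) ** 2 for a, t in zip(activations[-1], y))


def backprop(weights, biases, x, y):
    raw, activations = forward(weights, biases, x)
    # Step 1: delta error of the last layer.
    delta = [(a - t) * sigmoid_prime(z)
             for a, t, z in zip(activations[-1], y, raw[-1])]
    grads_w, grads_b = [], []
    for l in range(len(weights) - 1, -1, -1):
        # Step 3: gradients from the delta error and incoming activations.
        grads_w.insert(0, [[a_k * d_j for a_k in activations[l]] for d_j in delta])
        grads_b.insert(0, list(delta))
        if l > 0:
            # Step 2: push the delta error one layer back.
            delta = [sum(weights[l][k][j] * delta[k] for k in range(len(delta)))
                     * sigmoid_prime(raw[l - 1][j])
                     for j in range(len(raw[l - 1]))]
    return grads_w, grads_b


random.seed(0)
sizes = [3, 4, 2]  # made-up network: 3 inputs, 4 hidden neurons, 2 outputs
weights = [[[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [[random.uniform(-1, 1) for _ in range(n_out)] for n_out in sizes[1:]]
x, y = [0.2, -0.4, 0.9], [1.0, 0.0]

grads_w, grads_b = backprop(weights, biases, x, y)

# Finite-difference check of a single weight in the first layer.
eps = 1e-6
weights[0][1][2] += eps
e_plus = error(weights, biases, x, y)
weights[0][1][2] -= 2 * eps
e_minus = error(weights, biases, x, y)
weights[0][1][2] += eps  # restore the original weight
numeric = (e_plus - e_minus) / (2 * eps)
print(numeric, grads_w[0][1][2])
```

The finite-difference estimate and the backpropagated gradient should agree closely. Comparing the two like this is a handy way of catching indexing mistakes when you implement the equations yourself.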
And that's it. You now know everything there is to know about how backpropagation works.
Don't worry if you don't immediately understand it; that's normal. Put this essay away, come back after a couple of days to review it, and do a couple of exercises.
Do this a couple of times, and your brain should start to pick it up, and you will become more comfortable with backpropagation.