What Is A Neural Network?
Overview and introduction to feed forward neural networks. Forward propagation is discussed in detail, and we see how we might train a network.
In recent years, neural networks have shown great potential across a wide range of industries with access to large quantities of data, and have come to outperform most other machine learning algorithms on many such problems.
In the following essays, we will analyse how neural networks work, how effective they are, and how to use them on practical problems.
It's assumed that you have basic familiarity with machine learning in general, and a good grasp of high school level math.
The overarching goal of neural networks, and machine learning in general, is to find the hypothesis
The goal of Machine Learning is to find a model that mimics the true distribution of the training data.
For the sake of simplicity, we will only consider a special type of neural network called a feed forward network in this essay. In later entries, we will also consider more complex architectures.
Once we understand feed forward neural networks, the same principles transfer easily to more complicated networks such as convolutional and recurrent neural networks.
A deep feed forward network consists of a series of neurons organized in layers. Each neuron can be thought of as a computational unit that does a simple calculation based on the input it receives.
The neurons are connected in such a way that they receive their input from all neurons in the previous layer, and send their output to all the neurons in the next layer, as illustrated in the figure above.
The output of the neurons in the final layer is the network's final hypothesis,
But what kind of computation do the neurons do, and how exactly do we compute the final hypothesis?
Neural Network Inference
The best way to understand what happens inside a neuron is to show it, which is why I've sketched the calculation done by a single neuron (the yellow one) below:
Here we see that the neuron computes the sum of the weighted activations of all the neurons in the previous layer, and then adds some bias. It then applies a mysterious activation function to the result.
A more general form of the calculation above can be written as:
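In one common notation, this reads as follows. The symbols are assumptions made for this sketch and may differ from the author's: $a_j^{(l)}$ is the activation of neuron $j$ in layer $l$, $w_{jk}^{(l)}$ is the weight on the connection from neuron $k$ in layer $l-1$, $b_j^{(l)}$ is the bias, and $f$ is the activation function.

$$a_j^{(l)} = f\left(\sum_{k} w_{jk}^{(l)}\, a_k^{(l-1)} + b_j^{(l)}\right)$$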
Since the weights are the only thing that seems to change between neurons, you might think that the weights are an important factor in training the network, and you'd be right, as we will see later.
In order to compute the network's hypothesis
But how do we initialize the process? Clearly, we will run out of previous layers eventually.
To initialize the process, we pass in our data,
This computation is called forward propagation, cleverly named so because you propagate forwards through the network. And you say that mathematicians are bad at naming things.
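As a minimal sketch of forward propagation, the Python below computes each layer's activations from the previous layer's, starting from the input data. The function names, the nested-list layout of the weights, the example 2-3-1 network with made-up weights, and the choice of sigmoid as the activation are all assumptions made for illustration. No activation is applied to the input layer here.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, used here as a stand-in activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, layers, activation=sigmoid):
    """Propagate the input x through the network, one layer at a time.

    `layers` is a list of (weights, biases) pairs, one per layer after the
    input layer. weights[j][k] connects neuron k in the previous layer to
    neuron j in the current layer.
    """
    activations = list(x)  # the data initializes the process
    for weights, biases in layers:
        activations = [
            # weighted sum of the previous activations, plus bias, then activation
            activation(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations  # output of the final layer: the hypothesis


# Hypothetical 2-3-1 network with made-up weights and biases.
layers = [
    ([[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], [0.0, 0.1, -0.1]),
    ([[0.3, -0.6, 0.9]], [0.05]),
]
hypothesis = forward([1.0, 0.5], layers)
```

Each pass through the loop only needs the activations of the layer before it, which is why the computation can run front to back in a single sweep.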
Matrix Based Neural Networks
While you can do all calculations regarding neural networks without matrices, using them significantly simplifies the notation and, as we will see later, is also more time efficient, which is why they will be used from now on.
Recall that elements of a matrix,
The weights in each layer of the network can be encoded in a matrix
A number of operations are defined on matrices, the most important of which for neural networks is matrix multiplication.
Matrix multiplication is somewhat different from scalar multiplication as it's not commutative.
Where the elements of matrix
Using matrix multiplication, we can write the big scary expression
Which is a lot less scary.
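To see that the matrix form is only a compact way of writing the same weighted sums, here is a small pure-Python sketch. The helper `matvec`, the weight values, and the activations are hypothetical, chosen only for illustration.

```python
def matvec(W, a):
    """Matrix-vector product: entry j is the dot product of row j of W with a."""
    return [sum(w_jk * a_k for w_jk, a_k in zip(row, a)) for row in W]


# Hypothetical weights for a layer of 2 neurons fed by 3 neurons.
W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
a = [1.0, 0.0, -1.0]  # activations of the previous layer

# The weighted sums written out neuron by neuron.
z_explicit = [
    W[0][0] * a[0] + W[0][1] * a[1] + W[0][2] * a[2],
    W[1][0] * a[0] + W[1][1] * a[1] + W[1][2] * a[2],
]

# The same sums expressed as a single matrix-vector product.
z_matrix = matvec(W, a)
```

Row `j` of the weight matrix holds the weights feeding neuron `j`, so one product yields the weighted sums for the whole layer at once.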
Furthermore, we can write almost the entire forward propagation procedure as:
While we are almost there, we haven't yet added the bias term.
To do that in matrix form, we can insert another element in the output,
And that's it. That is the complete forward propagation procedure in matrix form.
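The procedure can be sketched in pure Python as follows. This assumes one common way of handling the bias: a constant 1 is appended to each layer's activations, and the biases are stored as an extra column of the weight matrix. The function names, the example 2-2-1 network, and the use of sigmoid are likewise assumptions for illustration.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, assumed here as the activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, a):
    """Matrix-vector product W a, with W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, a)) for row in W]


def forward_matrix(x, weight_matrices, activation=sigmoid):
    """Forward propagation with the bias folded into each weight matrix.

    Each matrix has one extra column; that column holds the biases and is
    multiplied by a constant 1 appended to the incoming activations.
    """
    a = list(x)
    for W in weight_matrices:
        z = matvec(W, a + [1.0])            # weighted sums plus bias
        a = [activation(z_j) for z_j in z]  # activation, applied elementwise
    return a


# Hypothetical 2-2-1 network; the last column of each matrix is the bias.
W1 = [[0.5, -0.5, 0.1],
      [0.3, 0.8, -0.2]]
W2 = [[1.0, -1.0, 0.0]]
output = forward_matrix([1.0, 2.0], [W1, W2])
```

Folding the bias into the matrix keeps every layer down to one multiplication followed by one elementwise function.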
Feed forward neural networks are essentially just a bunch of matrix multiplications
For a more complete description, you can read the recap on matrices (coming soon).
Until now, I've teased the mysterious activation function
The purpose of the activation function is to introduce non-linearity into the network which enables it to encode more complex relationships.
If you've taken an introductory course on differential equations, you've probably already encountered the sigmoid function in a more general form as logistic growth.
This function limits the output to be between 0 and 1 which is ideal for binary classification problems.
As activation functions only take real-valued input, they are applied elementwise when using the matrix notation.
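For reference, the sigmoid function is $\sigma(z) = \frac{1}{1 + e^{-z}}$. A short sketch of applying it elementwise to a vector follows; the helper name `apply_elementwise` and the input values are made up for illustration.

```python
import math


def sigmoid(z):
    """Logistic sigmoid: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def apply_elementwise(f, vector):
    """Apply a scalar function to every element of a vector."""
    return [f(v) for v in vector]


# Hypothetical weighted sums for a layer of three neurons.
activated = apply_elementwise(sigmoid, [-2.0, 0.0, 2.0])
```

Negative inputs land below 0.5, positive inputs land above it, and an input of 0 maps to exactly 0.5.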
To solidify our understanding of neural networks, let's consider a very basic toy example.
Suppose we have just two input neurons, a hidden layer with one neuron, and a single output neuron as depicted below:
From the illustration, we see that the input layer has two neurons and two weights. Note that I've not drawn the bias unit for any of the layers here.
The hidden neuron has just a single weight associated with it because there's only one output neuron.
How do we find the value of the output neuron?
Since there are two input neurons, we know that the input,
We can find the output value of the first neurons by simply applying the activation function
We can now calculate the input to the hidden neuron by summing the weighted outputs of the input neurons:
We can then find the output of the hidden neuron by applying the activation function, like we did with the first layer.
This process can be repeated for the output layer as well to find the network's final hypothesis.
You might ask: what can this be used for?
One thing that it can be used for is encoding the OR function. Recall the truth table for OR:
If at least one input is 1, then the whole output is True, or 1.
If we let the activation function for the first two layers be the identity function, $f(x) = x$, which essentially ignores the activation, and use sigmoid for the output layer, we can use the weights:
To verify that these weights are correct, we can look at the value of the hidden neuron:
Between the hidden neuron, and the output neuron, we apply the sigmoid activation function, so we end up with:
Which is what we wanted.
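The check can also be run in code. The weights and bias below are assumptions picked for this sketch, not necessarily the values used above: both input weights are 20, the hidden neuron has a bias of -10, and the hidden-to-output weight is 1. The input and hidden layers use the identity activation, and the output uses sigmoid.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def identity(z):
    return z


# Assumed values for illustration only.
W_INPUT = (20.0, 20.0)  # weights from the two input neurons to the hidden neuron
B_HIDDEN = -10.0        # bias of the hidden neuron (the bias unit is not drawn)
W_HIDDEN = 1.0          # weight from the hidden neuron to the output neuron


def or_network(x1, x2):
    """Forward pass through the 2-1-1 toy network."""
    a1, a2 = identity(x1), identity(x2)
    hidden = identity(W_INPUT[0] * a1 + W_INPUT[1] * a2 + B_HIDDEN)
    return sigmoid(W_HIDDEN * hidden)


# Round each output to the nearest integer to read it as 0 or 1.
truth_table = {
    (x1, x2): round(or_network(x1, x2))
    for x1 in (0, 1)
    for x2 in (0, 1)
}
```

With these values the hidden neuron is -10 when both inputs are 0 and at least 10 otherwise, so the sigmoid output sits very close to 0 or 1.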
In fact, since we could have applied the sigmoid function already on the hidden neuron, it would still work if we removed the hidden unit.
A problem where you have to use a hidden layer is the XOR function. Recall the truth table of XOR:
I won't go through what the correct weights are here, but I encourage you to try to figure it out yourself.
Hint: You can use the same architecture as above, but with two hidden neurons.
Until now, we have manually found the values of the weights, but this quickly becomes impractical for large networks, so how do we make the computer automagically find an optimal set of weights?
Training A Neural Network
Neural networks support two modes of operation:
- Inference (Forward propagation)
- Training (Backpropagation)
So far, we have only looked at the inference phase, which is actually backwards, as the training phase comes before the inference phase.
There's too much to cover about the training phase to do it all in detail in just a single essay, so we will only talk broadly about the different components that make up the training phase before going into more detail in later chapters.
In fact, there are multiple ways of training a neural network, but by far the most common method is gradient descent with backpropagation, which is what we will primarily focus on.
During the training phase, you attempt to find a set of weights such that the network's hypothesis
The training algorithm can be described as follows:
- Start with a bad set of weights
- Iteratively work towards a slightly better set of weights
- Repeat until the weights are good enough
But how do we determine what good weights are?
And how do we know how to update the weights to make them better?
In order to determine how bad the network's current hypothesis is, we use an error function,
An error function describes how wrong a model's hypothesis is.
The simplest error function, one which you're probably already familiar with, is mean squared error defined as:
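A common form of mean squared error averages the squared differences between hypotheses and targets, $E = \frac{1}{n}\sum_{i=1}^{n}(h_i - y_i)^2$. The symbols $E$, $h_i$ (the network's hypothesis for example $i$), $y_i$ (the target) and $n$ are notation assumed here, and some texts include an extra factor of $\frac{1}{2}$. A minimal Python sketch, with a hypothetical function name:

```python
def mean_squared_error(hypotheses, targets):
    """Average of the squared differences between hypotheses and targets."""
    if len(hypotheses) != len(targets) or not targets:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(targets)
    return sum((h - y) ** 2 for h, y in zip(hypotheses, targets)) / n


# One correct and one completely wrong hypothesis.
example_error = mean_squared_error([1.0, 0.0], [1.0, 1.0])
```

The error is 0 only when every hypothesis matches its target, and it grows quadratically with the size of each mistake.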
The goal then becomes to find a set of weights that minimizes the value of
The weights are updated using an algorithm called gradient descent, which iteratively updates each weight in proportion to its influence on the error function.
Neural Networks learn by iteratively changing the weights to minimize an error function
Each individual weight's influence on the error function is determined via backpropagation, so called because you propagate backwards through the network, starting with the last layer.
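To give a rough feel for the training loop, here is a sketch of gradient descent on a single sigmoid neuron learning OR with mean squared error. It is deliberately simplified: with only one layer, the gradient is derived by hand with the chain rule, so full backpropagation through several layers is not shown. The learning rate, the stopping threshold, the zero starting weights, and all names are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Training data for OR: ((x1, x2), target).
DATA = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
        ((1.0, 0.0), 1.0), ((1.0, 1.0), 1.0)]


def predict(weights, bias, x):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def error(weights, bias):
    """Mean squared error over the training data."""
    return sum((predict(weights, bias, x) - y) ** 2 for x, y in DATA) / len(DATA)


def train(learning_rate=1.0, max_steps=10000, good_enough=0.01):
    weights, bias = [0.0, 0.0], 0.0           # start with a bad set of weights
    for _ in range(max_steps):
        if error(weights, bias) < good_enough:
            break                             # the weights are good enough
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in DATA:
            h = predict(weights, bias, x)
            # Chain rule: dE/dz = (2 / n) * (h - y) * h * (1 - h)
            delta = 2.0 * (h - y) * h * (1.0 - h) / len(DATA)
            for k in range(len(weights)):
                grad_w[k] += delta * x[k]     # dE/dw_k
            grad_b += delta                   # dE/db
        # Move each parameter a small step against its gradient.
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


weights, bias = train()
predictions = [round(predict(weights, bias, x)) for x, _ in DATA]
```

Each step changes a parameter in proportion to its gradient, so parameters with more influence on the error move further.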
We will examine these algorithms more closely in later chapters.
We have looked at how feed forward neural networks are constructed, and how inference works through forward propagation. We have solidified this knowledge by constructing a small example.
Furthermore, we have discussed how we might automate the training process, which we will look at more closely in a later chapter.
And the bias depending on whether you consider the bias weight an actual weight. ↩︎
Sometimes, but rarely, an activation is applied to the input layer too. ↩︎
Historically, the sigmoid activation function was common. It is less so today due to its numerous problematic properties which we discuss later. ↩︎
Gradient descent and algorithms like gradient descent. ↩︎