To motivate why we separate the training and inference phases of neural networks, it is useful to analyse their computational complexity.
This essay assumes familiarity with analytical complexity analysis of algorithms, including big-O notation. If you need a recap, you should read the essay on computational complexity before continuing.
Forward propagation
Looking at the inference part of a feed-forward neural network, we have forward propagation.
Finding the asymptotic complexity of the forward propagation procedure can be done much like how we found the run-time complexity of matrix multiplication.
Before beginning, you should be familiar with the forward propagation procedure.
We assume the input vector can be described as:
$$x \in \mathbb{R}^{n}$$
Where the first element is the bias unit: $x_0 = 1$.
When implementing neural networks, it's often the case that all the samples are collected into a matrix with the dimensions $x \in \mathbb{R}^{\eta \times n}$ where $\eta$ is the total number of samples in the training set. This is done to save time on memory allocation, and since it is primarily a practical concern, we won't consider it further in the theory.
The input is treated in the same way as any other activation matrix, and has the index $x=a^{(0)}$. The zeroth element, $a_0^{(0)}$, is as usual the bias unit with a value of 1.
For forward propagation, we can write:
$$z^{(k)} = \theta^{(k)} a^{(k-1)}$$
$$z^{(k)} \in \mathbb{R}^{m \times 1} \space | \space \theta^{(k)} \in \mathbb{R}^{m \times n}$$
$$a^{(k)} = g\big( z^{(k)} \big)$$
Where $g(x)$ is the activation function which is evaluated elementwise. We therefore know that $a^{(k)}$ has the same dimensions as $z^{(k)}$.
We see that for each layer, a matrix multiplication and an activation function are computed. We know from the previous essay that naive matrix multiplication has an asymptotic run-time of $O(n^3)$, and since $g(x)$ is an elementwise function, we know that it has a run-time of $O(n)$.
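To make this concrete, below is a minimal sketch of the forward propagation procedure in NumPy. The sigmoid activation, the layer sizes, and the function names are assumptions made for this example (and bias handling is omitted); it is only meant to show where the matrix products and elementwise activations occur.

```python
import numpy as np

def g(z):
    # Elementwise activation function (sigmoid, chosen just for the example).
    return 1.0 / (1.0 + np.exp(-z))

def forward(thetas, x):
    # thetas[k] plays the role of theta^{(k+1)} with shape (n^{(k+1)}, n^{(k)}),
    # x is the input column vector a^{(0)}.
    a = x
    for theta in thetas:
        z = theta @ a  # one matrix product per layer
        a = g(z)       # one elementwise activation per layer
    return a

# Example: three layers of four neurons each and a four-dimensional input.
rng = np.random.default_rng(0)
thetas = [rng.standard_normal((4, 4)) for _ in range(3)]
x = rng.standard_normal((4, 1))
print(forward(thetas, x).shape)  # (4, 1)
```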
By analysing the dimensions of a feed forward neural network, we find:
$$\theta^{(0)} \in \mathbb{R}^{n^{(0)} \times 1}$$
$$\theta^{(1)} \in \mathbb{R}^{n^{(1)} \times n^{(0)}} $$
$$\theta^{(2)} \in \mathbb{R}^{n^{(2)} \times n^{(1)}}$$
More generally, we can write:
$$\theta^{(k)}= \begin{cases} \mathbb{R}^{n^{(k)}\times 1} \quad &\text{if} \space k=0\\ \mathbb{R}^{n^{(k)}\times n^{(k-1)}} \quad &\text{if} \space k>0 \end{cases}$$
Where $n^{(k)}$ is the number of neurons including the bias unit in layer $k$.
Recalling the $zad$ multiplication count from the previous post, we see that:
$$ \begin{aligned} a &= n^{(k)} \\ z &= n^{(k-1)} \\ d &= n^{(k-2)} \end{aligned} $$
From this we find that:
$$\begin{aligned} n_{mul} &= \sum_{k=2}^{n_{layers}} \big( n^{(k)}n^{(k-1)}n^{(k-2)} \big) + \big( n^{(1)}n^{(0)}1 \big) \\ n_{g} &= \sum_{k=1}^{n_{layers}} \big( n^{(k)} \big) \\ \end{aligned} $$
Where $n_{mul}$ is the number of multiplications performed, and $n_g$ is how many times we apply the activation function, $g$.
It should be noted, however, that we have rather naively assumed that $a^{(k)}$ has the same dimensions as $\theta^{(k)}$, which is obviously not the case, as $a^{(k)}\in \mathbb{R}^{n^{(k)}}$. (Thanks to Chuanting Zhang for suggesting making this assumption more clear.)
This gives us:
$$ \begin{aligned} \text{time} &= n_{mul} + n_{g} \Leftrightarrow \\ \text{time} &= \sum_{k=2}^{n_{layers}} \big( n^{(k)}n^{(k-1)}n^{(k-2)} \big) + \big( n^{(1)}n^{(0)}1 \big) + \sum_{k=1}^{n_{layers}} \big( n^{(k)} \big) \end{aligned} $$
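As a sanity check on the bookkeeping, the sketch below counts $n_{mul}$ and $n_g$ for a given list of layer sizes under the same naive assumption as above (treating $a^{(k-1)}$ as if it had the dimensions of $\theta^{(k-1)}$); the function name is purely illustrative.

```python
def count_operations(layer_sizes):
    # layer_sizes[k] = n^{(k)}, the number of neurons in layer k,
    # with layer_sizes[0] being the input dimension.
    n_mul = 0
    n_g = 0
    for k in range(1, len(layer_sizes)):
        # Naive assumption from the text: a^{(k-1)} is treated as if it had
        # the dimensions of theta^{(k-1)}, so the product costs
        # n^{(k)} * n^{(k-1)} * n^{(k-2)} for k >= 2 and n^{(1)} * n^{(0)} * 1
        # for k = 1.
        width = layer_sizes[k - 2] if k >= 2 else 1
        n_mul += layer_sizes[k] * layer_sizes[k - 1] * width
        n_g += layer_sizes[k]  # one activation per neuron in layer k
    return n_mul, n_g

# With n layers of n neurons each, n_mul grows like n^4 and n_g like n^2.
print(count_operations([4, 4, 4, 4, 4]))  # (208, 16)
```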
When analysing matrix algorithms, it's common to assume that the matrices are square; that is, they have the same number of rows as columns. By doing this, we find that:
$$n_{mul} = n_{layers} \cdot n^{3}$$
If we once again assume that there is the same number of neurons in each layer, and that the number of layers equals the number of neurons in each layer, we find:
$$n_{mul} = O(n\cdot n^3) = O(n^4)$$
The same can be done for the activations:
$$n_g = n_{layers} \cdot n = O(n^2)$$
The total run-time therefore becomes:
$$O(n^4+n^2) \Leftrightarrow$$
$$O(n^4) \because \forall n \geq 1 \space |\space n^4+n^2 \leq 2n^4$$
Backpropagation
We can find the run-time complexity of backpropagation in a similar manner.
Before beginning, you should be familiar with the [backpropagation procedure](https://kasperfred.com/series/introduction-to-neural-networks/how-does-backpropagation-work).
Backpropagation consists of computing the gradient of the cost with respect to the output activations, propagating the delta error backwards through the layers, and computing the weight gradients. We can safely ignore $\text{time}_{\nabla_a}$, the time it takes to compute the gradient with respect to the output activations, as it will be on the order of a constant:
$$ \text{time}_{\nabla_a} = k$$
This gives us:
$$\text{time}_{\text{error}} =\begin{cases}\alpha \cdot \beta \space | \space \nabla_a^{(L)} \in \mathbb{R}^{\alpha \times \beta} \quad &\text{if} \space k=L\\ \text{time}\Big( \big( \theta^{(k+1)^T} \delta^{(k+1)} \big) \odot g'(z^{(k)}) \Big) \quad &\text{if} \space k<L \end{cases} $$
If we assume that there are $n$ neurons in each layer, we find that:
$$\begin{aligned}&n^2 = O\big( \alpha \cdot \beta \space | \space \nabla_a^{(L)} \in \mathbb{R}^{\alpha \times \beta} \big) \\&n^3 = O\Big( \big( \theta^{(k+1)^T} \delta^{(k+1)} \big) \odot g'(z^{(k)}) \Big)\end{aligned}$$
The total run-time for the delta error then becomes:
$$O( \text{time}_{\text{error}} ) = n^2 + n^3 (L-1)$$
If we assume there are $n$ layers as we did during forward propagation, we find:
$$O( \text{time}_{\text{error}} ) = n^4$$
In order to compute the gradients for all the weights between two layers, we get:
$$O( \text{time}_{\text{weights}} ) = O(\text{time}_{\text{error}} ) + n^3 = n^4$$
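To see where these terms come from, below is a minimal per-layer sketch of the backpropagation step; the variable names and the sigmoid derivative are assumptions carried over from the forward propagation example.

```python
import numpy as np

def g_prime(z):
    # Derivative of the sigmoid activation used in the forward pass example.
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def backprop_layer(theta_next, delta_next, z, a_prev):
    # Delta error for layer k: (theta^{(k+1)T} delta^{(k+1)}) ⊙ g'(z^{(k)})
    delta = (theta_next.T @ delta_next) * g_prime(z)
    # Gradient for the weights feeding into layer k: delta^{(k)} a^{(k-1)T}
    grad = delta @ a_prev.T
    return delta, grad
```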
Plugging $\text{time}_{\text{weights}}$ into gradient descent, we get:
$$\text{time}_{\text{gradient descent}} =n_{\text{gradient iterations}} \cdot\text{time}_{\text{weights}}$$
If we assume $n$ gradient iterations:
$$O(\text{time}_{\text{gradient descent}}) =n \cdot n^4 = n^5$$
So by assuming that gradient descent runs for $n$ iterations, and that there are $n$ layers each with $n$ neurons as we did with forward propagation, we find the total run-time of backpropagation to be:
$$ O(n^5) $$
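For completeness, here is a minimal sketch of the outer gradient descent loop; the `backprop` callback and the learning rate are assumptions for the example, and the dummy "gradients" at the end merely shrink the weights.

```python
import numpy as np

def gradient_descent(thetas, backprop, iterations, learning_rate=0.01):
    # Each iteration performs one full backpropagation pass, so the total
    # work is the number of iterations times the cost of one pass.
    for _ in range(iterations):
        grads = backprop(thetas)  # one gradient matrix per layer
        thetas = [theta - learning_rate * grad
                  for theta, grad in zip(thetas, grads)]
    return thetas

# Dummy usage: "gradients" equal to the weights themselves, which simply
# shrinks the weights by 1% per iteration.
thetas = [np.ones((3, 3)) for _ in range(3)]
thetas = gradient_descent(thetas, lambda ts: ts, iterations=100)
```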
Discussion
We see that the learning phase (backpropagation) is slower than the inference phase (forward propagation). This is made even more pronounced by the fact that gradient descent often has to be repeated many times.
In fact, gradient descent has a convergence rate of $O\big( \frac{1}{\epsilon} \big)$ for a convex function where $\epsilon$ is the error of the final hypothesis.
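As a rough worked example, reaching a final error of $\epsilon = 10^{-4}$ would therefore require on the order of

$$n_{\text{gradient iterations}} = O\Big( \frac{1}{\epsilon} \Big) = O\Big( \frac{1}{10^{-4}} \Big) = O\big( 10^{4} \big)$$

gradient iterations.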
This results in a large constant factor that has real world consequences, but which big-O doesn't show.
One way of making algorithms run faster is by using parallel execution, for example by running the matrix operations on a GPU.
GPUs are specifically designed to run many matrix operations in parallel since 3D geometry and animation can be expressed as a series of linear transformations.
This is also why we usually train neural networks on GPUs.
Proper Learning
It's worth mentioning that in 1988, Pitt and Valiant formulated an argument that if RP $\neq$ NP, which is currently not known, and if it is NP-hard to distinguish realizable hypotheses from unrealizable hypotheses, then finding a correct hypothesis $h$ is also NP-hard.
This, however, seems to conflict with the complexity we found for backpropagation, which is only polynomial.
Whether a function has realizable examples is defined as:
$$\begin{aligned}& h_e \in h \space \text{with an error} \space < \frac{1}{n}\\& \forall i \in \mathbb{N}, [1;n] \space | \space h_e(x_i) = y_i \implies \text{Realizable}\end{aligned}$$
Here we see that a hypothesis is realizable if an experimental hypothesis $h_e$ with an error of less than $\frac{1}{n}$ can correctly guess all the examples within that error margin.
Here, $h_e$ would be a neural network.
Shalev-Shwartz argued against this by classifying neural networks, specifically deep neural networks, as doing improper learning by letting $h_e \notin h$, and argued that $h_e$ can still agree with $h$ even though the search space is not realizable. [^Shalev]
[^Shalev]: [Shalev-Shwartz's argument](https://arxiv.org/pdf/1311.2272.pdf)
The best run-time for neural networks is an area of active research.
Conclusion
We have derived the computational complexity of a feed-forward neural network, and seen why it's attractive to split the computation into a training phase and an inference phase, since backpropagation, $O(n^5)$, is much slower than forward propagation, $O(n^4)$.
We have also considered the large constant factor gradient descent requires to reach an acceptable accuracy, which strengthens the argument.
Furthermore, we have discussed some theoretical aspects of learning representations of functions, including the role of neural networks, but we were unable to reach a definitive conclusion.