Gradient Descent: How Machines Learn

The bread and butter of neural networks

Okay so you have an objective function which you want to optimize. But how do you find the extremum of the function?

Meet gradient descent.

Gradient descent is an iterative algorithm that slowly finds a local minimum of a function. The algorithm is a first-order optimization method since it uses the first-derivative - the gradient of the function.

It works by finding the gradient of the parameters at each step and updating them a little bit in a direction anti-parallel to the gradient (and parallel if maximizing the function - called gradient ascent).

Formulation

Let $f(x;\theta)$ be the objective function parameterized by $\theta$ which we want to minimize.

$$\theta_n = \theta_{n-1} - \alpha \frac{d}{d\theta} f(\theta_{n-1})$$

where $\alpha$ is a parameter controlling how fast we update the parameters usually a low number of the order of $10^{-3}$.

The algorithm assumes that the function $f$ is continuous which is not an unreasonable requirement, and it will find the global minimum if the function is convex. However, in the real world, many functions are not neat and convex, so the algorithm will only find a local minimum which can cause problems if you need the global minimum (which is a much harder problem). Luckily, for properly designed deep neural networks, a local minimum is often good enough, and there are some arguments that the global minimum will severely overfit the data since models are very expressive and data is noisy.

Intuition

To better understand why this algorithm works, it can be helpful to visualize what happens at every step.

Function:

Initial guess: (The initial xx-value x0x_0)

α\alpha:

Max iterations:

Using the interactive visualization above, we can see that gradient descent works similarly to "placing a ball on the curve" and let it roll down the curve to the lowest point. Moreover, we see that it is sensitive to where we place the "ball". For example, if we place the ball at $2$ with the polynomial function, then it won't reach the global minimum.

We also see that the ball moves more quickly when the descent is steeper, and slows down as it reaches its destination and the curve is flatter. This is expected as the amount we update the parameters is proportional to the gradient.