Kernel Functions

A mathematical inquiry into kernel functions and their use.

Kernel functions are a stable in machine learning, and is used in a wide array of algorithms from support vector machines, and principal component analysis to convolutional neural networks.

However, these algorithms are often taught in isolation which can obscure the role of kernel functions which may result in them seeming rather arbitrary.

Therefore, it can be useful to look at kernel functions in isolation.

What are kernel functions?

Kernel functions are a generalization of the vector dot product.

Recall that the dot product between two vectors a\vec{a}, b\vec{b} both in Rn\mathbb{R}^n is given by:

dot(a,b)=i=0naibi \text{dot}(\vec{a}, \vec{b}) = \sum_{i=0}^{n} \vec{a_i} \vec{b_i}

Suppose we have a function f:RnRmf : \mathbb{R}^n \rightarrow \mathbb{R}^m then the dot product of a\vec{a}, and b\vec{b} in the mm-dimensional space is given by:

f(a)Tf(b) f(\vec{a})^Tf(\vec{b})

The value of which is in R\mathbb{R}. A corresponding kernel function, kk can be described as:

k(a,b)=f(a)Tf(b) k(\vec{a},\vec{b}) = f(\vec{a})^Tf(\vec{b})

Notice that kk does not evaluate ff but maps Rn\mathbb{R}^n directly to R\mathbb{R}. This is sometime called the kernel trick, and is what support vector machines use to segment non-linear groups with a linear hyperplane.

Why kernel functions?

Kernel functions can implicitly operate in a high dimensional feature space without ever visiting the space. It turns out that it's often computationally beneficial to operate in the low dimensional feature space.

As it turns out, it's often cheaper to compute kk than summing over all the coordinates in the mm-dimensional feature space.

For example, consider the data on the left.

kernel_data (Thanks wikimedia for the picture)

Ordinarily, we wouldn't be able to separate the data using a linear SVM, however, by using the kernel, k(a,b)=(a,b,a2+b2)k(a,b)=(a,b,a^2+b^2) we effectively transform the data to what's depicted on the right which is linearily separable.

Which we trivially can segment using a linear SVM.


Kernel functions provide a computationally cheap way of computing the dot product of two vectors without explicitly visiting the high dimensional feature space, and when used in the kernel trick, kernel functions can enable linear classifiers to make non-linear classifications.

continue reading

Gradient Descent: How Machines Learn

Okay so you have an objective function which you want to optimize. But how do...

A framework for thinking about risk

How long do you want to live? 80 years? 120 years? 200 years?  1000 year...