Support Vector Machines

Optimisation Objective

The new cost functions cost_1 and cost_0 indicate that we want:

\theta^T x \geq 1 \ if \ y = 1
\theta^T x \leq -1 \ if \ y = 0
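
For reference, these cost terms sit inside the overall SVM optimisation objective, written here with the same structure as the kernelised objective later in these notes:

\min_{\theta} C \sum_{i=1}^m \Big[ y^{(i)}\text{cost}_1(\theta^T x^{(i)}) + (1 - y^{(i)})\text{cost}_0(\theta^T x^{(i)}) \Big] + \dfrac{1}{2}\sum_{j=1}^n \theta^2_j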

Decision boundary

If the value of C is very large (say 100,000), then to minimise the cost function we want the value inside the summation to be as close to zero as possible, so that the first term vanishes. This only happens when the conditions above are satisfied!

A support vector machine looks as follows:

where the optimal decision boundary that’s chosen (in this case we have two classes) has the maximum margin from the support vectors (the filled colour shapes). When C is very large, the SVM decision boundary can be very sensitive to outliers: in the graph below, a single additional red cross at the bottom right causes the black decision boundary to shift to the purple one. However, if C is not too large, the boundary is more likely to stay at the black line.

Kernels

Kernels allow us to build complex, non-linear classifiers using Support Vector Machines. Firstly, given x, we compute new features depending on its proximity to landmarks l^{(1)}, l^{(2)}, l^{(3)}. To do this, we need to measure the “similarity” between x and a landmark:

f_i = similarity(x, l^{(i)}) = \exp(-\dfrac{||x - l^{(i)}||^2}{2\sigma^2})

This specific “similarity” function is called a Gaussian Kernel and it can be rewritten as:

f_i = similarity(x, l^{(i)}) = \exp(-\dfrac{\sum^n_{j=1}(x_j-l_j^{(i)})^2}{2\sigma^2})

If x is close to the corresponding landmark, the numerator is close to zero, so f_i is close to 1; if x is far from the landmark, f_i is close to 0. Each landmark therefore gives us one feature in our hypothesis, and σ^2 is a parameter of the Gaussian Kernel that controls how quickly f_i drops off as x moves away from the landmark. Combined with looking at the values inside Θ, we can choose these landmarks to get the general shape of the decision boundary. The effect of σ^2 is shown below:
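
As a small numeric sketch of this similarity function (using NumPy; the variable names are illustrative rather than from the original notes):

import numpy as np

def gaussian_kernel(x, landmark, sigma):
    """Similarity f_i between an example x and a landmark l^(i)."""
    diff = x - landmark
    return np.exp(-np.dot(diff, diff) / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
l1 = np.array([1.0, 2.0])   # landmark equal to x  -> similarity ~ 1
l2 = np.array([8.0, -3.0])  # landmark far from x  -> similarity ~ 0

print(gaussian_kernel(x, l1, sigma=1.0))  # 1.0
print(gaussian_kernel(x, l2, sigma=1.0))  # very close to 0
# A larger sigma^2 makes the similarity fall off more slowly with distance.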

Choosing landmarks

One method to choose the landmarks is to put them in the exact same locations as the training examples, i.e. l^{(i)} = x^{(i)} for i = 1, \dots, m. This gives us m landmarks, one per training example.
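
A minimal sketch of the resulting feature mapping, assuming NumPy and illustrative names: each training example becomes a landmark, and every example is then re-represented by its similarities to all m landmarks.

import numpy as np

def gaussian_features(X, landmarks, sigma):
    """Map each row of X to a vector of similarities to every landmark."""
    # Squared Euclidean distances between all examples and all landmarks
    sq_dists = ((X[:, None, :] - landmarks[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2 * sigma ** 2))

X_train = np.random.randn(100, 2)   # m = 100 examples, n = 2 original features
landmarks = X_train                 # put a landmark at every training example
F = gaussian_features(X_train, landmarks, sigma=0.5)
print(F.shape)                      # (100, 100): each example now has m features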

In order to get the parameters Θ, we can use the same SVM optimisation objective, but with f^{(i)} in place of x^{(i)} (so the number of features n now equals the number of landmarks m):

\min_{\Theta} C \sum_{i=1}^m \Big[ y^{(i)}\text{cost}_1(\Theta^Tf^{(i)}) + (1 - y^{(i)})\text{cost}_0(\Theta^Tf^{(i)}) \Big] + \dfrac{1}{2}\sum_{j=1}^n \Theta^2_j

Choosing SVM parameters

We choose two parameters, C and σ^2. C behaves like 1/λ: if C is large, we tend to get higher variance and lower bias; if C is small, lower variance and higher bias. With a large σ^2, the features f_i vary more smoothly, causing higher bias and lower variance. With a small σ^2, the features f_i vary less smoothly, causing lower bias and higher variance.
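
In practice, one common approach (not spelled out in the notes) is a grid search over C and σ^2 scored on a cross-validation set. A sketch with scikit-learn, whose RBF kernel is parameterised by gamma = 1/(2σ^2) rather than σ^2 directly; the toy data here is purely illustrative:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy data standing in for a real training set
rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)  # non-linear boundary

sigmas = [0.1, 0.3, 1.0, 3.0]
param_grid = {
    "C": [0.01, 0.1, 1, 10, 100],
    "gamma": [1 / (2 * s ** 2) for s in sigmas],  # gamma = 1 / (2 * sigma^2)
}

# 5-fold cross-validation over all (C, gamma) combinations
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)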

SVMs in Practice

There are lots of good SVM libraries. Andrew often uses ‘liblinear’ and ‘libsvm’. Many SVM libraries also have multi-class classification built in.

In practice, you need to make the following choices when implementing SVMs:

  • Parameter C
  • Similarity function (kernel), or no kernel (i.e. a standard linear classifier – a sensible choice when n is large and m is small)
  • If you choose the Gaussian Kernel (a sensible choice when n is small and m is large), you also need to choose σ^2

Remember to perform feature scaling before using the Gaussian Kernel. Make sure the similarity function satisfies “Mercer’s Theorem”, which ensures that the SVM package’s optimisations run correctly and do not diverge. Lastly, choose C and the kernel parameters using the training and cross-validation datasets.
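
For example, with scikit-learn (whose SVC wraps libsvm and LinearSVC wraps liblinear), feature scaling and a Gaussian-kernel SVM can be chained in a single pipeline; the data and parameter values below are placeholders:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.RandomState(1)
X = rng.randn(300, 3) * [1.0, 100.0, 0.01]   # features on very different scales
y = (X[:, 0] + X[:, 1] / 100 > 0).astype(int)

# Scale features first so no single feature dominates the kernel distances,
# then fit a Gaussian-kernel (RBF) SVM.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma=0.5))
model.fit(X, y)
print(model.score(X, y))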

Summary

  • If n is large (relative to m) [e.g. n = 10,000, m = 1000] – use logistic regression or SVM without a kernel (the “linear kernel”)
  • If n is small and m is intermediate [e.g. n = 1000, m = 10,000] – use SVM with a Gaussian Kernel
  • If n is small and m is large [e.g. n = 1000, m = 50,000+] – manually create/add more features, then use logistic regression or SVM without a kernel.

In the first case, we don’t have enough examples to fit a complicated non-linear hypothesis. In the second case, we have enough examples to fit a complex non-linear hypothesis, and a Gaussian-kernel SVM is still fast enough to train. In the last case, a Gaussian-kernel SVM would be slow to train on that many examples, so we manually create/add features and use logistic regression or an SVM without a kernel. Note that a neural network is likely to work well in any of these situations, but may be slower to train.
