Supervised Learning: Support Vector Machines

Introduction to Support Vector Machines

Support vector machines (SVMs) are supervised learning models with associated learning algorithms that analyse data and recognise patterns, used for both classification and regression analysis. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.

Here’s the basic intuition behind SVMs. Imagine the labelled training data below: two classes, a red square class and a blue circle class. The objective is that when a new data point arrives, we want to know whether it belongs to the red square class or the blue circle class (binary classification). Intuitively, we can draw a separating “hyperplane” between the classes. However, as the first graph shows, there are many hyperplanes that separate the two classes perfectly, so the question is how to choose the hyperplane that optimally separates them. We want the hyperplane that maximises the margin between the classes, as shown in the second graph. The data points that the margin lines touch (filled with the corresponding class colours) are known as support vectors.
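For reference, this maximum-margin idea has a standard mathematical form. Writing the separating hyperplane as w·x + b = 0 and the class labels as y_i ∈ {−1, +1}, the classic hard-margin SVM (a sketch, assuming the data are linearly separable) solves:

\min_{w,\,b} \ \tfrac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i\,(w^\top x_i + b) \ge 1 \ \ \text{for all } i

Since the margin width works out to 2/‖w‖, minimising ‖w‖² is exactly maximising the margin.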

What if we are dealing with a non-linearly separable dataset, as shown below (left graph)? We can extend SVMs through the “kernel trick”, which implicitly maps the same dataset into a higher-dimensional space (here, adding a third Z axis). As shown on the right graph below, the dataset becomes separable in that space.
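As a quick illustration of this idea in code (a toy sketch, separate from the case study below, using scikit-learn’s make_circles helper rather than the dataset pictured):

# Toy illustration of the kernel trick.
# make_circles produces two concentric rings: not linearly separable in 2D.
# Equivalently to adding a third feature z = x**2 + y**2 by hand, an RBF
# kernel lets the SVM separate the rings without building that feature.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X_toy, y_toy = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=42)

print(SVC(kernel='linear').fit(X_toy, y_toy).score(X_toy, y_toy))  # roughly 0.5
print(SVC(kernel='rbf').fit(X_toy, y_toy).score(X_toy, y_toy))     # close to 1.0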

Case study

We’ll be using Support Vector Machines to predict whether a tumor is malignant or benign.

1. Data Preparation

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Scikit-learn has its own datasets you can import; in this case, we import the breast cancer dataset.

In [2]:
from sklearn.datasets import load_breast_cancer
In [3]:
cancer = load_breast_cancer()
In [4]:
cancer.keys()
Out[4]:
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])

We can grab information and arrays out of this dictionary to set up our data frame and understanding of the features (a quick inspection snippet follows the list):

  • DESCR: gives you the description of the dataset
  • data: your dataset
  • feature_names: variable names
  • target: the labels as a dummy variable (0 = malignant, 1 = benign)
  • target_names: Malignant or Benign
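
For example, a quick way to inspect these entries (an exploratory snippet, not one of the original notebook cells):

# Peek at the pieces of the dataset bunch
print(cancer['data'].shape)      # (569, 30): 569 samples, 30 features
print(cancer['target_names'])    # ['malignant' 'benign']
print(cancer['DESCR'][:200])     # start of the dataset description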
In [8]:
df_feat = pd.DataFrame(cancer['data'],columns=cancer['feature_names'])
df_feat.head(2)
Out[8]:
   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38           122.8     1001.0          0.11840
1        20.57         17.77           132.9     1326.0          0.08474

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419
1           0.07864          0.0869              0.07017         0.1812

   mean fractal dimension  ...  worst radius  worst texture  worst perimeter  \
0                 0.07871  ...         25.38          17.33            184.6
1                 0.05667  ...         24.99          23.41            158.8

   worst area  worst smoothness  worst compactness  worst concavity  \
0      2019.0            0.1622             0.6656           0.7119
1      1956.0            0.1238             0.1866           0.2416

   worst concave points  worst symmetry  worst fractal dimension
0                0.2654          0.4601                  0.11890
1                0.1860          0.2750                  0.08902

2 rows × 30 columns

In [10]:
cancer['target_names']
Out[10]:
array(['malignant', 'benign'],
      dtype='<U9')

2. Train the Support Vector Classifier

Splitting data

In [11]:
from sklearn.model_selection import train_test_split
In [12]:
X = df_feat
y = cancer['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=101)

Training

In [13]:
from sklearn.svm import SVC
In [14]:
model = SVC()
In [15]:
model.fit(X_train,y_train)
Out[15]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

Predictions and Evaluations

In [16]:
pred = model.predict(X_test)
In [17]:
from sklearn.metrics import classification_report, confusion_matrix
In [18]:
print(confusion_matrix(y_test, pred))
[[  0  66]
 [  0 105]]
In [19]:
print(classification_report(y_test,pred))
             precision    recall  f1-score   support

          0       0.00      0.00      0.00        66
          1       0.61      1.00      0.76       105

avg / total       0.38      0.61      0.47       171

/Users/ryanong/anaconda3/lib/python3.6/site-packages/sklearn/metrics/classification.py:1135: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)

Notice that we are classifying everything into a single class! This means our model needs to have its parameters adjusted (it may also help to normalise the data; see the sketch below). We can search for good parameters using a grid search.
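
As a side note on normalising, a minimal sketch is to put a StandardScaler in front of the SVC in a pipeline (shown for completeness; the rest of this post tunes the unscaled features with a grid search instead):

# Normalising the features before the SVC, via a pipeline
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

scaled_model = make_pipeline(StandardScaler(), SVC())
scaled_model.fit(X_train, y_train)
print(scaled_model.score(X_test, y_test))  # typically much better than the unscaled fit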

GridSearch

Finding the right parameters (like what C or gamma values to use) is a tricky task! We can try a bunch of combinations and see what works best. This idea of creating a ‘grid’ of parameters and trying out all the possible combinations is called a grid search. Scikit-learn has this functionality built in with GridSearchCV! (CV = cross-validation)

GridSearchCV takes a dictionary that describes the parameters that should be tried and a model to train. The grid of parameters is defined as a dictionary, where the keys are the parameters and the values are the settings to be tested.

In [20]:
from sklearn.model_selection import GridSearchCV

One of the great things about GridSearchCV is that it is a meta-estimator. It takes an estimator like SVC and creates a new estimator that behaves exactly the same way, in this case, like a classifier. You should add refit=True and set verbose to whatever number you want: the higher the number, the more verbose the output (verbose just means the text output describing the process).

  • C controls the cost of misclassification on the training data – a high C value gives you low bias and high variance, as there is a high penalisation for misclassification
  • gamma controls how far the influence of each training example reaches – a large gamma gives each example a very local influence, which leads to low bias and high variance (overfitting), while a small gamma does the opposite; a quick sketch of this effect follows the list
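
One quick way to see the gamma effect on this data (an illustrative sketch reusing the split from above; exact numbers will vary):

# Large gamma: near-perfect train accuracy but poor test accuracy (overfitting).
# Small gamma: both scores high and close together.
from sklearn.svm import SVC

for g in [1, 0.0001]:
    m = SVC(kernel='rbf', C=1, gamma=g).fit(X_train, y_train)
    print(g, m.score(X_train, y_train), m.score(X_test, y_test))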
In [21]:
# setting the values of C and gamma to feed into GridSearchCV
param_grid = {'C': [0.1,1, 10, 100, 1000], 'gamma': [1,0.1,0.01,0.001,0.0001], 'kernel': ['rbf']}
In [23]:
grid = GridSearchCV(SVC(),param_grid,verbose=3)

The fit method: first, it runs the same fit loop with cross-validation to find the best parameter combination. Once it has the best combination, it runs fit again on all the data passed to fit (without cross-validation), to build a single new model using the best parameter setting.

In [25]:
grid.fit(X_train,y_train)
Fitting 3 folds for each of 25 candidates, totalling 75 fits
[CV] C=0.1, gamma=1, kernel=rbf ......................................
[CV] ............. C=0.1, gamma=1, kernel=rbf, score=0.631579 -   0.0s
[CV] C=0.1, gamma=1, kernel=rbf ......................................
[CV] ............. C=0.1, gamma=1, kernel=rbf, score=0.631579 -   0.0s
[CV] C=0.1, gamma=1, kernel=rbf ......................................
[CV] ............. C=0.1, gamma=1, kernel=rbf, score=0.636364 -   0.0s
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................
[CV] ........... C=0.1, gamma=0.1, kernel=rbf, score=0.631579 -   0.0s
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................
[CV] ........... C=0.1, gamma=0.1, kernel=rbf, score=0.631579 -   0.0s
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................
[CV] ........... C=0.1, gamma=0.1, kernel=rbf, score=0.636364 -   0.0s
[CV] C=0.1, gamma=0.01, kernel=rbf ...................................
[CV] .......... C=0.1, gamma=0.01, kernel=rbf, score=0.631579 -   0.0s
[CV] C=0.1, gamma=0.01, kernel=rbf ...................................
[CV] .......... C=0.1, gamma=0.01, kernel=rbf, score=0.631579 -   0.0s
[CV] C=0.1, gamma=0.01, kernel=rbf ...................................
[CV] .......... C=0.1, gamma=0.01, kernel=rbf, score=0.636364 -   0.0s
[CV] C=0.1, gamma=0.001, kernel=rbf ..................................
[CV] ......... C=0.1, gamma=0.001, kernel=rbf, score=0.631579 -   0.0s
[CV] C=0.1, gamma=0.001, kernel=rbf ..................................
[CV] ......... C=0.1, gamma=0.001, kernel=rbf, score=0.631579 -   0.0s
[CV] C=0.1, gamma=0.001, kernel=rbf ..................................
[CV] ......... C=0.1, gamma=0.001, kernel=rbf, score=0.636364 -   0.0s
[CV] C=0.1, gamma=0.0001, kernel=rbf .................................
[CV] ........ C=0.1, gamma=0.0001, kernel=rbf, score=0.902256 -   0.0s
[CV] C=0.1, gamma=0.0001, kernel=rbf .................................
[CV] ........ C=0.1, gamma=0.0001, kernel=rbf, score=0.962406 -   0.0s
[CV] C=0.1, gamma=0.0001, kernel=rbf .................................
[CV] ........ C=0.1, gamma=0.0001, kernel=rbf, score=0.916667 -   0.0s
[CV] C=1, gamma=1, kernel=rbf ........................................
[CV] ............... C=1, gamma=1, kernel=rbf, score=0.631579 -   0.0s
[CV] C=1, gamma=1, kernel=rbf ........................................
[CV] ............... C=1, gamma=1, kernel=rbf, score=0.631579 -   0.0s
[CV] C=1, gamma=1, kernel=rbf ........................................
[CV] ............... C=1, gamma=1, kernel=rbf, score=0.636364 -   0.0s
[CV] C=1, gamma=0.1, kernel=rbf ......................................
[CV] ............. C=1, gamma=0.1, kernel=rbf, score=0.631579 -   0.0s
[CV] C=1, gamma=0.1, kernel=rbf ......................................
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s
[CV] ............. C=1, gamma=0.1, kernel=rbf, score=0.631579 -   0.0s
[CV] C=1, gamma=0.1, kernel=rbf ......................................
[CV] ............. C=1, gamma=0.1, kernel=rbf, score=0.636364 -   0.0s
[CV] C=1, gamma=0.01, kernel=rbf .....................................
[CV] ............ C=1, gamma=0.01, kernel=rbf, score=0.631579 -   0.0s
[CV] C=1, gamma=0.01, kernel=rbf .....................................
[CV] ............ C=1, gamma=0.01, kernel=rbf, score=0.631579 -   0.0s
[CV] C=1, gamma=0.01, kernel=rbf .....................................
[CV] ............ C=1, gamma=0.01, kernel=rbf, score=0.636364 -   0.0s
[CV] C=1, gamma=0.001, kernel=rbf ....................................
[CV] ........... C=1, gamma=0.001, kernel=rbf, score=0.902256 -   0.0s
[CV] C=1, gamma=0.001, kernel=rbf ....................................
[CV] ........... C=1, gamma=0.001, kernel=rbf, score=0.939850 -   0.0s
[CV] C=1, gamma=0.001, kernel=rbf ....................................
[CV] ........... C=1, gamma=0.001, kernel=rbf, score=0.954545 -   0.0s
[CV] C=1, gamma=0.0001, kernel=rbf ...................................
[CV] .......... C=1, gamma=0.0001, kernel=rbf, score=0.939850 -   0.0s
[CV] C=1, gamma=0.0001, kernel=rbf ...................................
[CV] .......... C=1, gamma=0.0001, kernel=rbf, score=0.969925 -   0.0s
[CV] C=1, gamma=0.0001, kernel=rbf ...................................
[CV] .......... C=1, gamma=0.0001, kernel=rbf, score=0.946970 -   0.0s
[CV] C=10, gamma=1, kernel=rbf .......................................
[CV] .............. C=10, gamma=1, kernel=rbf, score=0.631579 -   0.0s
[CV] C=10, gamma=1, kernel=rbf .......................................
[CV] .............. C=10, gamma=1, kernel=rbf, score=0.631579 -   0.0s
[CV] C=10, gamma=1, kernel=rbf .......................................
[CV] .............. C=10, gamma=1, kernel=rbf, score=0.636364 -   0.0s
[CV] C=10, gamma=0.1, kernel=rbf .....................................
[CV] ............ C=10, gamma=0.1, kernel=rbf, score=0.631579 -   0.0s
[CV] C=10, gamma=0.1, kernel=rbf .....................................
[CV] ............ C=10, gamma=0.1, kernel=rbf, score=0.631579 -   0.0s
[CV] C=10, gamma=0.1, kernel=rbf .....................................
[CV] ............ C=10, gamma=0.1, kernel=rbf, score=0.636364 -   0.0s
[CV] C=10, gamma=0.01, kernel=rbf ....................................
[CV] ........... C=10, gamma=0.01, kernel=rbf, score=0.631579 -   0.0s
[CV] C=10, gamma=0.01, kernel=rbf ....................................
[CV] ........... C=10, gamma=0.01, kernel=rbf, score=0.631579 -   0.0s
[CV] C=10, gamma=0.01, kernel=rbf ....................................
[CV] ........... C=10, gamma=0.01, kernel=rbf, score=0.636364 -   0.0s
[CV] C=10, gamma=0.001, kernel=rbf ...................................
[CV] .......... C=10, gamma=0.001, kernel=rbf, score=0.894737 -   0.0s
[CV] C=10, gamma=0.001, kernel=rbf ...................................
[CV] .......... C=10, gamma=0.001, kernel=rbf, score=0.932331 -   0.0s
[CV] C=10, gamma=0.001, kernel=rbf ...................................
[CV] .......... C=10, gamma=0.001, kernel=rbf, score=0.916667 -   0.0s
[CV] C=10, gamma=0.0001, kernel=rbf ..................................
[CV] ......... C=10, gamma=0.0001, kernel=rbf, score=0.932331 -   0.0s
[CV] C=10, gamma=0.0001, kernel=rbf ..................................
[CV] ......... C=10, gamma=0.0001, kernel=rbf, score=0.969925 -   0.0s
[CV] C=10, gamma=0.0001, kernel=rbf ..................................
[CV] ......... C=10, gamma=0.0001, kernel=rbf, score=0.962121 -   0.0s
[CV] C=100, gamma=1, kernel=rbf ......................................
[CV] ............. C=100, gamma=1, kernel=rbf, score=0.631579 -   0.0s
[CV] C=100, gamma=1, kernel=rbf ......................................
[CV] ............. C=100, gamma=1, kernel=rbf, score=0.631579 -   0.0s
[CV] C=100, gamma=1, kernel=rbf ......................................
[CV] ............. C=100, gamma=1, kernel=rbf, score=0.636364 -   0.0s
[CV] C=100, gamma=0.1, kernel=rbf ....................................
[CV] ........... C=100, gamma=0.1, kernel=rbf, score=0.631579 -   0.0s
[CV] C=100, gamma=0.1, kernel=rbf ....................................
[CV] ........... C=100, gamma=0.1, kernel=rbf, score=0.631579 -   0.0s
[CV] C=100, gamma=0.1, kernel=rbf ....................................
[CV] ........... C=100, gamma=0.1, kernel=rbf, score=0.636364 -   0.0s
[CV] C=100, gamma=0.01, kernel=rbf ...................................
[CV] .......... C=100, gamma=0.01, kernel=rbf, score=0.631579 -   0.0s
[CV] C=100, gamma=0.01, kernel=rbf ...................................
[CV] .......... C=100, gamma=0.01, kernel=rbf, score=0.631579 -   0.0s
[CV] C=100, gamma=0.01, kernel=rbf ...................................
[CV] .......... C=100, gamma=0.01, kernel=rbf, score=0.636364 -   0.0s
[CV] C=100, gamma=0.001, kernel=rbf ..................................
[CV] ......... C=100, gamma=0.001, kernel=rbf, score=0.894737 -   0.0s
[CV] C=100, gamma=0.001, kernel=rbf ..................................
[CV] ......... C=100, gamma=0.001, kernel=rbf, score=0.932331 -   0.0s
[CV] C=100, gamma=0.001, kernel=rbf ..................................
[CV] ......... C=100, gamma=0.001, kernel=rbf, score=0.916667 -   0.0s
[CV] C=100, gamma=0.0001, kernel=rbf .................................
[CV] ........ C=100, gamma=0.0001, kernel=rbf, score=0.917293 -   0.0s
[CV] C=100, gamma=0.0001, kernel=rbf .................................
[CV] ........ C=100, gamma=0.0001, kernel=rbf, score=0.977444 -   0.0s
[CV] C=100, gamma=0.0001, kernel=rbf .................................
[CV] ........ C=100, gamma=0.0001, kernel=rbf, score=0.939394 -   0.0s
[CV] C=1000, gamma=1, kernel=rbf .....................................
[CV] ............ C=1000, gamma=1, kernel=rbf, score=0.631579 -   0.0s
[CV] C=1000, gamma=1, kernel=rbf .....................................
[CV] ............ C=1000, gamma=1, kernel=rbf, score=0.631579 -   0.0s
[CV] C=1000, gamma=1, kernel=rbf .....................................
[CV] ............ C=1000, gamma=1, kernel=rbf, score=0.636364 -   0.0s
[CV] C=1000, gamma=0.1, kernel=rbf ...................................
[CV] .......... C=1000, gamma=0.1, kernel=rbf, score=0.631579 -   0.0s
[CV] C=1000, gamma=0.1, kernel=rbf ...................................
[CV] .......... C=1000, gamma=0.1, kernel=rbf, score=0.631579 -   0.0s
[CV] C=1000, gamma=0.1, kernel=rbf ...................................
[CV] .......... C=1000, gamma=0.1, kernel=rbf, score=0.636364 -   0.0s
[CV] C=1000, gamma=0.01, kernel=rbf ..................................
[CV] ......... C=1000, gamma=0.01, kernel=rbf, score=0.631579 -   0.0s
[CV] C=1000, gamma=0.01, kernel=rbf ..................................
[CV] ......... C=1000, gamma=0.01, kernel=rbf, score=0.631579 -   0.0s
[CV] C=1000, gamma=0.01, kernel=rbf ..................................
[CV] ......... C=1000, gamma=0.01, kernel=rbf, score=0.636364 -   0.0s
[CV] C=1000, gamma=0.001, kernel=rbf .................................
[CV] ........ C=1000, gamma=0.001, kernel=rbf, score=0.894737 -   0.0s
[CV] C=1000, gamma=0.001, kernel=rbf .................................
[CV] ........ C=1000, gamma=0.001, kernel=rbf, score=0.932331 -   0.0s
[CV] C=1000, gamma=0.001, kernel=rbf .................................
[CV] ........ C=1000, gamma=0.001, kernel=rbf, score=0.916667 -   0.0s
[CV] C=1000, gamma=0.0001, kernel=rbf ................................
[CV] ....... C=1000, gamma=0.0001, kernel=rbf, score=0.909774 -   0.0s
[CV] C=1000, gamma=0.0001, kernel=rbf ................................
[CV] ....... C=1000, gamma=0.0001, kernel=rbf, score=0.969925 -   0.0s
[CV] C=1000, gamma=0.0001, kernel=rbf ................................
[CV] ....... C=1000, gamma=0.0001, kernel=rbf, score=0.931818 -   0.0s
[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed:    0.8s finished
Out[25]:
GridSearchCV(cv=None, error_score='raise',
estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False),
fit_params={}, iid=True, n_jobs=1,
param_grid={'C': [0.1, 1, 10, 100, 1000], 'gamma': [1, 0.1, 0.01, 0.001, 0.0001], 'kernel': ['rbf']},
pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=3)

To return the best parameters and the best estimator for the new model:

In [26]:
grid.best_params_
Out[26]:
{'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}
In [27]:
grid.best_estimator_
Out[27]:
SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma=0.0001, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
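
You can also read off the mean cross-validated accuracy of that best combination directly:

print(grid.best_score_)  # mean CV accuracy of the best parameter combination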

Rerun predictions

In [28]:
grid_pred = grid.predict(X_test)
In [29]:
print(confusion_matrix(y_test,grid_pred))
print(classification_report(y_test,grid_pred))
[[ 60   6]
 [  3 102]]
             precision    recall  f1-score   support

          0       0.95      0.91      0.93        66
          1       0.94      0.97      0.96       105

avg / total       0.95      0.95      0.95       171
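
For a single summary number, you could also compute the test accuracy (a quick extra check, not part of the original notebook):

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, grid_pred))  # about 0.95 on this split, up from 0.61 before tuning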
