# Introduction to Anomaly Detection

## Problem Motivation

A very common application of anomaly detection is detecting fraud. How anomaly detection works:

1. Given a dataset, we define a model p(x) that tells us the probability that new data is not anomalous. We use a threshold ϵ (epsilon) to divide the new data into anomalous and not anomalous
2. If your anomaly detector is flagging up too many anomalous example, it might be a good idea to decrease the threshold

In the application of detecting fraud:

• x(i) = features of user i’s activities
• Model p(x) from the data
• Identify unusual users by checking which have p(x)<ϵ

## Gaussian Distribution (re-visit)

The Gaussian distribution is a familiar bell-shaped curve that can be described by a function $\mathcal{N}(\mu,\sigma^2)$. The Gaussian distribution is parameterised by mean and variance whereby the mean describes the center of the curve and sigma describe the width of the curve. The full function is as follows: $\large p(x;\mu,\sigma^2) = \dfrac{1}{\sigma\sqrt{(2\pi)}}e^{-\dfrac{1}{2}(\dfrac{x - \mu}{\sigma})^2}$

Formula reminder: $\sigma^2 = \dfrac{1}{m}\displaystyle \sum_{i=1}^m(x^{(i)} - \mu)^2$

## Algorithm

Given a training set of examples: $p(x) = p(x_1;\mu_1,\sigma_1^2)p(x_2;\mu_2,\sigma^2_2)\cdots p(x_n;\mu_n,\sigma^2_n)$ $= \displaystyle \prod^n_{j=1} p(x_j;\mu_j,\sigma_j^2)$

This is called an independence assumption on the values of the features inside training example x.

#### Implementation

1. Choose features x(i) that you think might be indicative of anomalous examples
2. Fit parameters $\mu_1,\dots,\mu_n,\sigma_1^2,\dots,\sigma_n^2$
3. Calculate the mean (vectorised)
4. Calculate the variance (vectorised)
5. Given a new example x, compute p(x). If p(x) is less than epsilon, it’s an anomaly

# Developing and Evaluating an Anomaly Detection System

Firstly, we need to take some labeled data and categorised the data into anomalous and non-anomalous examples (y = 0 if normal, y = 1 if anomalous). Among the data, take a large proportion of the normal data for training set on which to train p(x) model. We will then use a smaller proportion of mixed anomalous and normal examples for cross-validation and test sets. We aim to split the data 60/20/20 training/validation/test split and then split the anomalous examples 50/50 between cross validation and test sets. Therefore we have the following:

1. Fit model p(x) on training set {x(1),…..,x(m)}
2. On a cross validation/test example x, predict:
• If p(x) < ϵ (anomaly), then y=1
• If p(x) ≥ ϵ (normal), then y=0

Classification accuracy would not be a good evaluation metrics because of the skewness in the dataset. Possible evaluation metrics include Precision/Recall, F1 score and etc…

## Anomaly Detection vs. Supervised Learning

We use anomaly detection when:

• We have a very small number of positive examples (y=1 (anomaly) … 0-20 examples is common) and a large number of negative (y=0 (normal)) examples
• We have many different “types” of anomalies and it is hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we’ve seen so far

We use supervised learning when:

• We have a large number of both positive and negative examples. In other words, the training set is more evenly divided into classes
• We have enough positive examples for the algorithm to get a sense of what new positives examples look like. The future positive examples are likely similar to the ones in the training set

## Choosing What Features to Use

The features we use will greatly affect how well our anomaly detection algorithm works. We can check that our features are gaussian by plotting a histogram and checking for the bell-shaped curve. If you don’t have a bell-shaped curve, you can try the following transformations to achieve the gaussian shape:

• log(x)
• log(x+c) for some constant
• sqrt(x)
• x^(1/3)

Our objective is for p(x) to be large for normal examples and small for anomalous examples. One issue is when p(x) is similar for both types of examples. This means that you need to examine the anomalous examples that are giving high probability in detail and try to figure out new features that will better distinguish the data. In general, choose features that might take on unusually large or small values in the event of an anomaly.

# Recommender Systems

We experience recommender systems in our everyday activities, ranging from recommended Facebook ads to Netflix recommended movies for us to watch. Let’s say we are building a recommender system to recommend movies to customers. We will have the following variables:

• $n_u =$ number of users
• $m_u =$ number of movies
• r(i,j) = 1 if user j has rated movie i
• y(i,j) = rating given by user j to movie i (if r(i,j) = 1)

## Content Based Recommendations

Let’s assume two features for a movie whereby x(1) represents how romantic a movie is and x(2) represents how much action a movie has (on a scale 0 -1). We could do a linear regression for every user. For user j, movie i, the predicted rating is as follows: $(\theta^{(j)})^T(x^{(i)})$

where theta(j) is the parameter vector for user j and x(i) is the feature vector for movie i. To compute theta(j), we do the following: $min_{\theta^{(j)}} = \dfrac{1}{2}\displaystyle \sum_{i:r(i,j)=1} ((\theta^{(j)})^T(x^{(i)}) - y^{(i,j)})^2 + \dfrac{\lambda}{2} \sum_{k=1}^n(\theta_k^{(j)})^2$

where the first summation is choosing all the movies whereby r(i,j) = 1. To get the parameters for all our users, we have the following expressions: $min_{\theta^{(1)},\dots,\theta^{(n_u)}} = \dfrac{1}{2}\displaystyle \sum_{j=1}^{n_u} \sum_{i:r(i,j)=1} ((\theta^{(j)})^T(x^{(i)}) - y^{(i,j)})^2 + \dfrac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^n(\theta_k^{(j)})^2$

We can apply our linear regression gradient descent update using the above cost function! Below is an example of content-based recommender system: ## Collaborative Filtering

It is a difficult task to quantify “amount of romance” or “amount of action” in a movie. To figure this out, we can use feature finders. We can ask the users to tell us how much they like the different genres (providing their parameter vector to us). To infer the features from given parameters, we have the following expression: $min_{x^{(1)},\dots,x^{(n_m)}} \dfrac{1}{2} \displaystyle \sum_{i=1}^{n_m} \sum_{j:r(i,j)=1} ((\theta^{(j)})^T x^{(i)} - y^{(i,j)})^2 + \dfrac{\lambda}{2}\sum_{i=1}^{n_m} \sum_{k=1}^{n} (x_k^{(i)})^2$

You can also randomly guess the values for theta to guess the features repeatedly. You will actually converge to a good set of features!

#### Algorithm

We can simultaneously minimise the features and parameters by combining the cost function for theta and x. $J(x,\theta) = \dfrac{1}{2} \displaystyle \sum_{(i,j):r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 + \dfrac{\lambda}{2}\sum_{i=1}^{n_m} \sum_{k=1}^{n} (x_k^{(i)})^2 + \dfrac{\lambda}{2}\sum_{j=1}^{n_u} \sum_{k=1}^{n} (\theta_k^{(j)})^2$

1. Initialise sets of x and theta values to small random values. This will break the symmetry and ensures that the algorithm learns features x(i),…,x(n_m) that are different from each other
2. Minimise cost function above using either the gradient descent or any advanced optimisation algorithms. For every j (user) = 1,…,n_u and every i (movie) = 1,…,n_m:
• $x_k^{(i)} := x_k^{(i)} - \alpha\left (\displaystyle \sum_{j:r(i,j)=1}{((\theta^{(j)})^T x^{(i)} - y^{(i,j)}) \theta_k^{(j)}} + \lambda x_k^{(i)} \right)$
• $\theta_k^{(j)} := \theta_k^{(j)} - \alpha\left (\displaystyle \sum_{i:r(i,j)=1}{((\theta^{(j)})^T x^{(i)} - y^{(i,j)}) x_k^{(i)}} + \lambda \theta_k^{(j)} \right)$
3. For a user with parameters theta and a movie with features x, predict a star rating of theta^(T)x

## Low Rank Matrix Factorisation

#### Vectorisation

Given matrix X, which each row contains features of a particular movie and matrix Θ, which each row contains the weights for those features for a given user, the full matrix Y of all predicted ratings of all movies by all users is given by Y = X * Θ^T. We can easily predict how similar two movies i and j by using the distance between their respective feature vectors x. Specifically, we are looking for a small value of ||x(i) – x(j)||. #### Mean normalisation

One final addition to our ranking system for movies is to normalise the data relative to the mean. The reason we do this is because given our current ranking system, a new user (who has watched no movies) will be assigned θ will all components equal to zero due to minimisation of the regularisation term. This doesn’t seem intuitively correct and new user will be assigned to new movies incorrectly.

Firstly, we use a matrix Y to store the data from previous ratings, where the ith row of Y is the ratings for the ith movie and the jth column corresponds to the ratings for the jth user. We can define a vector $\mu = [\mu_1, \mu_2, \dots , \mu_{n_m}]$ such that: $\mu_i = \frac{\sum_{j:r(i,j)=1}{Y_{i,j}}}{\sum_{j}{r(i,j)}}$

This is effectively the mean of the previous ratings for the ith movie. We can now normalise the data by subtracting $\mu$, the mean rating, from the actual ratings for each user (column in matrix Y). Please see below an example: 