Large Scale Machine Learning

Large Datasets

We mainly benefit from a very large dataset when our algorithm has high variance, i.e. when it overfits because m is small. An algorithm with high bias will not benefit from an increased dataset size. This tells us that, before setting up the infrastructure to handle large datasets, it is a good idea to plot the learning curves and check that the algorithm's performance would actually benefit from more data. Datasets can often approach sizes such that m = 100,000,000, which means a single batch gradient descent step would have to sum over all 100 million examples. There are approaches to avoid doing this:
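
As a rough illustration, the sketch below plots a learning curve for a simple linear model on synthetic data (the data, model, and error function are invented here purely for illustration; none of them come from these notes):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Synthetic linear data (purely illustrative).
n_total = 5000
X_all = rng.normal(size=(n_total, 3))
y_all = X_all @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=n_total)

X_tr, y_tr = X_all[:4000], y_all[:4000]
X_val, y_val = X_all[4000:], y_all[4000:]          # hold-out set

def fit(X, y):
    # Ordinary least squares via the normal equation.
    Xb = np.c_[np.ones(len(X)), X]
    return np.linalg.lstsq(Xb, y, rcond=None)[0]

def error(theta, X, y):
    Xb = np.c_[np.ones(len(X)), X]
    return np.mean((Xb @ theta - y) ** 2) / 2

sizes = np.linspace(20, 4000, 15, dtype=int)
train_err = [error(fit(X_tr[:m], y_tr[:m]), X_tr[:m], y_tr[:m]) for m in sizes]
val_err = [error(fit(X_tr[:m], y_tr[:m]), X_val, y_val) for m in sizes]

plt.plot(sizes, train_err, label="training error")
plt.plot(sizes, val_err, label="cross-validation error")
plt.xlabel("m (training set size)"); plt.ylabel("error"); plt.legend(); plt.show()
# A persistent gap between the curves suggests high variance (more data helps);
# both curves flattening at the same high error suggests high bias (more data won't help).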

Stochastic Gradient Descent

This is an alternative to classic batch gradient descent that is more efficient and scales better to large datasets. The stochastic gradient descent cost function is written as follows:

cost(\theta,(x^{(i)}, y^{(i)})) = \dfrac{1}{2}(h_{\theta}(x^{(i)}) - y^{(i)})^2

The only difference in the above cost function (in comparison to batch gradient descent) is that it is defined for a single training example, so the 1/m constant is dropped and only the 1/2 factor remains. This means that:

J_{train}(\theta) = \dfrac{1}{m} \displaystyle \sum_{i=1}^m cost(\theta, (x^{(i)}, y^{(i)}))

is now just the average of the cost applied to all of the training examples. The stochastic gradient descent algorithm is as follows:

  1. Randomly shuffle the dataset
  2. For i = 1, …, m
    • \theta_j := \theta_j - \alpha (h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x^{(i)}_j (simultaneously for every j = 0, …, n)

This algorithm fits only one training example at a time, which means we can make progress in gradient descent without having to scan all m training examples first. However, stochastic gradient descent is unlikely to converge at the global minimum. It will instead wander around it randomly, but usually yields a result that is close enough. Stochastic gradient descent will usually take 1-10 passes through the dataset to get near the global minimum.
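
A minimal NumPy sketch of this procedure for linear regression follows; the synthetic data, learning rate, and epoch count are illustrative assumptions rather than values from the notes:

import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, num_epochs=3, seed=0):
    """SGD for linear regression; X should already include a bias column."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(num_epochs):                  # typically 1-10 passes
        order = rng.permutation(m)               # step 1: randomly shuffle
        for i in order:                          # step 2: one example at a time
            error = X[i] @ theta - y[i]          # h_theta(x^(i)) - y^(i)
            theta -= alpha * error * X[i]        # update every theta_j simultaneously
    return theta

# Tiny synthetic check (illustrative only).
rng = np.random.default_rng(1)
X = np.c_[np.ones(1000), rng.normal(size=(1000, 2))]
y = X @ np.array([4.0, 3.0, -2.0]) + rng.normal(scale=0.1, size=1000)
print(stochastic_gradient_descent(X, y))         # should land near [4, 3, -2]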

Convergence

How do we choose the learning rate α? How do we debug the algorithm to make sure it is getting as close to the global optimum as possible?

One strategy is to plot the cost of the hypothesis averaged over the last 1000 or so training examples. We can compute and save these costs during the gradient descent iterations, just before each update to θ. With a smaller learning rate, it is possible to obtain a slightly better solution with stochastic gradient descent, because the algorithm oscillates and jumps around the global minimum, and a smaller learning rate produces smaller random jumps. If you increase the number of examples you average over, the plotted line becomes smoother; if you average over too few examples, the line is too noisy and it becomes difficult to see the trend.
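
The sketch below illustrates this monitoring strategy on top of the SGD loop above; the window of 1000 examples matches the suggestion in the text, while the synthetic data and plotting details are assumptions:

import numpy as np
import matplotlib.pyplot as plt

def sgd_with_monitoring(X, y, alpha=0.01, window=1000, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    recent_costs, averaged_costs = [], []
    for i in rng.permutation(len(y)):
        error = X[i] @ theta - y[i]
        recent_costs.append(0.5 * error ** 2)    # cost(theta, (x^(i), y^(i))), computed before the update
        theta -= alpha * error * X[i]
        if len(recent_costs) == window:          # every 1000 examples, record the average
            averaged_costs.append(np.mean(recent_costs))
            recent_costs = []
    return theta, averaged_costs

rng = np.random.default_rng(2)
X = np.c_[np.ones(50000), rng.normal(size=(50000, 2))]
y = X @ np.array([4.0, 3.0, -2.0]) + rng.normal(scale=0.5, size=50000)
theta, curve = sgd_with_monitoring(X, y)
plt.plot(curve)                                  # a noisy but downward-trending curve indicates convergence
plt.xlabel("batches of 1000 examples"); plt.ylabel("average cost"); plt.show()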

One strategy for trying to actually converge at the global minimum is to slowly decrease α over time. For example:

\alpha = \dfrac{const1}{iterationNumber + const2}

However, this is often not implemented because people don’t want to have to fiddle with even more parameters.
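
A minimal sketch of such a schedule (const1 and const2 are arbitrary placeholders that would have to be tuned, which is exactly the extra fiddling mentioned above):

def decayed_alpha(iteration, const1=5.0, const2=50.0):
    # alpha = const1 / (iterationNumber + const2); const1 and const2 are
    # hypothetical values, chosen here only to show the shape of the schedule.
    return const1 / (iteration + const2)

# Example: the learning rate shrinks as the iteration number grows.
for t in (1, 10, 100, 1000):
    print(t, decayed_alpha(t))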

Mini-Batch Gradient Descent

This alternative can sometimes be even faster than stochastic gradient descent. Instead of using all m examples or only 1 example, this gradient descent will use some in-between number of examples b (mini-batch size). For example, with b = 10 and m = 1000, we will repeat the following:

For i = 1, 11, 21, 31, …, 991

\theta_j := \theta_j - \alpha \dfrac{1}{10} \displaystyle \sum_{k=i}^{i+9} (h_\theta(x^{(k)}) - y^{(k)})x_j^{(k)}

In this case, we are summing over ten examples at a time. The advantage of computing more than one example at a time is that we can use vectorised implementations over the b examples.
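
A minimal vectorised sketch with b = 10 and m = 1000, mirroring the example above (the synthetic data, learning rate, and epoch count are illustrative assumptions):

import numpy as np

def mini_batch_gradient_descent(X, y, alpha=0.05, b=10, num_epochs=5, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(num_epochs):
        order = rng.permutation(m)
        for start in range(0, m, b):             # i = 1, 11, 21, ... in the notes' notation
            idx = order[start:start + b]
            grad = X[idx].T @ (X[idx] @ theta - y[idx]) / len(idx)   # vectorised over the b examples
            theta -= alpha * grad
    return theta

rng = np.random.default_rng(3)
X = np.c_[np.ones(1000), rng.normal(size=(1000, 2))]     # m = 1000, as in the example above
y = X @ np.array([4.0, 3.0, -2.0]) + rng.normal(scale=0.1, size=1000)
print(mini_batch_gradient_descent(X, y))                  # should land near [4, 3, -2]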

Online Learning

With a continuous stream of users to a website, we can run an endless loop that gets (x, y) pairs, where we collect some user actions as the features in x to predict some behaviour y. You can update θ for each individual (x, y) pair as you collect them, then discard that example. This way, you can adapt to new pools of users, since you are continuously updating θ.
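
The sketch below illustrates this loop with logistic regression on a simulated stream; the feature layout, the next_user_event helper, and all constants are invented for illustration:

import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(3)          # e.g. bias, price offered, estimated delivery time (hypothetical features)
alpha = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def next_user_event():
    # Stand-in for a real stream of (x, y) pairs from the website.
    x = np.array([1.0, rng.uniform(0, 10), rng.uniform(0, 5)])
    y = float(rng.random() < sigmoid(x @ np.array([0.5, -0.3, -0.2])))
    return x, y

for _ in range(100000):                              # in practice this loop never ends
    x, y = next_user_event()                         # get one (x, y) pair
    theta -= alpha * (sigmoid(x @ theta) - y) * x    # update theta, then discard the example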

MapReduce and Data Parallelism

When dealing with very large datasets, it is common to divide up batch gradient descent and dispatch the cost function computation for subsets of the data to many different machines (computers), so that we can train our algorithm in parallel. We can split our training set into z subsets, corresponding to the number of machines we have. Each machine will calculate

\displaystyle \sum_{i=p}^{q}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)}

where we have split the data starting at p and ending at q. MapReduce will then take all these dispatched jobs and ‘reduce’ them by calculating:

\theta_j := \theta_j - \alpha \dfrac{1}{m}(temp_j^{(1)} + temp_j^{(2)} + \cdots + temp_j^{(z)})

for all j = 0, …, n. The expression above simply takes the partial sums computed by all the machines, averages them, multiplies by the learning rate, and updates θ.

Your learning algorithm is MapReducable if it can be expressed as computing sums of functions over the training set. Linear and logistic regression are easily parallelisable. For neural networks, you can compute forward propagation and back propagation on subsets of your data on many machines. Those machines can report their derivatives back to a ‘master’ server that will combine them.
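
The sketch below mimics this map-and-reduce pattern for one batch gradient descent step on a single computer, using Python's multiprocessing pool to stand in for the z separate machines (the pool size, data split, and synthetic data are illustrative):

import numpy as np
from multiprocessing import Pool

def partial_gradient(args):
    # "Map" step: each worker sums (h_theta(x^(i)) - y^(i)) * x_j^(i) over its own slice.
    X_part, y_part, theta = args
    return X_part.T @ (X_part @ theta - y_part)

def mapreduce_gradient_step(X, y, theta, alpha, z=4):
    chunks = [(X_part, y_part, theta)
              for X_part, y_part in zip(np.array_split(X, z), np.array_split(y, z))]
    with Pool(z) as pool:                                # spinning up a pool per step is purely illustrative
        partial_sums = pool.map(partial_gradient, chunks)
    # "Reduce" step: combine the temp^(k) sums, average, and update theta.
    return theta - alpha / len(y) * np.sum(partial_sums, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.c_[np.ones(100000), rng.normal(size=(100000, 2))]
    y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.1, size=100000)
    theta = np.zeros(3)
    for _ in range(50):
        theta = mapreduce_gradient_step(X, y, theta, alpha=0.3)
    print(theta)                                         # should approach [1, 2, -1]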
