So far, we’ve been working with data that fits on our local computer, roughly in the 0–8 GB range. Once our projects start to use larger datasets, it is a good idea to move to a distributed system, in which one computer distributes the large dataset to multiple machines. A local process uses the computational resources of a single machine, whereas a distributed process has access to the computational resources of many machines connected through a network. Distributed systems have the advantage of easy scaling: you can simply add more machines. They also offer fault tolerance: if one machine fails, the rest of the network can still continue!
Hadoop uses the Hadoop Distributed File System (HDFS) to distribute very large files across multiple machines. HDFS lets a user work with large datasets, and it also duplicates blocks of data for fault tolerance. Hadoop also uses MapReduce, which allows computations on that data. The diagram below shows a typical layout of distributed storage using HDFS:
HDFS stores data in blocks, with a default size of 128 MB. Each block is replicated 3 times, and the replicas are distributed across nodes in a way that supports fault tolerance. Smaller blocks provide more parallelisation during processing, and multiple copies of a block prevent loss of data when a node fails. MapReduce is a way of splitting a computational task across a distributed set of files (such as those in HDFS). It consists of a Job Tracker and multiple Task Trackers: the Job Tracker sends code to run on the Task Trackers, and the Task Trackers allocate CPU and memory for the tasks and monitor them on the worker nodes. Overall, when dealing with big data, we use HDFS to distribute large datasets and MapReduce to distribute a computational task over that distributed data.
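The map → shuffle → reduce flow described above can be sketched in plain Python as a single-machine analogue of word counting (real MapReduce runs each phase across Task Trackers; the function names here are illustrative, not part of any framework):

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every line
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key, as the framework does
    # between the map and reduce phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values into a single result
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big compute", "big data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
# counts == {"big": 3, "data": 2, "compute": 1}
```

In a real cluster, the map and reduce phases run in parallel on different worker nodes and the shuffle moves data between them over the network.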
Spark is one of the latest technologies used to handle Big Data quickly and easily. It was created at the AMPLab at UC Berkeley. You can view Spark as a flexible alternative to MapReduce: Spark can use data stored in a variety of formats, such as Cassandra, AWS S3 and HDFS. Spark can also perform operations up to 100x faster than MapReduce. This is because MapReduce writes most data to disk after each map and reduce operation, whereas Spark keeps most of the data in memory after each transformation, spilling over to disk only if memory is full.
At the core of Spark is the idea of a Resilient Distributed Dataset (RDD). An RDD has 4 main features:
- Distributed collection of data
- Fault tolerance (the “resilient” part)
- Parallel operation – partitioned
- Ability to use many data sources
There are two types of RDD operations: Transformations and Actions. Basic Actions include:
- First: return the first element in the RDD
- Collect: return all the elements in the RDD as an array at the driver program
- Count: return the number of elements in the RDD
- Take: return an array with the first n elements in the RDD
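As a mental model, each of these actions has a direct plain-Python analogue over a list (the real RDD versions do the same thing, but pull results from the distributed partitions back to the driver program):

```python
rdd = [5, 3, 8, 1]       # stand-in for the contents of an RDD

first = rdd[0]           # like rdd.first()   -> 5
collected = list(rdd)    # like rdd.collect() -> [5, 3, 8, 1]
count = len(rdd)         # like rdd.count()   -> 4
taken = rdd[:2]          # like rdd.take(2)   -> [5, 3]
```

Note that collect() on a genuinely large RDD can overwhelm the driver’s memory, since it gathers every partition onto one machine.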
Basic Transformations include:
- Filter: applies a function to each element and keeps the elements that evaluate to true
- Map: transforms each element and preserves the number of elements, similar to pandas .apply()
- FlatMap: transforms each element into 0–N elements, so it can change the number of elements
- ReduceByKey: aggregates the elements of a Pair RDD key by key, combining values with a function, and returns a Pair RDD
Closely related is Reduce, which aggregates all RDD elements with a function and returns a single value – strictly speaking, Reduce is an action, since it returns a result to the driver rather than a new RDD.
Difference between Map and FlatMap: for example, Map could grab the first letter of each name in a list (one output per input), whereas FlatMap could transform a corpus of text into a single flat list of words (many outputs per input).
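That difference is easy to see with plain-Python analogues of the two operations (itertools.chain plays the role of the flattening step that flatMap performs):

```python
from itertools import chain

names = ["Alice", "Bob", "Carol"]
corpus = ["big data tools", "spark and hadoop"]

# map analogue: exactly one output element per input element
first_letters = [name[0] for name in names]
# -> ['A', 'B', 'C']  (3 inputs, 3 outputs)

# flatMap analogue: each input expands to 0..N outputs, then flattened
words = list(chain.from_iterable(line.split() for line in corpus))
# -> ['big', 'data', 'tools', 'spark', 'and', 'hadoop']  (2 inputs, 6 outputs)
```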
Very often, RDDs hold their values in (key, value) tuples. This offers better partitioning of data and enables functionality based on reduction: Reduce and ReduceByKey are similar to a GroupBy operation. The Spark ecosystem now includes Spark SQL, Spark DataFrames, MLlib, GraphX and Spark Streaming. In the course we will be using the Amazon Elastic Compute Cloud (Amazon EC2), a web service that provides resizable compute capacity in the cloud (basically a virtual computer that we can access through the internet).
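The key/value idea can be sketched the same way: a plain-Python analogue of ReduceByKey groups the pair elements by key and then folds each group with the supplied function (reduce_by_key here is an illustrative helper, not a real PySpark call):

```python
from collections import defaultdict

def reduce_by_key(pairs, func):
    # Group values by key, mimicking the partition-by-key step
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Fold each key's values pairwise with the supplied function
    result = {}
    for key, values in groups.items():
        acc = values[0]
        for v in values[1:]:
            acc = func(acc, v)
        result[key] = acc
    return result

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]
sums = reduce_by_key(pairs, lambda a, b: a + b)
# sums == {"a": 4, "b": 6}
```

Because the data is already partitioned by key, the real reduceByKey can do most of this combining locally on each worker before anything is sent over the network.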
Two common ways to create an RDD:
from pyspark import SparkContext
sc = SparkContext()
- sc.parallelize(array) – Create an RDD from the elements of an array
- sc.textFile(path-to-file) – Create an RDD of lines from a file (assign it to a variable – in the examples below, let’s use text)
Examples of RDD Transformation:
- nums.filter(lambda x: x % 2 == 0) – Keep only the even elements (nums being a numeric RDD, e.g. nums = sc.parallelize([1, 2, 3, 4]))
- nums.map(lambda x: x * 2) – Multiply each RDD element by 2
- text.flatMap(lambda x: x.split()) – Split each line into words and flatten the sequence
- text.sortBy(lambda x: x, ascending=False) – Sort elements in descending order
Examples of RDD Actions:
- collect() – Convert the RDD to an in-memory list on the driver
- take(3) – Return the first 3 elements of the RDD
- mean() – Find the mean of the elements
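Putting the examples above together, the same pipeline can be mirrored in plain Python to check what each step should produce (the real PySpark transformations are lazy – nothing runs until an action like collect() or mean() is called):

```python
from functools import reduce

nums = [1, 2, 3, 4, 5]
lines = ["spark keeps data in memory", "hdfs stores blocks on disk"]

evens = [x for x in nums if x % 2 == 0]              # filter   -> [2, 4]
doubled = [x * 2 for x in nums]                      # map      -> [2, 4, 6, 8, 10]
words = [w for line in lines for w in line.split()]  # flatMap  -> one flat word list
desc = sorted(nums, reverse=True)                    # sortBy descending -> [5, 4, 3, 2, 1]

# mean as a reduce followed by a divide -> 3.0
total_mean = reduce(lambda a, b: a + b, nums) / len(nums)
```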