Google search: you use it constantly, but do you know the tech backstory? In the mid-1990s, Larry Page and Sergey Brin's creation was one of roughly 20 competing search engines. In the early 2000s, Google introduced MapReduce to handle ever-larger parallel processing jobs, which sped up page indexing and, alongside its now-famous PageRank algorithm, helped Google achieve the dominance it enjoys today.
When we push data (in this case keywords) into a distributed file system like GFS (now Colossus) or HDFS, a process known as MapReduce actually does something with that data. MapReduce is a divide-and-conquer approach, common in Hadoop designs, that splits data into manageable pieces and then sorts and aggregates them to produce a desired output (e.g., turning keywords into PageRank results).
Here’s a really simple example: plug in the numbers 3, 4, 5, 10, 12, and 16. This “input” can be “reduced” to a single “output” of 50. In the example below, keywords are sorted and reduced to a base number. At a high level, MapReduce consists of six basic steps, but the core ones are listed below:
1) Map phase: data is split among nodes, with each task assigned to a node where the data for that task is located. Each node in the map phase emits key-value pairs based on the input, one record at a time.
2) Shuffle phase: transfers and merges results. Output from the mappers is sent to the reducers as partitions.
3) Reduce phase: each reducer receives one or more partitions and reads a key along with the list of values associated with that key. Reducers return zero or more key-value pairs based on the application logic. In this simple example, the keywords have just been reduced to the sum of the inputs.
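The three phases above can be sketched in Python. This is a toy illustration of the pattern, not Google's implementation: the mapper emits every number under a single made-up key ("sum") so that the shuffle groups them together and one reducer can add them up.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit (key, value) pairs, one record at a time."""
    for number in records:
        yield ("sum", number)  # hypothetical single key for this toy example

def shuffle_phase(pairs):
    """Shuffle: group values by key, as the shuffle step partitions them."""
    partitions = defaultdict(list)
    for key, value in pairs:
        partitions[key].append(value)
    return partitions

def reduce_phase(partitions):
    """Reduce: each reducer sums the list of values for its key."""
    return {key: sum(values) for key, values in partitions.items()}

pairs = map_phase([3, 4, 5, 10, 12, 16])
result = reduce_phase(shuffle_phase(pairs))
print(result)  # {'sum': 50}
```

In a real cluster, the mappers and reducers run on different machines and the shuffle moves data over the network; here all three phases run in one process purely to show the data flow.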
In this next example, Google’s MapReduce algorithm performs a word count on a document. The map method takes an input key and a value, where the key is a document name and the value is its associated content.
The algorithm loops through each word in the document and emits the values: word (w) and count (1). The reduce method takes as input a key (the word) and a list of values (v). The list of values is simply the list of counts for that word. In this example, each value is just “1”. The reduce method loops through the counts and sums them; when the loop is done, it returns the word and its total count.
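Here is a minimal Python sketch of that word count, assuming whitespace-separated words. The function names and the sample document are made up for illustration; Google's production version was a distributed C++ system, but the map and reduce logic is the same shape.

```python
from collections import defaultdict

def map_words(document_name, content):
    """Map: emit (word, 1) for every word in the document."""
    for word in content.split():
        yield (word, 1)

def reduce_counts(word, counts):
    """Reduce: sum the list of counts for one word."""
    return (word, sum(counts))

# Shuffle: group the emitted 1s by word.
grouped = defaultdict(list)
for word, count in map_words("doc1", "the quick brown fox jumps over the lazy dog"):
    grouped[word].append(count)

totals = dict(reduce_counts(w, c) for w, c in grouped.items())
print(totals["the"])  # 2
```

Note that the map step never needs to see the whole document set at once, which is exactly what lets the work be split across many machines.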
Of course, there are volumes of additional Google search architecture and processes to explore, and Google moved on from MapReduce for building its search index years ago, but you just read the basics. Now imagine this method sorting through millions of websites, hunting for millions more keywords and values per minute, and you can begin to appreciate how scalable and efficient this divide-and-conquer architecture needed to be.
Even though Google was a latecomer to the search engine wars, it won the battles in large part by developing a scale- and performance-sensitive design years before Doug Cutting gave that architecture the name Hadoop. So without even realizing it at the time, its search engineers (among them Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung, and Jeffrey Dean) became founding fathers of today’s analytics revolution.