The ability to process large amounts of data within shorter periods of time is growing in importance. For one thing, more and more data is being produced as mobile technologies with larger information-sensing capacities spread, as more people interact over the internet via social media, and as more devices become equipped with smart technologies, among other reasons. Sources of such data include web searches, email, logs, internet marketing, geospatial data, financial data, space research, healthcare data, scientific research, and more. Furthermore, the world's ability to store data is increasing. According to one study, for example, the world's per-capita data storage capacity has been doubling every 40 months since the 1980s.
Not only are larger and larger data sets becoming more common, but the processing of such data sets is becoming increasingly important in more areas. Large data sets are frequently involved in areas of research ranging from meteorology to genetics to many other fields requiring complex modeling. The ability to process large amounts of data has also become important in more everyday applications, such as finance, marketing, e-commerce, social media, and internet searches. However, the data sets that must be processed to support functionalities in these and other areas are often so large that traditional processing approaches are either impractical or simply impossible.
To make the processing of large data sets possible, a job is often broken up for parallel processing by leveraging the presence of multiple chunks of data that can be processed independently. Parallel processing can occur on several nodes, or machines, simultaneously, greatly speeding up the underlying job. However, sending large amounts of data across a network for processing can introduce its own complexities, take time, and occupy large amounts of bandwidth within the network. Many other complex problems arise in distributed processing generally, such as managing the details of the parallel and distributed processing itself and handling errors during processing.
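By way of illustration only, and not limitation, the splitting of a job into independently processable chunks may be sketched as follows. The sketch assumes a single machine on which local worker threads stand in for separate nodes, so it shows the structure of the approach rather than any speed-up. The function names, the chunk size, and the sum-of-squares workload are hypothetical and chosen solely for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, chunk_size):
    """Break a data set into consecutive chunks that share no state."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def process_chunk(chunk):
    """Hypothetical per-chunk work: sum of squares of the chunk's values."""
    return sum(x * x for x in chunk)


def process_in_parallel(data, chunk_size=4, workers=3):
    """Process every chunk on its own worker, then combine partial results."""
    chunks = split_into_chunks(data, chunk_size)
    # Assumption: worker threads stand in for the nodes of a cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(process_chunk, chunks))
    return sum(partial_results)


total = process_in_parallel(list(range(10)))
print(total)
```

Because no chunk depends on any other, the chunks may be processed in any order and on any worker. In an actual distributed setting, each chunk would additionally have to be transferred to, or already reside on, the node that processes it, which is the source of the network costs noted above.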
In the late 1990s and early 2000s, in the process of addressing problems associated with indexing the massive amounts of information on which its search engine relies, Google noticed several features common to many big-data processing problems. As a result, it developed a distributed file system, the Google File System (GFS), that provides a framework for breaking up and storing large data sets across physically independent commodity machines interlinked by a network, and that lends itself to the processing of those large data sets. Additionally, Google developed a framework, known as the MapReduce framework, for processing distributed data sets in two main phases. These main phases comprise a map phase, which takes input files with key-value pairs and produces intermediate files with new key-value pairs, and a reduce phase, which combines the values associated with common keys in the intermediate files.
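By way of illustration only, the two phases may be sketched with a word-count example. The sketch runs in a single process and holds all data in memory; it is not the interface of Google's MapReduce framework or of any other implementation. All function names are hypothetical, and the grouping step between the two phases is an assumption made so that the reduce phase receives all values sharing a key together.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to each input (key, value) pair, yielding intermediate pairs."""
    for key, value in records:
        yield from mapper(key, value)


def group_by_key(pairs):
    """Collect intermediate values under their common key (assumed grouping step)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Combine the values associated with each common key into one result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count_mapper(filename, text):
    """Hypothetical mapper: emit (word, 1) for every word in the input text."""
    for word in text.lower().split():
        yield word, 1


def sum_reducer(word, counts):
    """Hypothetical reducer: total the counts emitted for one word."""
    return sum(counts)


records = [("a.txt", "big data big jobs"), ("b.txt", "big clusters")]
intermediate = map_phase(records, word_count_mapper)
counts = reduce_phase(group_by_key(intermediate), sum_reducer)
print(counts)
```

Here the input keys are file names and the input values are file contents, while the intermediate keys are words and the intermediate values are counts. The mapper for one record does not depend on any other record, which is what allows the map phase to be spread across machines.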
In 2003 and 2004, Google published two papers describing GFS and the MapReduce framework, respectively. These papers, together with extensive collaboration among large corporations and other contributors, have led to open source versions of the foregoing system and framework, respectively referred to as the Hadoop Distributed File System (HDFS) and the Hadoop MapReduce engine, or collectively as simply Hadoop. Whether in terms of Google's version, Hadoop, or some other version, these distributed file systems and MapReduce frameworks have proved a boon to big data processing in such areas as search, analytical functions, transformations, aggregations, and data mining, among others, and have become ubiquitous in the field. However, additional demands, such as those of larger data sets and needs for quicker processing times, require additional innovations that can sit atop Hadoop-like approaches and potentially other approaches to distributed processing. The following description and claims set forth such innovations.