The ability to process large amounts of data within shorter periods of time is growing in importance. For one thing, more and more data is being produced as mobile technologies with larger information-sensing capacities spread, as more people interact over the internet via social media, and as more devices become equipped with smart technologies, among other reasons. Sources of such data include web searches, email, logs, internet marketing, geospatial data, financial data, space research, healthcare data, scientific research, and more. Furthermore, the world's ability to store data is increasing. According to one study, for example, the world's per-capita data storage capacity has been doubling every 40 months since the 1980s.
Not only are larger and larger data sets becoming more common, but the processing of such data sets is becoming increasingly important in more areas. Large data sets are frequently involved in several areas of research, from meteorology to genetics to many other fields requiring complex modeling. The ability to process large amounts of data has also become important in more everyday applications, such as finance, marketing, e-commerce, social media, and internet searches.
To make the processing of large data sets possible, a job is often broken up for parallel processing by leveraging the presence of multiple chunks of data that can be processed independently. Parallel processing can occur on several nodes, or machines, simultaneously, greatly speeding up the underlying job. However, sending large amounts of data across a network for processing can introduce its own complexities, take time, and occupy large amounts of bandwidth within the network. Many other complex problems arise in distributed processing generally, such as managing the details of the parallel and distributed processing itself and handling errors during processing.
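As a minimal sketch of the chunk-based approach described above, the following Python example splits a data set into independent chunks, processes each chunk concurrently, and combines the partial results. The function names (`split_into_chunks`, `process_chunk`, `process_in_parallel`) and the summing job are hypothetical and chosen only for illustration. Threads on a single machine stand in for the separate nodes of a cluster, so the sketch does not model network transfer, bandwidth use, or error handling.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, chunk_size):
    """Break a data set into chunks that can be processed independently."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def process_chunk(chunk):
    """Hypothetical per-chunk job: here, simply sum the values in the chunk."""
    return sum(chunk)


def process_in_parallel(records, chunk_size=1000, workers=4):
    """Process every chunk concurrently, then combine the partial results.

    Assumption: worker threads stand in for the nodes of a real cluster.
    """
    chunks = split_into_chunks(records, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(process_chunk, chunks))
    return sum(partial_results)
```

Because no chunk depends on another, the chunks may be processed in any order and on any worker. This independence is the property that allows the job to be spread across nodes.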
In the late 1990s and early 2000s, in the process of addressing problems associated with indexing the massive amounts of information that its search engine relies on, Google noticed several features common to many big-data processing problems. As a result, it developed a distributed file system, the Google File System (GFS), that provides a framework for breaking up and storing large data sets and lends itself to the processing of those large data sets. Additionally, Google developed a framework, known as the MapReduce framework, for processing distributed data sets in two main phases: a map phase that takes input files with key-value pairs and produces intermediate files with new key-value pairs, and a reduce phase that combines the values associated with common keys in the intermediate files.
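The two phases described above can be illustrated with a small, single-process word-count sketch in Python. It is not Google's or Hadoop's implementation. The function names (`map_phase`, `group_by_key`, `reduce_phase`, `run_job`) are hypothetical, and in-memory collections stand in for the input and intermediate files of a real deployment. The grouping step between the two phases is included as an assumption about how values with common keys are brought together.

```python
from collections import defaultdict


def map_phase(records):
    """Take input (key, value) pairs and emit new intermediate (key, value) pairs.

    Here each input pair is (document_id, text), and each output pair is (word, 1).
    """
    for _document_id, text in records:
        for word in text.lower().split():
            yield (word, 1)


def group_by_key(intermediate_pairs):
    """Collect all intermediate values that share a common key."""
    groups = defaultdict(list)
    for key, value in intermediate_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine the values for each common key; here, by summing the counts."""
    return {key: sum(values) for key, values in groups.items()}


def run_job(records):
    """Run the map phase, group by key, and then run the reduce phase."""
    return reduce_phase(group_by_key(map_phase(records)))
```

For example, `run_job([("doc1", "the cat"), ("doc2", "the dog the")])` counts three occurrences of "the" and one each of "cat" and "dog". In a distributed setting, the map calls could run on the nodes that hold the input chunks, and each reduce call could run independently for its own key.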
Collaboration between large corporations and other contributors has led to open-source versions of the GFS and the MapReduce framework, based on papers published by Google in 2003 and 2004. The open-source versions are referred to as the Hadoop Distributed File System (HDFS) and the Hadoop MapReduce engine, or collectively as simply Hadoop. Whether in terms of Google's version, Hadoop, or some other version, these distributed file systems and MapReduce frameworks have proved a boon to big-data processing in such areas as search, analytical functions, transformations, aggregations, and data mining, among others, and have become ubiquitous in the field. However, additional demands, such as those of larger data sets and needs for quicker processing times, require additional innovations that can sit atop Hadoop-like approaches and potentially other approaches to distributed processing.
Furthermore, multiple big-data jobs are often provisioned to a common implementation of a big-data processing technology, such as, but not limited to, a Hadoop-like approach. However, not all big-data jobs have the same requirements and priorities. These differing requirements and priorities have implications for the efficiency with which the corresponding jobs can be processed, and these implications arise from the abilities and limitations of the underlying technology used to process the jobs. Recouping efficiencies in light of these implications also requires additional innovations.
The following description and claims set forth innovations in the context of the foregoing discussion.