1. Field
The present invention relates generally to data analysis, and more specifically, to automating at least part of the composition of MapReduce jobs.
2. Description of the Related Art
MapReduce is a programming technique often used to analyze data with a parallel, distributed algorithms executed on a cluster of computers. MapReduce is generally used where computing tasks present 1) massive amounts of data; 2) amenable to parallel processing; 3) in a batch process, e.g., before the results are used. Often, the data being analyzed is large relative to the computing resources available to any one computing node of the computing cluster, e.g., often including datasets in the terabyte or petabyte range. Some MapReduce tools abstract away from the user many aspects of implementing distributed parallel processing, including implementing fault-tolerant distributed file systems, task allocation, and machine-to-machine communications.
Often, non-computer scientists have occasion to address the type of problems for which MapReduce is well suited. For instance, often business analysts, who are generally trained to use tools at higher levels of abstraction than that of most MapReduce implementations, would like to quickly perform fine-grained analyses of very large data sets (e.g., data sets dividing the world into very small atomic units of time, persons, and place). Such users often cannot simply perform a coarser analysis on smaller, a more manageable data set because many useful analyses cannot meaningfully aggregate and support decisions based on frequently useful attributes, like whose bio-sensors indicate they are going to have severe health problems, or which geographic areas have large numbers of soccer parents on Saturday mornings for targeting related ads. Further, these analyses often need to be performed quickly because the properties analyzed are often transient (e.g., evaporating on the order of minutes or hours) and analysts would like to act soon after a situation changes.
While business analysts are often relatively sophisticated consumers of mathematical and statistical models, they are often not trained on how to manipulate large data sets quickly on massively parallel computing architectures, like many MapReduce implementations. Further, many users are accustomed to imperative programming techniques and struggle with more functional programming models, like are often used in MapReduce implementations. As a result, use of MapReduce entails a relatively high level of skill relative to many members of the potential user base.
Moreover, certain existing tools for abstracting MapReduce implementations away from the user, such as the Apache Hive Query Language, are not well suited for certain kinds of data analysis. These tools frequently fail to account for ground-truth relationships between key-value pairs that manifest themselves in many frequent lines of inquiry. For instance, index keys often correspond to geolocations or times, and key-value pairs near one another in time or place are often relevant to a given portion of an analysis performed on a single computing node. Such relationships often give rise to optimizations in allocation of data that make analyses computationally feasible, or at least less expensive in computation resources. But users often struggle to properly allocate input data among nodes in a compute cluster to benefit from such optimizations. As a result, users are often burdened with, or incapable of, configuring compute jobs to account for these realities.