Interconnected global computing systems are generating an enormous amount of irregular, unstructured data. Mining such data for actionable business intelligence can provide an enterprise with a significant competitive advantage. High-productivity programming models that enable programmers to write small pieces of sequential code to analyze massive amounts of data are particularly valuable in mining this data.
Over the last several years, Apache™ Hadoop™ has emerged as an important programming model for processing large data sets. More specifically, Hadoop™ is an open-source, Java™-based software framework that supports the processing of large data sets in a distributed computing environment. Hadoop™ provides for distributed processing of large datasets across clusters of computers using a simple programming model. Hadoop™ can scale up from single servers to thousands of machines, each machine offering local computation and storage.
Hadoop™ includes a storage portion, known as Hadoop Distributed File System (HDFS), and a processing portion called Map Reduce. Map Reduce is a programming model and associated implementation for processing parallelizable problems across large data sets using a large number of computers (nodes). If all nodes are on the same local network and use similar hardware, then these nodes are collectively referred to as a cluster. If the nodes are shared across geographically and administratively distributed systems, and use heterogeneous hardware, these nodes are collectively referred to as a grid. Processing can occur on data stored in a file system (unstructured), or on data stored in a database (structured), or on data stored in any combination of file systems and databases.
Hadoop™ splits files into large blocks and distributes them across nodes in a cluster. In response to receiving a data set that is to be processed, Hadoop™ transfers packaged code for each of a plurality of nodes to perform parallel processing of the data set. Map Reduce can take advantage of the locality of data, processing the data in proximity to the place it is stored in order to reduce the distance over which the data must be transmitted. This data locality allows data sets to be processed faster and more efficiently than would be the case in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.
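The block splitting and data locality described above can be illustrated with a minimal Python sketch. It is not how HDFS or the Hadoop™ scheduler is implemented: the function names, the node names, the tiny block size, and the round-robin placement are all assumptions made for illustration, and replication is ignored.

```python
from itertools import cycle


def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, nodes: list[str]) -> dict[int, str]:
    """Assign each block index to a node, round-robin (a simplifying assumption)."""
    node_cycle = cycle(nodes)
    return {block_id: next(node_cycle) for block_id in range(num_blocks)}


def schedule_tasks(placement: dict[int, str]) -> dict[str, list[int]]:
    """Group block ids by the node storing them, so each task runs next to its data."""
    tasks: dict[str, list[int]] = {}
    for block_id, node in placement.items():
        tasks.setdefault(node, []).append(block_id)
    return tasks


data = b"x" * 10                                   # stand-in for a large file
blocks = split_into_blocks(data, block_size=4)     # block sizes: 4, 4, 2 bytes
placement = place_blocks(len(blocks), ["node-a", "node-b"])
tasks = schedule_tasks(placement)                  # work is sent to the data, not the reverse
print(tasks)
```

In this sketch only the task assignment crosses the network; each node reads the blocks it already holds.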
Map Reduce includes a plurality of mappers for performing filtering and sorting, a plurality of reducers for performing one or more summary operations, and a Map Reduce framework. The mappers and reducers may be implemented using programmer-supplied code. Map Reduce processes a programming problem by specifying one or more mappers for performing each of a plurality of Map operations, as well as one or more reducers for performing each of a plurality of reduce operations. Each of the mappers is configured for receiving a small chunk of data (typically in the form of pairs of (key,value)), and producing a mapper output in the form of zero or more additional key value pairs. Multiple mappers are executed in parallel on all the available data, resulting in a large collection of (key,value) pairs. These pairs are then sorted and shuffled. Moving the mapper outputs to the reducers is referred to as shuffling. The reducer is used to reduce the set of values associated with a given key. Multiple reducers operate in parallel, one for each generated key.
The key value pairs may be retrieved from, and written to, a distributed, resilient file system such as HDFS. A partitioned input key/value (KV) sequence I is operated on by mappers to produce another KV sequence J, which is then sorted and grouped (“shuffled”) into a sequence of pairs of key/list of values. The list of values for each key is then operated upon by a reducer which may contribute zero or more KV pairs to the output sequence. If the involved data sets are large, they are automatically partitioned across multiple nodes and the operations are applied in parallel.
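The data flow just described (input sequence I, mapper output J, shuffle into key/list-of-values pairs, reducer output) can be sketched in a few lines of single-process Python. This is an in-memory illustration only; the function `map_reduce` and the sample mapper and reducer are hypothetical and do not correspond to any Hadoop™ API.

```python
from collections import defaultdict


def map_reduce(inputs, mapper, reducer):
    """Run one map/shuffle/reduce pass over an in-memory KV sequence."""
    # Map: each input pair yields zero or more intermediate pairs (sequence J).
    intermediate = [pair for key, value in inputs for pair in mapper(key, value)]

    # Shuffle: group the intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce: each (key, list of values) may contribute zero or more output pairs.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


def parity_mapper(key, value):
    """Drop negative values (zero outputs); otherwise emit (parity, value)."""
    if value < 0:
        return []
    return [("even" if value % 2 == 0 else "odd", value)]


def sum_reducer(key, values):
    """Summarize all values that share a key."""
    return [(key, sum(values))]


result = map_reduce([(0, 4), (1, 7), (2, -1), (3, 10)], parity_mapper, sum_reducer)
print(result)  # [('even', 14), ('odd', 7)]
```

A real framework would partition `inputs` across nodes and run the map and reduce calls in parallel; the sequential loops here only show the semantics.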
An illustrative example of a Map operation sorts college students by first name into a plurality of queues. Each of the respective first names is assigned to a corresponding queue. An illustrative example of a Reduce operation counts the number of college students in each queue, yielding name frequencies for each of the respective first names. The Map Reduce framework (also referred to herein as a Map Reduce infrastructure or a Map Reduce system) orchestrates parallel processing by marshalling distributed servers, running each of a plurality of tasks in parallel, managing all communication and data transfers between various parts of the system, and providing for redundancy and fault tolerance.
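The student example can be written out as a short Python sketch. The student names and the two partitions are invented for illustration, and a local thread pool merely stands in for the distributed servers that a real Map Reduce framework would marshal; redundancy and fault tolerance are not modeled.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_partition(students):
    """Map: emit (first_name, 1) for every student in one partition."""
    return [(full_name.split()[0], 1) for full_name in students]


def shuffle(mapped_partitions):
    """Route every emitted pair to the queue for its first name."""
    queues = defaultdict(list)
    for pairs in mapped_partitions:
        for first_name, one in pairs:
            queues[first_name].append(one)
    return queues


def reduce_queue(first_name, ones):
    """Reduce: count the students in one queue."""
    return first_name, sum(ones)


# Hypothetical input, already split into two partitions.
partitions = [
    ["Ana Ruiz", "Ben Lee"],
    ["Ana Kim", "Cara Diaz", "Ben Ortiz", "Ana Wu"],
]

# The pool runs one map task per partition in parallel.
with ThreadPoolExecutor() as pool:
    mapped = list(pool.map(map_partition, partitions))

queues = shuffle(mapped)
frequencies = dict(reduce_queue(name, ones) for name, ones in sorted(queues.items()))
print(frequencies)  # {'Ana': 3, 'Ben': 2, 'Cara': 1}
```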
Many software applications exist that are configured for implementing Map Reduce by providing programming or software frameworks or application programming interfaces that allow users to program the aforementioned Map Reduce functionality. Though it is common to implement Map Reduce using Java™ code, any programming language can be used in conjunction with Hadoop™ to implement the map and reduce parts of a user program.
The Map Reduce model is a popular choice for implementing big data analytics. Performing timely and cost-effective analytics with “Big Data” is a key ingredient for success in many business, scientific and engineering endeavors. The execution time of any Map Reduce job depends on more than seventy user-configurable parameters. If these parameters are set inappropriately, a significant decrease in performance may be observed. If the user does not specify parameter settings during job submission, then default values (shipped with the model or specified by a system administrator) are used. Good settings for these parameters depend on job, data, and cluster characteristics. Users often run into performance problems caused by lack of knowledge of these parameters. Many practitioners of big data analytics, including computational scientists, systems researchers, and business analysts, would like to use a system that can tune itself and provide good performance automatically. Unfortunately, the “out of the box” performance of Hadoop™ leaves much to be desired, leading to suboptimal use of resources, time, and money. Many users lack the necessary expertise and inclination to tune the Map Reduce parameters to obtain an acceptable level of performance.
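The fallback from user-specified settings to defaults can be pictured as a layered lookup. The Python sketch below is only an illustration of that precedence, not Hadoop™'s configuration loader. The three property names are examples of Hadoop™ configuration keys, but the values shown are placeholders for this sketch and should be checked against the actual installation.

```python
from collections import ChainMap

# Defaults shipped with the framework (illustrative values).
shipped_defaults = {
    "mapreduce.job.reduces": 1,
    "mapreduce.task.io.sort.mb": 100,
    "mapreduce.task.io.sort.factor": 10,
}

# Cluster-wide values chosen by a system administrator (assumed for this sketch).
admin_defaults = {"mapreduce.task.io.sort.mb": 256}

# Settings supplied by the user at job submission (assumed for this sketch).
job_settings = {"mapreduce.job.reduces": 8}

# Lookup order: job settings first, then administrator values, then shipped defaults.
effective = ChainMap(job_settings, admin_defaults, shipped_defaults)

for name in sorted(shipped_defaults):
    print(name, "=", effective[name])
```

Any parameter the user leaves unset silently takes an inherited value, which is how a job can end up running with settings that do not suit its data or cluster.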
Map Reduce job performance tuning has become an important topic for researchers to explore. Several approaches have been formulated for automatically determining values for a plurality of Map Reduce parameters. These approaches use dynamic tuning, static tuning, or various combinations of static and dynamic tuning. Dynamic tuning requires instrumenting and modifying Hadoop™ source code to collect dynamic run-time statistics. These statistics are then used to build a performance model for guiding performance tuning.
Starfish has been proposed as a self-tuning tool for improving MapReduce job performance through a combination of static and dynamic program analysis. A cost-based optimization approach is utilized. Starfish operates in two phases: first, profiling a standard workload to gather information; and, second, analyzing the profile to create a set of optimized parameters and executing the result as a new workload. The goal of Starfish is not to achieve the maximum level of peak performance that would be obtainable in the context of a manually-tuned system. Regular Starfish users may rarely see performance close to this peak. Rather, the goal of Starfish is to enable Hadoop™ users and applications to obtain acceptable performance automatically throughout a data lifecycle, without any need for the user to understand and manipulate the many tuning knobs available.
Starfish-based optimization methods are time-consuming and not cost-effective. In an effort to overcome the shortcomings of Starfish, an MRONLINE model has been developed to support online performance tuning by means of an efficient hill-climbing algorithm. This algorithm provides a real-time performance monitor and a dynamic configuration. Unfortunately, dynamic tuning models such as Starfish and MRONLINE require users to understand the specific internal workings of an application and customize the tuning based on these specifics. This level of understanding and customization is impossible in many cases. Moreover, it is necessary for the user to collect various statistics for numerous runs from a set of runtime log files. The process of collecting these statistics can be very laborious and time-consuming.
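For readers unfamiliar with hill climbing, the generic technique can be sketched as follows. This is not MRONLINE's algorithm: the two tuned parameters, their bounds and step sizes, and the synthetic cost function are all assumptions chosen for illustration. In an online tuner the cost would instead come from measurements of a running job.

```python
Config = tuple[int, int]  # (sort buffer in MB, number of reducers), both hypothetical


def neighbors(config: Config) -> list[Config]:
    """Configurations one step away, restricted to assumed parameter bounds."""
    sort_mb, reducers = config
    candidates = [
        (sort_mb + 50, reducers),
        (sort_mb - 50, reducers),
        (sort_mb, reducers + 1),
        (sort_mb, reducers - 1),
    ]
    return [(s, r) for s, r in candidates if 50 <= s <= 400 and 1 <= r <= 16]


def synthetic_cost(config: Config) -> float:
    """Stand-in for a measured job runtime; lowest at (200, 6) by construction."""
    sort_mb, reducers = config
    return (sort_mb - 200) ** 2 / 100 + (reducers - 6) ** 2


def hill_climb(start: Config, cost, max_steps: int = 100):
    """Move to the cheapest neighbor until no neighbor improves on the current cost."""
    current, current_cost = start, cost(start)
    for _ in range(max_steps):
        best = min(neighbors(current), key=cost)
        best_cost = cost(best)
        if best_cost >= current_cost:
            break  # local optimum reached
        current, current_cost = best, best_cost
    return current, current_cost


best_config, best_cost = hill_climb((100, 1), synthetic_cost)
print(best_config, best_cost)  # (200, 6) 0.0
```

Hill climbing stops at the first local optimum it reaches; the synthetic cost here has a single minimum, which a measured runtime surface need not have.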
Static tuning models capture relationships between tunable parameters and execution characteristics. One illustrative example of a static tuning model is MRTuner, which uses a Producer-Transporter-Consumer (PTC) cost model to characterize one or more tradeoffs in Map Reduce parallel execution. While formulating a Map Reduce job execution plan in accordance with the PTC model, it is necessary to ensure that the generation of Map outputs by the Producer, the transportation of Map outputs by the Transporter, and the consumption of Map outputs by the Consumer all keep pace with one another. MRTuner provides this functionality by using a Catalog Manager (CM) and a Job Optimizer (JBO). The CM is configured for building and managing a catalog for historical jobs, data, and system resources. Statistics in the catalog are collected by a job profiler, a data profiler and a system profiler. To optimize a new Map Reduce job, the JBO calls the CM to find a previous job profile in the catalog that is most similar to the new Map Reduce job to be optimized, as well as related data and system information for the most similar job. Based upon the related data and system information, an estimation process is performed to generate a profile and a plurality of potential execution plans for the new Map Reduce job. Then, the JBO estimates the running time of potential execution plans to identify an optimal execution plan.
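The catalog lookup and plan selection described above can be sketched in Python. This is a loose illustration and not MRTuner's actual model: the profile fields, the feature vectors, the Euclidean similarity measure, and the rule that a plan's runtime is set by its slowest stage are all assumptions introduced here to show why the three stages must keep pace.

```python
import math

# Hypothetical catalog of historical job profiles.
catalog = [
    {"name": "job-a", "features": (10.0, 2.0), "map_output_mb": 1000.0,
     "map_mb_per_s": 10.0, "reduce_mb_per_s": 20.0, "network_mb_per_s": 100.0},
    {"name": "job-b", "features": (100.0, 8.0), "map_output_mb": 8000.0,
     "map_mb_per_s": 5.0, "reduce_mb_per_s": 10.0, "network_mb_per_s": 100.0},
]


def most_similar(profiles, features):
    """Pick the stored profile whose feature vector is closest to the new job's."""
    return min(profiles, key=lambda p: math.dist(p["features"], features))


def estimate_runtime(profile, plan):
    """Assumed cost rule: the pipeline runs at the speed of its slowest stage."""
    produce = profile["map_output_mb"] / (plan["map_slots"] * profile["map_mb_per_s"])
    transport = profile["map_output_mb"] / profile["network_mb_per_s"]
    consume = profile["map_output_mb"] / (plan["reduce_slots"] * profile["reduce_mb_per_s"])
    return max(produce, transport, consume)


# Candidate execution plans for the new job (hypothetical).
plans = [
    {"map_slots": 4, "reduce_slots": 1},  # consumer lags: 25 s, 10 s, 50 s
    {"map_slots": 4, "reduce_slots": 2},  # stages balanced: 25 s, 10 s, 25 s
    {"map_slots": 8, "reduce_slots": 1},  # extra mappers wasted: 12.5 s, 10 s, 50 s
]

profile = most_similar(catalog, features=(12.0, 2.5))
best_plan = min(plans, key=lambda plan: estimate_runtime(profile, plan))
print(profile["name"], best_plan, estimate_runtime(profile, best_plan))
```

Under this assumed rule, adding map slots without adding reduce slots does not shorten the estimate, because the Consumer stage remains the bottleneck.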
Static tuning models such as MRTuner consider execution characteristics. An optimal execution plan is formulated by identifying a similar job profile from the catalog. However, static tuning models such as MRTuner do not consider supply-demand relationships based on resource availability. No mechanism is provided for formalizing resource-based supply-demand relations and then performing optimization with default constraints on the parameters.